What is Bias in Machine Learning?
Understand bias in ML: systematic error from wrong assumptions, underfitting, high-bias models, sources of algorithmic bias, and the difference between statistical bias and societal bias.
Two Meanings of Bias
In ML, "bias" has two distinct meanings. Interviewers may mean either, so clarify which one they're asking about.
- Statistical bias (bias-variance tradeoff): systematic error from a model that's too simple
- Algorithmic/societal bias: unfair predictions that disadvantage certain groups
Statistical Bias: Error from Wrong Assumptions
Statistical bias is the systematic error introduced by a model's assumptions about the data. A high-bias model consistently makes the same type of mistake because it can't capture the true pattern.
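One way to make this definition concrete is to refit the same model on many freshly sampled training sets and compare its average prediction to the truth. The sketch below assumes a hypothetical ground-truth function `true_fn`, a noise level, and sample sizes chosen only for illustration, and it uses `np.polyfit` as a stand-in for any polynomial regression.

```python
import numpy as np


def true_fn(x):
    """Hypothetical ground truth, chosen only for illustration."""
    return 2.0 * x ** 2


def bias_and_variance(degree, n_repeats=200, n_train=50, noise_sd=0.5, seed=0):
    """Estimate mean squared bias and mean variance of a polynomial fit.

    Each repeat draws a fresh training set, fits a polynomial of the given
    degree, and records its predictions on a fixed grid. Bias compares the
    average prediction to the truth; variance measures how much predictions
    move from one training set to the next.
    """
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-3, 3, 25)
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        x = rng.uniform(-3, 3, size=n_train)
        y = true_fn(x) + rng.normal(0, noise_sd, size=n_train)
        coefs = np.polyfit(x, y, deg=degree)
        preds[i] = np.polyval(coefs, x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


linear_bias_sq, linear_var = bias_and_variance(degree=1)
quad_bias_sq, quad_var = bias_and_variance(degree=2)
print(f"Linear:    bias^2 = {linear_bias_sq:.3f}, variance = {linear_var:.3f}")
print(f"Quadratic: bias^2 = {quad_bias_sq:.3f}, variance = {quad_var:.3f}")
```

Under these assumptions the linear fit keeps a large squared bias no matter how many training sets are averaged, while the quadratic fit's squared bias is close to zero. More data does not fix a wrong assumption.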
True pattern: data has a curved (non-linear) relationship
High-bias model: linear regression that assumes straight-line relationship
The model's errors follow a predictable pattern instead of scattering randomly around zero:
y_true = 3x² + noise
y_predicted = 5x + 2 (linear approximation)
At x=1: true=3, pred=7 → overestimates
At x=2: true=12, pred=12 → coincidentally correct
At x=3: true=27, pred=17 → underestimates
The error isn't random: its sign and size depend on x in a way the model cannot correct. That is bias.
High Bias = Underfitting
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# True relationship: quadratic (seeded so the numbers are reproducible)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y_true = 2 * X.flatten() ** 2 + rng.normal(0, 0.5, size=200)

# Shuffle before splitting: X is sorted, so slicing off the last 50 points
# would test extrapolation to x > 1.5 rather than quality of fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.25, random_state=0
)

# High bias: linear model for quadratic data
linear = LinearRegression()
linear.fit(X_train, y_train)
mse_linear = mean_squared_error(y_test, linear.predict(X_test))
print(f"Linear (high bias): Test MSE = {mse_linear:.2f}")  # High error

# Low bias: polynomial model
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
quadratic = LinearRegression()
quadratic.fit(X_train_poly, y_train)
mse_poly = mean_squared_error(y_test, quadratic.predict(X_test_poly))
print(f"Quadratic (low bias): Test MSE = {mse_poly:.2f}")  # Much lower
```
Measuring Bias
Training error is a practical proxy for bias: a high-bias model fails to fit even the data it trained on.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def estimate_bias(model, X_train, y_train) -> float:
    """
    Training MSE as a rough proxy for bias.
    A model that cannot fit its own training data is too rigid (high bias).
    This is not the formal bias term, which compares the average prediction
    over many training sets to the true function.
    """
    y_pred = model.predict(X_train)
    return mean_squared_error(y_train, y_pred)


# Compare bias for different model types (X_train, y_train from the previous example)
models = {
    "Linear": LinearRegression(),
    "Degree-2": Pipeline([("poly", PolynomialFeatures(2)), ("lr", LinearRegression())]),
    "Degree-5": Pipeline([("poly", PolynomialFeatures(5)), ("lr", LinearRegression())]),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    bias = estimate_bias(model, X_train, y_train)
    print(f"{name}: training MSE (proxy for bias) = {bias:.4f}")
# Linear: high (poor fit to training data)
# Degree-2: low (good fit)
# Degree-5: slightly lower still (low bias, but the extra flexibility adds variance)
```
Sources of High Bias in ML Models
| Source | Example |
|---|---|
| Wrong model family | Linear model for non-linear data |
| Too few features | Missing key predictors |
| Too much regularization | Lasso zeros out predictive features |
| Too shallow model | Decision tree max_depth=1 for complex boundary |
| Bad feature engineering | Using raw timestamps instead of time-since-admission |
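The regularization row can be illustrated with a small experiment. This is a sketch under stated assumptions: the data is synthetic, the coefficients (3.0, 1.0, 0.3) and the `alpha` values are invented for illustration, and scikit-learn's `Lasso` is used with its default settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Hypothetical target: all three features carry signal, with decreasing weight
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.1, size=300)

results = {}
for alpha in (0.01, 0.5, 2.0):
    model = Lasso(alpha=alpha).fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    n_zeroed = int(np.sum(model.coef_ == 0))  # features the penalty removed
    results[alpha] = (train_mse, n_zeroed)
    print(f"alpha={alpha}: training MSE={train_mse:.3f}, zeroed coefficients={n_zeroed}")
```

As `alpha` grows, the penalty removes features that carry real signal and training error rises. That combination is the signature of regularization-induced bias, not noise.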
Algorithmic / Societal Bias
Distinct from statistical bias: this is when a model learns to make predictions that are systematically unfair to a protected group.
```python
import numpy as np

# Example: loan default prediction model
# Training data: historical approvals, which had racial bias
# Model learns a spurious link between race and default, often indirectly
# through zip code, which correlates with race due to redlining history
# Result: model denies loans at higher rates to minority applicants
# even for applicants with identical credit scores


# Detection: check fairness metrics
def measure_demographic_parity(y_pred, sensitive_attribute) -> dict:
    """
    Demographic parity: equal positive prediction rates across groups.
    Returns the positive prediction (approval) rate for each group.
    """
    y_pred = np.asarray(y_pred)
    sensitive_attribute = np.asarray(sensitive_attribute)
    groups = {}
    for group in np.unique(sensitive_attribute):
        mask = sensitive_attribute == group
        approval_rate = y_pred[mask].mean()
        groups[str(group)] = round(float(approval_rate), 3)
    return groups

# Ideally, approval rates should be similar across demographic groups
```
Statistical Bias vs Societal Bias: Key Differences
| Aspect | Statistical Bias | Societal/Algorithmic Bias |
|---|---|---|
| Cause | Model too simple | Training data reflects historical inequity |
| Affects | All predictions systematically | Specific demographic groups |
| Fix | More complex model, better features | Fairness constraints, debiased data, different metrics |
| Detectable by | High training error | Disparate impact metrics (demographic parity, equalized odds) |
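Equalized odds, the second metric named in the table, compares error rates across groups instead of raw approval rates: the true positive rate and the false positive rate should both be similar across groups. A minimal sketch, assuming binary 0/1 labels and predictions; the function name `equalized_odds_rates` and the toy data are hypothetical.

```python
import numpy as np


def equalized_odds_rates(y_true, y_pred, sensitive_attribute) -> dict:
    """Per-group true positive rate (TPR) and false positive rate (FPR)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitive_attribute = np.asarray(sensitive_attribute)
    rates = {}
    for group in np.unique(sensitive_attribute):
        mask = sensitive_attribute == group
        positives = mask & (y_true == 1)
        negatives = mask & (y_true == 0)
        # NaN marks a group with no actual positives (or negatives) to evaluate
        tpr = float(y_pred[positives].mean()) if positives.any() else float("nan")
        fpr = float(y_pred[negatives].mean()) if negatives.any() else float("nan")
        rates[str(group)] = {"tpr": round(tpr, 3), "fpr": round(fpr, 3)}
    return rates


# Toy data, invented for illustration only
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
rates = equalized_odds_rates(y_true, y_pred, group)
print(rates)
```

In this toy data both groups receive positive predictions at the same rate (2 of 4), so demographic parity holds, yet group B has a lower TPR and a higher FPR than group A. The two metrics can disagree, which is why it helps to report both.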
Reducing Statistical Bias
- Use a more complex model: allow more flexibility
- Add features: give the model access to the signal it's missing
- Reduce regularization: the penalty is squeezing out predictive capacity
- Feature engineering: create better representations
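The feature-related fixes above can be sketched on synthetic data. The example assumes a hypothetical target that depends on the product of two inputs, so a linear model on the raw inputs cannot represent it until the interaction is added as a feature. The coefficients and noise level are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=400)
x2 = rng.uniform(-2, 2, size=400)
# Hypothetical target that depends on the product of the two inputs
y = 1.5 * x1 + 2.0 * x1 * x2 + rng.normal(0, 0.2, size=400)

X_raw = np.column_stack([x1, x2])
X_engineered = np.column_stack([x1, x2, x1 * x2])  # add the interaction term

mse_raw = mean_squared_error(y, LinearRegression().fit(X_raw, y).predict(X_raw))
mse_engineered = mean_squared_error(
    y, LinearRegression().fit(X_engineered, y).predict(X_engineered)
)
print(f"Raw features:        training MSE = {mse_raw:.3f}")
print(f"With x1*x2 feature:  training MSE = {mse_engineered:.3f}")
```

The model family is the same in both fits. Only the representation changed, and that is enough to remove most of the systematic error in this setup.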
Interview Answer Template
Q: What is bias in machine learning?
Bias has two distinct meanings. Statistical bias (from the bias-variance tradeoff) is systematic error from a model that's too simple: it makes the same kind of mistake consistently because it can't capture the true pattern. A linear model trained on quadratic data is a classic example of high bias (underfitting). The fix is to add model complexity or better features. Societal/algorithmic bias is a separate concept: a model learns to make unfair predictions that disadvantage protected groups, usually because training data reflects historical discrimination. Detection requires fairness metrics like demographic parity or equalized odds. When an interviewer says "bias," I'd clarify which meaning they intend, because they're related but have very different causes and fixes.