Machine Learning Foundations · Lesson 29 of 70
How to Balance Bias and Variance in Practice
The Practical Goal
You can't eliminate the bias-variance tradeoff, but you can find the point where expected error (bias² + variance + irreducible noise) is minimized. In practice, this means:
- Diagnose which problem you have (high bias or high variance)
- Apply the right fix
- Measure the result
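To make the decomposition concrete, the sketch below estimates bias² and variance empirically for polynomial fits of increasing degree. Everything in it is an illustrative assumption rather than part of the lesson: the target function sin(2πx), the noise level, the sample sizes, and the helper name `estimate_bias_variance`.

```python
import numpy as np
from numpy.polynomial import Polynomial


def estimate_bias_variance(degree, n_train=30, n_repeats=200, noise=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit over many noisy training sets.

    Hypothetical helper for illustration: the target sin(2*pi*x) and all
    default values are assumptions, not taken from the lesson.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed design; only the noise is resampled
    x_test = np.linspace(0.05, 0.95, 50)
    f_test = np.sin(2 * np.pi * x_test)        # noise-free target on the test grid

    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw fresh label noise for each repetition and refit
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise, n_train)
        preds[r] = Polynomial.fit(x_train, y_train, degree)(x_test)

    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_test) ** 2)  # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))         # prediction variance, averaged over x
    return bias_sq, variance


results = {d: estimate_bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}, sum={b + v:.4f}")
```

In a run like this, the straight line should show the largest bias² and the degree-9 fit the largest variance; the degree with the smallest sum is the empirical sweet spot for this particular setup.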
Diagnosis First
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score


def diagnose_bias_variance(model, X_train, y_train, X_val, y_val) -> str:
    """Quick diagnosis based on train/val gap and absolute performance."""
    # Note: X_val and y_val are accepted but not used here. Validation
    # accuracy is estimated by cross-validation on the training set, which
    # also gives a standard deviation across folds.

    # Fit the model on the full training data
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))

    # Cross-validate to estimate validation performance and its spread.
    # cross_val_score clones the model, so the fit above is left untouched.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    val_acc = cv_scores.mean()
    val_std = cv_scores.std()
    gap = train_acc - val_acc

    # The thresholds below (0.70, 0.12, 0.05) are heuristics; tune them per problem
    if train_acc < 0.70:
        return f"HIGH BIAS (underfitting): training accuracy {train_acc:.1%} is too low"
    elif gap > 0.12:
        return f"HIGH VARIANCE (overfitting): train {train_acc:.1%}, val {val_acc:.1%}, gap={gap:.1%}"
    elif val_std > 0.05:
        return f"HIGH VARIANCE: unstable across folds (std={val_std:.1%})"
    else:
        return f"GOOD FIT: train {train_acc:.1%}, val {val_acc:.1%}, gap={gap:.1%}"

If You Have High Bias
The model is too simple. Increase its capacity.
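As a small illustration of what "increase its capacity" buys, the sketch below fits a linear and a polynomial logistic regression to a synthetic two-moons dataset and compares training accuracy. The dataset, the degree, and the variable names are assumptions chosen for the example, not part of the lesson.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a curved class boundary (assumption for illustration)
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=0)

# Low-capacity model: a straight-line decision boundary
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Higher-capacity model: same classifier on degree-3 polynomial features
cubic = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

linear_train_acc = linear.fit(X_moons, y_moons).score(X_moons, y_moons)
cubic_train_acc = cubic.fit(X_moons, y_moons).score(X_moons, y_moons)

print(f"linear train accuracy: {linear_train_acc:.3f}")
print(f"cubic  train accuracy: {cubic_train_acc:.3f}")
```

A rise in training accuracy indicates that the extra capacity reduced bias. Whether it also helps on unseen data still has to be checked with cross-validation.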
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# X_train and y_train are assumed to be defined already.

# Step 1: Make the model more expressive
models_in_order = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Logistic + Poly Features", Pipeline([
        ("poly", PolynomialFeatures(2)),
        ("lr", LogisticRegression(max_iter=1000)),
    ])),
    ("Decision Tree d=5", DecisionTreeClassifier(max_depth=5)),
    ("Gradient Boosting", GradientBoostingClassifier(max_depth=3)),
]
for name, model in models_in_order:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
# Stop when validation accuracy stops improving

# Step 2: Reduce regularization
# In scikit-learn, a larger C means weaker regularization
less_regularized = LogisticRegression(C=100, max_iter=1000)  # More permissive (was C=0.01)

If You Have High Variance
The model is too complex. Constrain it.
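To see what constraining a model does to the train/validation gap, the sketch below compares an unconstrained decision tree with a depth-limited one on synthetic noisy data. The dataset parameters and the helper name `train_val_gap` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (assumption for illustration)
X_noisy, y_noisy = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)


def train_val_gap(model):
    """Return (train accuracy, mean CV accuracy, gap) for a model. Hypothetical helper."""
    val_acc = cross_val_score(model, X_noisy, y_noisy, cv=5, scoring="accuracy").mean()
    train_acc = model.fit(X_noisy, y_noisy).score(X_noisy, y_noisy)
    return train_acc, val_acc, train_acc - val_acc


full = train_val_gap(DecisionTreeClassifier(random_state=0))
constrained = train_val_gap(
    DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)
)

for name, (tr, va, gap) in [("unconstrained", full), ("constrained", constrained)]:
    print(f"{name:13}: train={tr:.3f}, val={va:.3f}, gap={gap:.3f}")
```

The unconstrained tree memorizes the training set, so its gap is the one to watch. The constrained tree trades some training accuracy for a smaller gap.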
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Fix 1: Reduce model capacity
constrained_tree = DecisionTreeClassifier(
    max_depth=5,           # Cap depth
    min_samples_leaf=10,   # Require larger leaf nodes
    min_samples_split=20,  # Require more samples before splitting a node
)

# Fix 2: Add regularization
lr_l2 = LogisticRegression(C=0.01)  # Strong L2 regularization

# Fix 3: Use an ensemble (reduces variance without increasing bias much)
rf = RandomForestClassifier(n_estimators=200, max_depth=6, min_samples_leaf=5)
# Bagging averages out individual trees' variance

# Fix 4: More training data
# → Variance decreases as n increases, even without changing the model

# Fix 5: Feature selection (remove noisy, irrelevant features)
selector = SelectKBest(f_classif, k=15)  # Keep 15 most informative features

Validation Curves: Finding the Sweet Spot
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# How does performance change as a hyperparameter changes?
# X and y are assumed to be the feature matrix and labels.
# max_depth=None (unlimited depth) is left out so every depth prints as a number.
depths = [1, 2, 3, 5, 7, 10, 15]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,  # the keyword is param_range, not param_values
    cv=5,
    scoring="accuracy",
)
for i, depth in enumerate(depths):
    train_mean = train_scores[i].mean()
    val_mean = val_scores[i].mean()
    print(f"depth={depth:3}: train={train_mean:.3f}, "
          f"val={val_mean:.3f}, gap={train_mean - val_mean:.3f}")
# Typical pattern (the exact depths depend on the data):
# depth=1: high bias, val is low
# depth=3: sweet spot, val is highest
# depth=10: high variance, large gap

Regularization Path
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X and y are assumed to be the feature matrix and binary labels.
Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
results = []
for C in Cs:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results.append({
        "C": C,
        "val_auc": scores.mean(),
        "val_std": scores.std(),
    })
    print(f"C={C:6}: AUC={scores.mean():.3f} ± {scores.std():.3f}")

# Best C: highest mean AUC
best = max(results, key=lambda r: r["val_auc"])
print(f"\nBest C: {best['C']} (AUC: {best['val_auc']:.3f})")

The Step-by-Step Decision Framework
Step 1: Establish a baseline
→ Train a simple model (logistic regression, decision tree d=3)
→ If val score is acceptable, done
Step 2: Is training accuracy low? (High Bias)
→ Try: more complex model, polynomial features, less regularization
→ If training accuracy improves but val doesn't → high variance
Step 3: Is there a large train/val gap? (High Variance)
→ Try: regularization, reduce max_depth, get more data, use ensemble
→ Measure: does the gap shrink while keeping val accuracy acceptable?
Step 4: If both are bad
→ Collect more data (the most reliable fix for variance; it does not reduce bias by itself)
→ Try a fundamentally different model class
Step 5: Tune with cross-validation
→ Use validation curves or GridSearchCV with CV
→ Pick parameters that maximize mean CV score with acceptable std

The Role of Data Size
# More data shifts the bias-variance curve:
# - High-variance models become more stable
# - Optimal complexity shifts toward more complex models
# - Irreducible noise stays the same
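One way to check the first point on your own data is a learning curve: train on growing subsets and watch the train/validation gap. The sketch below uses scikit-learn's `learning_curve` on synthetic data; the dataset, the subset fractions, and the choice of an unconstrained decision tree as the high-variance model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration)
X_lc, y_lc = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.05, random_state=0
)

# An unconstrained tree is a deliberately high-variance model
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X_lc, y_lc,
    train_sizes=np.array([0.05, 0.2, 0.5, 1.0]),  # fractions of each training fold
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean

for n, tr, va, gap in zip(sizes, train_mean, val_mean, gaps):
    print(f"n={n:5d}: train={tr:.3f}, val={va:.3f}, gap={gap:.3f}")
```

If the gap is still shrinking at the largest size, more data is likely to help. If the curves have flattened, changing the model is the better bet.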
# Rule of thumb for tabular data:
# < 100 samples: simple models (logistic regression, shallow trees)
# 100-10K samples: gradient boosting, random forest
# > 10K samples: deep neural networks become viable
# Rule of thumb for neural networks:
# < 10K: fine-tune a pre-trained model, don't train from scratch
# > 100K: train from scratch with sufficient regularization

Interview Answer Template
Q: How do you balance bias and variance in practice?
First, I diagnose: is the model underfitting (high bias — training accuracy is low) or overfitting (high variance — large train/val gap)? For high bias, I increase model capacity — try a more complex model, add polynomial features, or reduce regularization. For high variance, I regularize more aggressively, reduce model depth, use ensembles (Random Forest averages out variance), or collect more training data. The key tool is the validation curve: plot performance vs a complexity parameter (like max_depth or regularization strength) to find the sweet spot where validation accuracy is highest. I always measure using cross-validation rather than a single split to get a stable estimate with uncertainty. More data is the most reliable fix for high variance but isn't always available — regularization and ensembles are the practical alternatives.