How to Balance Bias and Variance in Practice

The Practical Goal

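You can't eliminate the bias-variance tradeoff, but you can find the point where total expected error (bias² + variance + irreducible noise) is minimized. The noise term is fixed by the data, so the only parts you can trade against each other are bias and variance. In practice, this means: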

  1. Diagnose which problem you have (high bias or high variance)
  2. Apply the right fix
  3. Measure the result
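
To make the bias² and variance terms concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the sine target `true_fn`, the noise level, the sample sizes, and the helper name `bias_variance` are hypothetical and not part of the lesson's dataset. It refits polynomials of different degrees on many resampled training sets and estimates both terms at fixed test points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth function, chosen only for the demo
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)   # fixed evaluation points
noise_sd = 0.3                         # assumed noise level
n_train, n_repeats = 30, 200           # assumed sample size and number of resamples

def bias_variance(degree):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise_sd, n_train)
        preds[r] = Polynomial.fit(x, y, deg=degree)(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)   # error of the average model
    variance = np.mean(preds.var(axis=0))                   # spread across training sets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 12)}
for d, (b, v) in results.items():
    print(f"degree={d:2}: bias^2={b:.3f}, variance={v:.3f}, sum={b + v:.3f}")
```

In this setup the straight line should be dominated by bias, the degree-12 fit by variance, and the cubic should sit near the minimum of the sum. Exact numbers depend on the seed and the assumed noise level.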

Diagnosis First

Python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

def diagnose_bias_variance(model, X_train, y_train) -> str:
    """Quick diagnosis based on train/val gap and absolute performance.

    The thresholds below (0.70, 0.12, 0.05) are illustrative; tune them per task.
    """
    # Fit on the full training data and measure accuracy on that same data
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))

    # Cross-validate on the training data to estimate validation accuracy and its spread
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    val_acc = cv_scores.mean()
    val_std = cv_scores.std()

    gap = train_acc - val_acc

    if train_acc < 0.70:
        return f"HIGH BIAS (underfitting): training accuracy {train_acc:.1%} is too low"
    elif gap > 0.12:
        return f"HIGH VARIANCE (overfitting): train {train_acc:.1%}, val {val_acc:.1%}, gap={gap:.1%}"
    elif val_std > 0.05:
        return f"HIGH VARIANCE: unstable across folds (std={val_std:.1%})"
    else:
        return f"GOOD FIT: train {train_acc:.1%}, val {val_acc:.1%}, gap={gap:.1%}"
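
One caveat with a diagnosis like the one above: the training accuracy comes from a fit on all training rows, while each validation score comes from a model trained on only part of them. A possible alternative, sketched here under assumptions (synthetic data from `make_classification`, a hypothetical helper named `train_val_gap`), is to take both numbers from the same folds with `cross_validate(..., return_train_score=True)`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own X_train, y_train
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

def train_val_gap(model, X, y, n_splits=5):
    """Return (train_acc, val_acc, gap), all averaged over the same CV folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    out = cross_validate(model, X, y, cv=cv, scoring="accuracy", return_train_score=True)
    train_acc = out["train_score"].mean()
    val_acc = out["test_score"].mean()
    return train_acc, val_acc, train_acc - val_acc

# An unconstrained tree versus a very shallow one
deep = train_val_gap(DecisionTreeClassifier(random_state=0), X_demo, y_demo)
shallow = train_val_gap(DecisionTreeClassifier(max_depth=2, random_state=0), X_demo, y_demo)

print(f"deep:    train={deep[0]:.3f}, val={deep[1]:.3f}, gap={deep[2]:.3f}")
print(f"shallow: train={shallow[0]:.3f}, val={shallow[1]:.3f}, gap={shallow[2]:.3f}")
```

The unconstrained tree should memorize its training folds and show the larger gap, which is the high-variance signature described above.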

If You Have High Bias

The model is too simple. Increase its capacity.

Python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train and y_train are already defined.

# Step 1: Make the model more expressive (ordered from simplest to most flexible)
models_in_order = [
    ("Logistic Regression",      LogisticRegression(max_iter=1000)),
    ("Logistic + Poly Features", Pipeline([
        ("poly", PolynomialFeatures(2)),
        ("lr",   LogisticRegression(max_iter=1000)),
    ])),
    ("Decision Tree d=5",        DecisionTreeClassifier(max_depth=5)),
    ("Gradient Boosting",        GradientBoostingClassifier(max_depth=3)),
]

for name, model in models_in_order:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
# Stop moving down the list once validation accuracy stops improving

# Step 2: Reduce regularization
# In scikit-learn, C is the inverse regularization strength: larger C = weaker penalty
weakly_regularized = LogisticRegression(C=100, max_iter=1000)   # More permissive (was C=0.01)
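
As a small illustration of the capacity fix, the sketch below uses a synthetic two-ring dataset (`make_circles`, an assumption chosen because no straight line can separate it) and compares plain logistic regression against the same model with degree-2 polynomial features. The variable names are hypothetical.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two concentric rings: a linear boundary cannot separate them
X_c, y_c = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)

linear = LogisticRegression(max_iter=1000)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds x1^2, x1*x2, x2^2
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

linear_acc = cross_val_score(linear, X_c, y_c, cv=5, scoring="accuracy").mean()
quadratic_acc = cross_val_score(quadratic, X_c, y_c, cv=5, scoring="accuracy").mean()

print(f"linear:    {linear_acc:.3f}")
print(f"quadratic: {quadratic_acc:.3f}")
```

The linear model should stay near chance however it is tuned, which is what high bias looks like; adding the squared terms gives it the capacity to draw a circular boundary.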

If You Have High Variance

The model is too complex. Constrain it.

Python
from sklearn.ensemble import RandomForestClassifier

# Fix 1: Reduce model capacity
from sklearn.tree import DecisionTreeClassifier
constrained_tree = DecisionTreeClassifier(
    max_depth=5,           # Cap depth
    min_samples_leaf=10,   # Require larger leaf nodes
    min_samples_split=20,
)

# Fix 2: Add regularization
from sklearn.linear_model import LogisticRegression
lr_l2 = LogisticRegression(C=0.01)   # Strong L2 regularization

# Fix 3: Use an ensemble (reduces variance without increasing bias much)
rf = RandomForestClassifier(n_estimators=200, max_depth=6, min_samples_leaf=5)
# Bagging averages out individual trees' variance

# Fix 4: More training data
# → Variance decreases as n increases, even without changing the model

# Fix 5: Feature selection — remove noisy, irrelevant features
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=15)   # Keep 15 most informative features
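
When feature selection is used as a variance fix, it is usually safer to put the selector inside a `Pipeline` so it is refit on each training fold; selecting features on the full dataset first would leak validation information into the score. A minimal sketch, assuming synthetic data with 5 informative and 95 noise features (the dataset and variable names are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 features, only 5 carry signal
X_fs, y_fs = make_classification(
    n_samples=300, n_features=100, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

selected = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=15)),   # refit inside every CV fold
    ("clf", LogisticRegression(max_iter=1000)),
])

selected_scores = cross_val_score(selected, X_fs, y_fs, cv=5, scoring="accuracy")
print(f"with selection: {selected_scores.mean():.3f} ± {selected_scores.std():.3f}")

# After a final fit, inspect how many features were kept
selected.fit(X_fs, y_fs)
n_kept = int(selected.named_steps["select"].get_support().sum())
print(f"features kept: {n_kept}")
```

Whether selection actually improves the validation score depends on the data, so compare against the same pipeline without the `select` step before adopting it.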

Validation Curves: Finding the Sweet Spot

Python
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumes X and y are already defined.
# How does performance change as a hyperparameter changes?
param_range = [1, 2, 3, 5, 7, 10, 15]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Each row of the score arrays corresponds to one value in param_range
for i, depth in enumerate(param_range):
    print(f"depth={depth:3}: train={train_scores[i].mean():.3f}, "
          f"val={val_scores[i].mean():.3f}, gap={train_scores[i].mean()-val_scores[i].mean():.3f}")

# Typical pattern (illustrative, not measured output):
# depth=1:  high bias, val is low
# depth=3:  sweet spot, val is highest
# depth=10: high variance, large gap
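
Reading the sweet spot off a validation curve can also be done programmatically. The sketch below is self-contained and runs on synthetic data (an assumption; the `_vc` names are hypothetical). It averages each row of the score arrays over the folds and picks the depth with the highest mean validation accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X_vc, y_vc = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
depths = [1, 2, 3, 5, 7, 10, 15]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X_vc, y_vc,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Both arrays have shape (n_depths, n_folds); average over folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean

best_depth = depths[int(np.argmax(val_mean))]
print(f"best max_depth by mean validation accuracy: {best_depth}")
print(f"gap at depth {depths[0]}: {gaps[0]:.3f}, gap at depth {depths[-1]}: {gaps[-1]:.3f}")
```

The gap should widen as depth grows, while validation accuracy peaks somewhere in between.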

Regularization Path

Python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
results = []

for C in Cs:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results.append({
        "C": C,
        "val_auc": scores.mean(),
        "val_std": scores.std(),
    })
    print(f"C={C:6}: AUC={scores.mean():.3f} ± {scores.std():.3f}")

# Best C: highest mean AUC
best = max(results, key=lambda r: r["val_auc"])
print(f"\nBest C: {best['C']} (AUC: {best['val_auc']:.3f})")
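
Picking the single highest mean can favor a weakly regularized model whose lead is within the noise of the folds. One common alternative, named plainly here because the lesson's loop does not use it, is the one-standard-error rule: choose the most regularized `C` whose mean score is within one standard error of the best. A sketch on synthetic data (the dataset and the `_reg` names are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_reg, y_reg = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)

Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]   # ascending: strongest regularization first
means, sems = [], []
for C in Cs:
    scores = cross_val_score(
        LogisticRegression(C=C, max_iter=2000), X_reg, y_reg, cv=5, scoring="roc_auc"
    )
    means.append(scores.mean())
    sems.append(scores.std(ddof=1) / np.sqrt(len(scores)))   # standard error of the mean
means, sems = np.array(means), np.array(sems)

best_idx = int(np.argmax(means))
threshold = means[best_idx] - sems[best_idx]

# First (smallest) C whose mean AUC is within one standard error of the best
one_se_idx = int(np.flatnonzero(means >= threshold)[0])
one_se_C = Cs[one_se_idx]

print(f"best C by mean AUC: {Cs[best_idx]}")
print(f"one-standard-error C: {one_se_C}")
```

Because `Cs` is sorted in ascending order, the first index that clears the threshold is the strongest regularization that is still statistically competitive.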

The Step-by-Step Decision Framework

Step 1: Establish a baseline
  → Train a simple model (logistic regression, decision tree d=3)
  → If val score is acceptable, done

Step 2: Is training accuracy low? (High Bias)
  → Try: more complex model, polynomial features, less regularization
  → If training accuracy improves but val doesn't → you now have high variance (go to Step 3)

Step 3: Is there a large train/val gap? (High Variance)
  → Try: regularization, reduce max_depth, get more data, use ensemble
  → Measure: does the gap shrink while keeping val accuracy acceptable?

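Step 4: If both are bad (low training accuracy and a large train/val gap)
  → Add more informative features or try a fundamentally different model class
  → Collect more data (reduces variance, but does not fix bias on its own)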

Step 5: Tune with cross-validation
  → Use validation curves or GridSearchCV with CV
  → Pick parameters that maximize mean CV score with acceptable std
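
Step 5 can be sketched with `GridSearchCV`. The grid, the random forest, and the synthetic data below are assumptions for illustration, not recommended settings. The point is that `cv_results_` exposes the mean and standard deviation of the validation score for every candidate, so the spread can be checked alongside the mean.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_gs, y_gs = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Hypothetical grid over two capacity controls
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
    return_train_score=True,   # needed to inspect the train/val gap per candidate
)
search.fit(X_gs, y_gs)

best = search.best_index_
mean_val = search.cv_results_["mean_test_score"][best]
std_val = search.cv_results_["std_test_score"][best]
mean_train = search.cv_results_["mean_train_score"][best]

print(f"best params: {search.best_params_}")
print(f"val={mean_val:.3f} ± {std_val:.3f}, train={mean_train:.3f}, gap={mean_train - mean_val:.3f}")
```

If the winning candidate has a large standard deviation or a large gap, it may be worth preferring a nearby, more constrained candidate with a similar mean.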

The Role of Data Size

Python
# More data shifts the bias-variance curve:
# - High-variance models become more stable
# - Optimal complexity shifts toward more complex models
# - Irreducible noise stays the same

# Rough rule of thumb for tabular data (a heuristic; it also depends on feature count and noise):
# < 100 samples:    simple models (logistic regression, shallow trees)
# 100-10K samples:  gradient boosting, random forest
# > 10K samples:    deep neural networks become viable

# Rule of thumb for neural networks:
# < 10K:  fine-tune a pre-trained model, don't train from scratch
# > 100K: train from scratch with sufficient regularization
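
The effect of data size on variance can be checked empirically with a learning curve. The sketch below is a minimal example on synthetic data (an assumption; the `_lc` names are hypothetical). It trains the same fixed-capacity tree on growing subsets and tracks the train/validation gap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_lc, y_lc = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.05, random_state=0
)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=8, random_state=0),
    X_lc, y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the available training folds
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Average over folds; one entry per training-set size
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, tr, va, g in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1), gaps):
    print(f"n={n:5d}: train={tr:.3f}, val={va:.3f}, gap={g:.3f}")
```

If the gap is still shrinking at the largest size, more data is likely to help. If the two curves have already converged at a low score, the problem is bias and more data will not fix it.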

Interview Answer Template

Q: How do you balance bias and variance in practice?

First, I diagnose: is the model underfitting (high bias — training accuracy is low) or overfitting (high variance — large train/val gap)? For high bias, I increase model capacity — try a more complex model, add polynomial features, or reduce regularization. For high variance, I regularize more aggressively, reduce model depth, use ensembles (Random Forest averages out variance), or collect more training data. The key tool is the validation curve: plot performance vs a complexity parameter (like max_depth or regularization strength) to find the sweet spot where validation accuracy is highest. I always measure using cross-validation rather than a single split to get a stable estimate with uncertainty. More data is the most reliable fix for high variance but isn't always available — regularization and ensembles are the practical alternatives.