Machine Learning Foundations · Lesson 60 of 70
Interview: Regularization in Practice
The Scenario
You're building a logistic regression model to predict whether a patient's warfarin dose needs adjustment (binary: dose_change / no_change). The dataset has 250 patients and 45 features derived from EHR data. The model achieves 94% training accuracy but only 68% validation accuracy. The dataset also has several groups of correlated features (multiple creatinine-based metrics, multiple INR-based metrics). How do you apply regularization?
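The snippets in the steps below assume that `X_train`, `y_train`, `X_val`, `y_val`, and `feature_names` already exist. To run them without the real EHR data, a synthetic stand-in can be sketched as follows. Everything here is an assumption made for illustration: the 100-row validation set, the placeholder feature names, and the `make_classification` settings are not part of the scenario, and the numbers it produces say nothing about real patients.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in only: NOT real patient data.
# n_redundant creates linear combinations of informative features,
# loosely mimicking the correlated creatinine/INR feature groups.
X, y = make_classification(
    n_samples=350,
    n_features=45,
    n_informative=8,
    n_redundant=12,
    flip_y=0.05,
    random_state=0,
)

# Placeholder names (hypothetical); real EHR features would have clinical names.
feature_names = [f"feature_{i:02d}" for i in range(X.shape[1])]

# 250 training rows to match the scenario, plus an assumed 100-row validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=250, stratify=y, random_state=0
)

print(X_train.shape, X_val.shape)
```

Because the placeholder names contain no "inr" substring, the INR-specific inspection in Step 5 would print nothing on this stand-in.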
Step 1: Confirm Overfitting
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score
import numpy as np
# Unregularized baseline: penalty=None turns the penalty term off entirely
# (requires scikit-learn 1.2+; older versions use penalty="none")
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(penalty=None, max_iter=2000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="roc_auc")
baseline.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, baseline.predict_proba(X_train)[:, 1])
print(f"Dataset: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Feature-to-sample ratio: {X_train.shape[1]/X_train.shape[0]:.2f}")
print(f"\nUnregularized model:")
print(f" Train AUC: {train_auc:.3f}")
print(f" CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f" Gap: {train_auc - cv_scores.mean():.3f} → high variance (overfitting)")
Step 2: Choose Between L1, L2, and Elastic Net
# Analysis of the situation:
# - 250 samples, 45 features → ratio 0.18 → regularization required
# - Correlated feature groups (creatinine-based, INR-based) → L2 preferred for stability
# - Unknown whether signal is sparse → Elastic Net as a hedge
# Compare all three
models_to_compare = [
("L2 (Ridge), C=1.0", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
("L2 (Ridge), C=0.1", LogisticRegression(penalty="l2", C=0.1, max_iter=1000)),
("L1 (Lasso), C=0.1", LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)),
("L1 (Lasso), C=1.0", LogisticRegression(penalty="l1", C=1.0, solver="liblinear", max_iter=1000)),
("Elastic Net, l1=0.3", LogisticRegression(penalty="elasticnet", C=0.5, l1_ratio=0.3, solver="saga", max_iter=2000)),
("Elastic Net, l1=0.7", LogisticRegression(penalty="elasticnet", C=0.5, l1_ratio=0.7, solver="saga", max_iter=2000)),
]
print(f"{'Model':<30} {'CV AUC':>8} {'Std':>6} {'Non-zero features':>18}")
print("-" * 68)
for name, model in models_to_compare:
    # Scale inside the pipeline so each CV fold is standardized on its own training split
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
    # Refit on the full training set to count surviving (non-zero) coefficients
    pipe.fit(X_train, y_train)
    coefs = pipe.named_steps["model"].coef_[0]
    n_nonzero = (coefs != 0).sum()
    print(f"{name:<30} {scores.mean():>8.3f} {scores.std():>6.3f} {n_nonzero:>18}")
Step 3: Tune Regularization Strength
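In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty and smaller coefficients. The sketch below illustrates this on synthetic data rather than the warfarin dataset; the `make_classification` settings and the three `C` values are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with the same shape as the scenario (250 x 45); illustrative only.
X_demo, y_demo = make_classification(
    n_samples=250, n_features=45, n_informative=8, flip_y=0.05, random_state=0
)

coef_norms = {}
for C in [100.0, 1.0, 0.01]:
    # Default penalty is L2; only the strength changes between fits.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    pipe.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(pipe[-1].coef_))
    print(f"C={C:>6}: ||w||_2 = {coef_norms[C]:.3f}")
```

The coefficient norm shrinks as `C` decreases, which is why the grid in this step spans several orders of magnitude on both sides of the default `C=1.0`.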
from sklearn.model_selection import GridSearchCV
# Given correlated features, L2 is the stable choice
# Tune C over a wide range
pipeline = Pipeline([
("scaler", StandardScaler()),
("model", LogisticRegression(penalty="l2", max_iter=1000)),
])
param_grid = {"model__C": [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print("Regularization path (L2):")
for params, mean, std in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f" C={params['model__C']:7}: AUC={mean:.3f} ± {std:.3f}")
print(f"\nBest C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")
Step 4: Verify the Fix
from sklearn.metrics import roc_auc_score, classification_report
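# GridSearchCV (refit=True by default) has already refit the best pipeline on all of X_train
best_pipeline = search.best_estimator_
train_auc_reg = roc_auc_score(y_train, best_pipeline.predict_proba(X_train)[:, 1])
val_auc_reg = roc_auc_score(y_val, best_pipeline.predict_proba(X_val)[:, 1])
# Note: this reuses the folds that selected C, so it is slightly optimistic;
# the held-out validation AUC is the cleaner check
cv_reg = cross_val_score(best_pipeline, X_train, y_train, cv=cv, scoring="roc_auc")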
print("=== Before vs After Regularization ===")
print(f" Train AUC CV AUC Gap")
print(f"No regularization: {train_auc:.3f} {cv_scores.mean():.3f} {train_auc - cv_scores.mean():.3f}")
print(f"L2 (best C): {train_auc_reg:.3f} {cv_reg.mean():.3f} {train_auc_reg - cv_reg.mean():.3f}")
print(f"\nValidation AUC: {val_auc_reg:.3f}")
print(classification_report(y_val, best_pipeline.predict(X_val), target_names=["no_change", "dose_change"]))
Step 5: Inspect Coefficients
coefs = best_pipeline.named_steps["model"].coef_[0]
# Features were standardized in the pipeline, so coefficient magnitudes are directly comparable
print("Top features by regularized coefficient magnitude:")
print(f"{'Feature':<30} {'Coefficient':>12} {'Direction':>16}")
print("-" * 60)
for name, coef in sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True)[:10]:
    # The target is binary (dose_change vs no_change), so the sign indicates which class
    # the feature pushes toward, not whether the dose goes up or down.
    # Assumes dose_change is encoded as 1.
    direction = "→ dose_change" if coef > 0 else "→ no_change"
    print(f"{name:<30} {coef:>12.4f} {direction:>16}")
# Correlated features (INR-based):
# L2 should distribute weight across them rather than picking one
inr_features = [n for n in feature_names if "inr" in n.lower()]
print(f"\nINR-based feature coefficients (expect distributed, not zeroed):")
for name in inr_features:
    idx = feature_names.index(name)
    print(f" {name}: {coefs[idx]:.4f}")
Explaining to a Non-Technical Stakeholder
# How to communicate regularization to a clinical audience
explanation = """
The first version of the model was "overfitting" — it had memorized
the specific 250 patients in the training set rather than learning
a general rule for warfarin dose adjustment.
Think of it like a medical student who memorizes case studies
verbatim instead of understanding the underlying physiology.
They score 100% on recall, but apply the wrong treatment
to a patient who doesn't match their memorized cases exactly.
The fix (regularization) penalizes the model for being too specific.
It forces the model to find patterns that hold across many patients,
not just the ones it was trained on.
Result: the model's training performance dropped from 94% to 81% —
but its validation performance improved from 68% to 79%.
It's less "impressive" on training data, but actually more useful
for real patients.
"""
print(explanation)
What Interviewers Want to Hear
- Diagnose first — confirm overfitting with the train/CV gap, not just training accuracy
- Justify the choice — correlated features → L2 preferred over L1
- Tune with cross-validation — not just pick C=1.0 by default
- Verify the fix — compare train AUC, CV AUC, and gap before and after
- Inspect coefficients — confirm correlated features are distributed (L2) not zeroed
- Clinical translation — be ready to explain regularization without jargon
One-line answer: "A large train/validation gap with 45 features and 250 patients means the model is overfitting. I'd apply L2 regularization (Ridge) because the correlated creatinine and INR feature groups make L1 unstable: it tends to keep one feature from each group and zero the rest, and which one survives can change from one resample to the next. I'd tune C by cross-validation over a log-spaced grid. As a rough prior I'd expect the best C to land around 0.1–0.3 for this sample size, but the grid decides. After regularization, I'd verify that the train/CV gap narrows while CV AUC improves."
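The claim that L2 spreads weight across correlated features while L1 tends to concentrate it can be checked on a toy problem. The sketch below is a minimal illustration on synthetic data, not on the warfarin features: `x2` is built as a noisy copy of `x1`, only `x1` drives the label, and the value `C=0.1` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 250

# x2 is a near-duplicate of x1 (highly correlated); the other three columns are pure noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
noise = rng.normal(size=(n, 3))
X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2, noise]))

# The label depends on x1 only, through a logistic link.
p = 1.0 / (1.0 + np.exp(-1.5 * x1))
y_demo = (rng.random(n) < p).astype(int)

# Same strength for both fits; only the penalty type differs.
l2_model = LogisticRegression(C=0.1, max_iter=1000).fit(X_demo, y_demo)
l1_model = LogisticRegression(
    penalty="l1", C=0.1, solver="liblinear", max_iter=1000
).fit(X_demo, y_demo)

l2_coefs = l2_model.coef_[0]
l1_coefs = l1_model.coef_[0]
print("L2 coefficients on (x1, x2):", np.round(l2_coefs[:2], 3))
print("L1 coefficients on (x1, x2):", np.round(l1_coefs[:2], 3))
print("Non-zero coefficients  L2:", np.count_nonzero(l2_coefs),
      " L1:", np.count_nonzero(l1_coefs))
```

With L2, both members of the correlated pair receive positive weight. With L1, the weight often (though not always) collapses onto one of the two, and rerunning with a different seed can change which one is kept.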