Machine Learning Foundations · Lesson 60 of 70

Interview: Regularization in Practice

The Scenario

You're building a logistic regression model to predict whether a patient's warfarin dose needs adjustment (binary: dose_change / no_change). The dataset has 250 patients and 45 features derived from EHR data. The model achieves 94% training accuracy but only 68% validation accuracy. The dataset also has several groups of correlated features (multiple creatinine-based metrics, multiple INR-based metrics). How do you apply regularization?

Step 1: Confirm Overfitting

Python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score
import numpy as np

# No regularization: penalty=None removes the penalty term entirely (requires scikit-learn >= 1.2)
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(penalty=None, max_iter=2000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="roc_auc")

baseline.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, baseline.predict_proba(X_train)[:, 1])

print(f"Dataset: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Feature-to-sample ratio: {X_train.shape[1]/X_train.shape[0]:.2f}")
print(f"\nUnregularized model:")
print(f"  Train AUC:  {train_auc:.3f}")
print(f"  CV AUC:     {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f"  Gap:        {train_auc - cv_scores.mean():.3f}  → high variance (overfitting)")

Step 2: Choose Between L1, L2, and Elastic Net

Python

# Analysis of the situation:
# - 250 samples, 45 features → ratio 0.18 → regularization required
# - Correlated feature groups (creatinine-based, INR-based) → L2 preferred for stability
# - Unknown whether signal is sparse → Elastic Net as a hedge

# Compare all three
models_to_compare = [
    ("L2 (Ridge), C=1.0",     LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
    ("L2 (Ridge), C=0.1",     LogisticRegression(penalty="l2", C=0.1, max_iter=1000)),
    ("L1 (Lasso), C=0.1",     LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)),
    ("L1 (Lasso), C=1.0",     LogisticRegression(penalty="l1", C=1.0, solver="liblinear", max_iter=1000)),
    ("Elastic Net, l1=0.3",   LogisticRegression(penalty="elasticnet", C=0.5, l1_ratio=0.3, solver="saga", max_iter=2000)),
    ("Elastic Net, l1=0.7",   LogisticRegression(penalty="elasticnet", C=0.5, l1_ratio=0.7, solver="saga", max_iter=2000)),
]

print(f"{'Model':<30}  {'CV AUC':>8}  {'Std':>6}  {'Non-zero features':>18}")
print("-" * 68)

for name, model in models_to_compare:
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
    pipe.fit(X_train, y_train)
    coefs = pipe.named_steps["model"].coef_[0]
    n_nonzero = (coefs != 0).sum()
    print(f"{name:<30}  {scores.mean():>8.3f}  {scores.std():>6.3f}  {n_nonzero:>18}")
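
The stability argument behind Step 2 (L2 spreads weight across correlated features, L1 keeps one and drops the rest) can be seen in a toy case. The sketch below is only an illustration under stated assumptions: it uses squared-error (linear) regression rather than logistic loss, two exactly duplicated columns as the limiting case of correlation, and hand-picked numbers. The helper names `ridge_two_features` and `lasso_two_features` are hypothetical and are not part of scikit-learn.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ridge_two_features(x1, x2, y, lam):
    """Closed-form minimizer of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2 for two columns."""
    a, b, d = dot(x1, x1) + lam, dot(x1, x2), dot(x2, x2) + lam
    r1, r2 = dot(x1, y), dot(x2, y)
    det = a * d - b * b
    return [(d * r1 - b * r2) / det, (a * r2 - b * r1) / det]

def lasso_two_features(x1, x2, y, lam, order=(0, 1), n_sweeps=50):
    """Coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1, starting from w = 0."""
    cols = [x1, x2]
    w = [0.0, 0.0]
    for _ in range(n_sweeps):
        for j in order:
            k = 1 - j
            # Target with the other feature's current contribution removed
            partial = [yi - cols[k][i] * w[k] for i, yi in enumerate(y)]
            w[j] = soft_threshold(dot(cols[j], partial), lam) / dot(cols[j], cols[j])
    return w

# Toy data (made up): the second column is an exact copy of the first
x1 = [-2.0, -2.0, 0.0, 2.0, 2.0]
x2 = list(x1)
y = [-4.5, -3.5, 0.0, 3.5, 4.5]

print("ridge:", ridge_two_features(x1, x2, y, lam=8.0))                         # [0.8, 0.8]
print("lasso, order (0, 1):", lasso_two_features(x1, x2, y, lam=8.0))           # [1.5, 0.0]
print("lasso, order (1, 0):", lasso_two_features(x1, x2, y, lam=8.0, order=(1, 0)))  # [0.0, 1.5]
```

Ridge gives both copies the same weight, while the L1 solution depends on which coordinate happens to be updated first. In the lesson's setting the same effect would show up as different creatinine or INR features being selected from one resample to the next.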

Step 3: Tune Regularization Strength

Python

from sklearn.model_selection import GridSearchCV

# Given correlated features, L2 is the stable choice
# Tune C over a wide range

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(penalty="l2", max_iter=1000)),
])

param_grid = {"model__C": [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

print("Regularization path (L2):")
for params, mean, std in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"  C={params['model__C']:7}: AUC={mean:.3f} ± {std:.3f}")

print(f"\nBest C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")
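
With only 250 patients, the mean CV AUC for neighbouring values of C often differs by less than the fold-to-fold noise. One common heuristic, which is not what `GridSearchCV` does by default (its `best_params_` is simply the highest mean score), is the one-standard-error rule: take the most strongly regularized model whose score is within one standard error of the best. The sketch below is a minimal version of that rule; the function name `pick_c_one_se` and the example numbers are hypothetical.

```python
import math

def pick_c_one_se(cs, mean_scores, std_scores, n_folds):
    """Return the smallest C whose mean CV score is within one standard
    error of the best mean score (smaller C = stronger regularization)."""
    best = max(range(len(cs)), key=lambda i: mean_scores[i])
    # Standard error of the best configuration's mean fold score
    threshold = mean_scores[best] - std_scores[best] / math.sqrt(n_folds)
    eligible = [c for c, m in zip(cs, mean_scores) if m >= threshold]
    return min(eligible)

# Illustrative numbers only, not results from the warfarin dataset
cs = [0.01, 0.1, 1.0, 10.0]
means = [0.70, 0.78, 0.79, 0.75]
stds = [0.04, 0.04, 0.05, 0.06]
print(pick_c_one_se(cs, means, stds, n_folds=5))  # 0.1
```

In the lesson's code, the inputs would come from `search.cv_results_`: the `model__C` entry of each item in `"params"`, plus `"mean_test_score"` and `"std_test_score"`, with `n_folds=5` to match the `StratifiedKFold` above.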

Step 4: Verify the Fix

Python

from sklearn.metrics import roc_auc_score, classification_report

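# GridSearchCV refits the best pipeline on all of X_train by default (refit=True),
# so best_estimator_ is already fitted and needs no extra fit call.
best_pipeline = search.best_estimator_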

train_auc_reg = roc_auc_score(y_train, best_pipeline.predict_proba(X_train)[:, 1])
val_auc_reg   = roc_auc_score(y_val, best_pipeline.predict_proba(X_val)[:, 1])

cv_reg = cross_val_score(best_pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

print("=== Before vs After Regularization ===")
print(f"                    Train AUC    CV AUC     Gap")
print(f"No regularization:  {train_auc:.3f}        {cv_scores.mean():.3f}      {train_auc - cv_scores.mean():.3f}")
print(f"L2 (best C):        {train_auc_reg:.3f}        {cv_reg.mean():.3f}      {train_auc_reg - cv_reg.mean():.3f}")
print(f"\nValidation AUC: {val_auc_reg:.3f}")
print(classification_report(y_val, best_pipeline.predict(X_val), target_names=["no_change", "dose_change"]))
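
A single validation AUC from a small hold-out set is itself noisy, so it can help to attach an interval to it before claiming the fix worked. The sketch below is one way to do that with a percentile bootstrap in plain Python. The function names `auc` and `bootstrap_auc_interval` are hypothetical, labels are assumed to be encoded as 0/1 with both classes present, and the pairwise AUC is written for clarity rather than speed.

```python
import random

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_interval(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC; assumes y_true contains both classes."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Tiny made-up example
y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
print(auc(y_demo, s_demo))  # 0.75
print(bootstrap_auc_interval(y_demo, s_demo, n_boot=200))
```

In the lesson's code this would be called with `list(y_val)` and `list(best_pipeline.predict_proba(X_val)[:, 1])`. A wide interval suggests that the before/after comparison should lean on the cross-validated scores rather than on the single validation number.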

Step 5: Inspect Coefficients

Python

import numpy as np

coefs = best_pipeline.named_steps["model"].coef_[0]
scaler = best_pipeline.named_steps["scaler"]

# Standardized coefficients: directly comparable in magnitude
print("Top features by regularized coefficient magnitude:")
print(f"{'Feature':<30}  {'Coefficient':>12}  {'Direction':>10}")
print("-" * 55)
for name, coef in sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True)[:10]:
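    # The target is binary, so a positive coefficient raises the predicted odds
    # of the positive class (dose_change); it says nothing about dose direction.
    direction = "→ dose_change" if coef > 0 else "→ no_change"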
    print(f"{name:<30}  {coef:>12.4f}  {direction:>10}")

# Correlated features (INR-based):
# L2 should distribute weight across them rather than picking one
inr_features = [n for n in feature_names if "inr" in n.lower()]
print(f"\nINR-based feature coefficients (expect distributed, not zeroed):")
for name in inr_features:
    idx = feature_names.index(name)
    print(f"  {name}: {coefs[idx]:.4f}")
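
Because the pipeline standardizes features before fitting, each coefficient is a change in log-odds per one standard deviation of that feature. Clinical audiences usually find odds ratios easier to read, so a small conversion like the one below may help. This is a sketch with hypothetical helper names and made-up example values. In the lesson's pipeline the per-feature standard deviations would come from the fitted `StandardScaler`'s `scale_` attribute.

```python
import math

def odds_ratio_per_sd(coef):
    """Odds multiplier for a one-standard-deviation increase in a standardized feature."""
    return math.exp(coef)

def odds_ratio_per_unit(coef, feature_sd):
    """Odds multiplier for a one-unit increase on the feature's original scale."""
    return math.exp(coef / feature_sd)

# Hypothetical values for illustration only (not fitted results)
example_coef = 0.40   # standardized coefficient
example_sd = 0.5      # standard deviation of the raw feature
print(f"Odds ratio per SD:   {odds_ratio_per_sd(example_coef):.2f}")
print(f"Odds ratio per unit: {odds_ratio_per_unit(example_coef, example_sd):.2f}")
```

These ratios describe the regularized model, so they are shrunk toward 1 and should be read as predictive weights rather than as unbiased effect estimates.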

Explaining to a Non-Technical Stakeholder

Python

# How to communicate regularization to a clinical audience

explanation = """
The first version of the model was "overfitting" — it had memorized 
the specific 250 patients in the training set rather than learning 
a general rule for warfarin dose adjustment.

Think of it like a medical student who memorizes case studies 
verbatim instead of understanding the underlying physiology. 
They score 100% on recall, but apply the wrong treatment 
to a patient who doesn't match their memorized cases exactly.

The fix (regularization) penalizes the model for being too specific.
It forces the model to find patterns that hold across many patients,
not just the ones it was trained on.

Result: the model's training accuracy dropped from 94% to 81%,
but its validation accuracy improved from 68% to 79%.
It's less "impressive" on training data, but actually more useful 
for real patients.
"""
print(explanation)

What Interviewers Want to Hear

Diagnose first: confirm overfitting with the train/CV gap, not just training accuracy
Justify the choice: correlated features → L2 preferred over L1
Tune with cross-validation: search over C rather than accepting the default C=1.0
Verify the fix: compare train AUC, CV AUC, and the gap before and after
Inspect coefficients: confirm weight is spread across correlated features (L2) rather than zeroed out
Clinical translation: be ready to explain regularization without jargon

One-line answer: "A large train/validation gap with 45 features and only 250 patients means the model is overfitting. I'd apply L2 regularization (Ridge) because the correlated creatinine and INR feature groups make L1 unstable: it tends to keep one feature per group and zero the rest more or less arbitrarily. I'd tune C by cross-validation over a wide log-spaced grid. With this sample size I'd expect the optimum to sit below the default C=1.0, but I'd let the CV curve decide. After regularization, I'd verify that the train/CV gap narrows and that CV AUC holds or improves."
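
For reference, the role of C in the answer above can be written out. The block below follows, as far as I recall it, the formulation in scikit-learn's user guide for binary logistic regression, with labels coded as $y_i \in \{-1, +1\}$, coefficients $w$, intercept $b$, and $\rho$ standing for `l1_ratio`. Treat the notation as a sketch and check it against the documentation for the version in use.

```latex
\begin{aligned}
\text{L2:}\quad
  & \min_{w,\,b}\ \tfrac{1}{2}\, w^\top w
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^\top w + b)}\right) \\
\text{L1:}\quad
  & \min_{w,\,b}\ \lVert w \rVert_1
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^\top w + b)}\right) \\
\text{Elastic Net:}\quad
  & \min_{w,\,b}\ \tfrac{1-\rho}{2}\, w^\top w + \rho\, \lVert w \rVert_1
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^\top w + b)}\right)
\end{aligned}
```

C multiplies the data-fit term, so a smaller C means a relatively stronger penalty. Because the loss is a sum over samples, the same C also regularizes more heavily on a small dataset than on a large one, which is one reason to tune it rather than reuse a value from another project.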
