
Interview: Bias-Variance Real Scenario

Interview walk-through: diagnose bias-variance problems in a clinical readmission model — with a step-by-step approach covering diagnosis, root cause, targeted fixes, and tradeoff discussion.

Asma Hafeez Khan · May 16, 2026 · 7 min read
Tags: Machine Learning, Interview, Bias-Variance Tradeoff, Diagnostics, Clinical AI

The Scenario

You built a logistic regression model to predict 30-day hospital readmission for diabetic patients. On a held-out validation set of 500 patients, the model achieves 61% accuracy — barely better than always predicting "no readmission" (58% of patients aren't readmitted). Your manager asks: is this a bias or variance problem, and what should you do?
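
Before reading much into a 3-point lift, it helps to ask how much of it could be sampling noise on 500 patients. The following is a rough sketch, assuming the 58% baseline rate is treated as known and the validation patients are independent (both simplifications); the variable names are illustrative.

```python
import math

n_val = 500          # validation patients (from the scenario)
baseline_acc = 0.58  # accuracy of always predicting "no readmission"
model_acc = 0.61     # observed validation accuracy

# Standard error of an accuracy estimate under a simple binomial model
se = math.sqrt(baseline_acc * (1 - baseline_acc) / n_val)
margin_95 = 1.96 * se            # approximate 95% margin around the baseline
lift = model_acc - baseline_acc  # observed improvement over the baseline

print(f"Standard error:        {se:.4f}")
print(f"95% margin:            ±{margin_95:.4f}")
print(f"Observed lift:         {lift:.4f}")
print(f"Lift within the noise: {lift < margin_95}")
```

Under these assumptions the margin is roughly 4 points, so a 3-point lift is not clearly distinguishable from the baseline on accuracy alone. That is one reason the diagnosis should also look at AUC and cross-validation.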


Step 1: Gather the Numbers

Before diagnosing, collect the full picture.

Python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Metrics you need to diagnose
# (assumes `model` is a LogisticRegression already fitted on X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc   = accuracy_score(y_val,   model.predict(X_val))
gap       = train_acc - val_acc

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc   = roc_auc_score(y_val,   model.predict_proba(X_val)[:, 1])

# Cross-validation for stability estimate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

print(f"Train accuracy:  {train_acc:.3f}")
print(f"Val accuracy:    {val_acc:.3f}  (baseline: 0.58)")
print(f"Train/val gap:   {gap:.3f}")
print(f"Train AUC:       {train_auc:.3f}")
print(f"Val AUC:         {val_auc:.3f}")
print(f"CV AUC:          {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Scenario numbers:
# train_acc = 0.63, val_acc = 0.61, gap = 0.02
# train_auc = 0.64, val_auc = 0.62
# CV AUC = 0.63 ± 0.02  ← consistent, low std

Step 2: Interpret the Numbers

Train accuracy:  0.63  ← only 5 points above the 0.58 baseline
Val accuracy:    0.61  ← similar to train (gap is small)
Train/val gap:   0.02  ← NOT overfitting
CV AUC:          0.63 ± 0.02  ← stable across folds (low variance)

Conclusion: HIGH BIAS (underfitting)
The model is consistent but consistently wrong — it can't capture the signal.

The key insight: a small train/val gap rules out high variance. The model is performing similarly on train and validation — both are bad. This is the hallmark of high bias.
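
A learning curve is a common way to double-check this reading. The sketch below uses scikit-learn's `learning_curve` on synthetic data generated with `make_classification`; the data is only a stand-in with the same shape as the scenario (800 rows, 8 features), not the clinical dataset, and the noise level is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in for the 800 x 8 training set (NOT the clinical data)
X_demo, y_demo = make_classification(
    n_samples=800, n_features=8, n_informative=3, n_redundant=0,
    flip_y=0.3, random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train on growing subsets and score each on the training and held-out folds
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=cv, scoring="roc_auc",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train AUC={tr:.3f}  val AUC={va:.3f}  gap={tr - va:.3f}")
```

In a high-bias regime the two curves tend to converge early and plateau at a low score, so collecting more rows alone is unlikely to help. A gap that stays wide as n grows points to variance instead.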


Step 3: Identify Root Causes

Python
# Root cause investigation

# 1. What features are available?
print(f"Feature matrix shape: {X_train.shape}")
# → (800, 8)  — only 8 features for a complex clinical problem

# 2. What's in the features?
feature_names = ["age", "gender", "num_diagnoses", "num_medications",
                 "prior_admissions", "length_of_stay", "discharge_to",
                 "insurance_type"]
# Missing: HbA1c levels, medication adherence, lab trends, social determinants

# 3. Is the model class too simple?
print(model.get_params())
# LogisticRegression(C=1.0)  ← linear decision boundary
# Readmission is a complex, non-linear phenomenon

# 4. Baseline comparison
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy="most_frequent")
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Dummy AUC:     {baseline_cv.mean():.3f}")   # ~0.50
print(f"Model AUC:     {cv_scores.mean():.3f}")      # 0.63
# Model beats baseline, but only slightly — signal is there but model can't capture it

Root causes:

  1. Too few features — missing key clinical predictors (HbA1c, lab trends, social factors)
  2. Model too simple — linear boundary for a non-linear problem
  3. Missing feature interactions — age × medication count, prior admissions × discharge type
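
One way to tell root cause 2 (model too simple) apart from root cause 1 (features too weak) is to fit a flexible model on the same features. If it lifts the score, the model class was the bottleneck; if both models plateau together, the features are the limit. The sketch below illustrates the first case on synthetic data where the label depends on a purely non-linear pattern; the data and thresholds are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(800, 8))  # 8 synthetic features, 6 of them pure noise

# Label depends on distance from the origin in the first two features,
# a pattern that a linear decision boundary cannot represent
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 > 1.4).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
).mean()
rf_auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42),
    X_demo, y_demo, cv=cv, scoring="roc_auc",
).mean()

print(f"Logistic regression AUC: {lr_auc:.3f}")
print(f"Random forest AUC:       {rf_auc:.3f}")
print(f"Lift from flexibility:   {rf_auc - lr_auc:.3f}")
```

A large lift like this says the signal is already in the features and the linear model could not use it. A small lift on the real data would instead suggest that better features matter more than a bigger model.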

Step 4: Apply Targeted High-Bias Fixes

Python
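from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fix 1: Add interaction features (model complexity)
# Scale first: raw products such as age * num_medications span very different
# ranges, which slows solver convergence and makes the L2 penalty uneven.
poly_model = Pipeline([
    ("scale", StandardScaler()),
    ("poly",  PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lr",    LogisticRegression(C=1.0, max_iter=1000)),
])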
poly_cv = cross_val_score(poly_model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"Logistic + interactions: {poly_cv.mean():.3f} ± {poly_cv.std():.3f}")

# Fix 2: Switch to non-linear model
rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
rf_cv = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"Random Forest:           {rf_cv.mean():.3f} ± {rf_cv.std():.3f}")

# Fix 3: Gradient boosting (often best for tabular clinical data)
gbm = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    random_state=42,
)
gbm_cv = cross_val_score(gbm, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"Gradient Boosting:       {gbm_cv.mean():.3f} ± {gbm_cv.std():.3f}")

# Fix 4: Reduce regularization on logistic regression
lr_less_reg = LogisticRegression(C=100, max_iter=1000)
lr_cv = cross_val_score(lr_less_reg, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"Logistic C=100:          {lr_cv.mean():.3f} ± {lr_cv.std():.3f}")

Step 5: Add Features (Highest Leverage Fix)

Python
import pandas as pd

# The most reliable fix for high bias: give the model better signal
# Clinical features that predict readmission most reliably:

def engineer_readmission_features(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()

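    # Medication burden (proxy for disease complexity)
    # Clip to at least 1 day so same-day discharges (length_of_stay == 0) don't divide by zero
    features["med_burden"] = features["num_medications"] / features["length_of_stay"].clip(lower=1)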

    # Repeat visitor pattern
    features["high_utilizer"] = (features["prior_admissions"] >= 3).astype(int)

    # Discharge risk (home alone = highest risk)
    discharge_risk = {"home": 1, "home_with_help": 0.5, "snf": 0.3, "rehab": 0.2}
    features["discharge_risk"] = features["discharge_to"].map(discharge_risk).fillna(0.5)

    # Age-medication interaction
    features["age_x_meds"] = features["age"] * features["num_medications"]

    # Complex patient indicator
    features["complex"] = (
        (features["num_diagnoses"] >= 5) & (features["num_medications"] >= 10)
    ).astype(int)

    return features

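# X_df: the training rows as a raw DataFrame, in the same row order as y_train.
# Assumption: the categorical columns are still strings at this point, so
# one-hot encode them after engineering to give the booster numeric input.
X_engineered = pd.get_dummies(
    engineer_readmission_features(X_df),
    columns=["gender", "discharge_to", "insurance_type"],
)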

gbm_engineered = cross_val_score(
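    # Same settings as the Step 4 GBM, so only the features change
    GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.05,
        subsample=0.8, random_state=42,
    ),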
    X_engineered, y_train, cv=cv, scoring="roc_auc"
)
print(f"GBM + engineered features: {gbm_engineered.mean():.3f} ± {gbm_engineered.std():.3f}")

Step 6: Compare Results

Python
results = [
    ("Baseline (majority class)",        0.50),
    ("Logistic Regression (original)",   0.63),
    ("Logistic + interactions",          0.66),
    ("Logistic C=100",                   0.64),
    ("Random Forest",                    0.70),
    ("Gradient Boosting",                0.74),
    ("GBM + engineered features",        0.78),
]

print(f"{'Model':<35} {'CV AUC':<8}")
print("-" * 45)
for name, auc in results:
    bar = "█" * int(auc * 20)
    print(f"{name:<35} {auc:.3f}  {bar}")

# Key: did variance increase as we fixed bias?
# GBM original:              std = 0.03  (acceptable)
# GBM + features:            std = 0.04  (still acceptable)
# If std jumped to 0.10+ → bias fix caused high variance → need regularization
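
The table above hardcodes the scenario's numbers. In practice the same table, including the standard deviation column that the variance check relies on, can be collected directly from cross-validation. A minimal sketch on synthetic data (the dataset and the reduced `n_estimators` are assumptions chosen to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the clinical dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "Baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=3, random_state=42
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score every candidate on the same folds and keep mean and std together
rows = []
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="roc_auc")
    rows.append((name, scores.mean(), scores.std()))

print(f"{'Model':<30} {'CV AUC':>7} {'std':>6}")
print("-" * 45)
for name, mean, std in rows:
    print(f"{name:<30} {mean:>7.3f} {std:>6.3f}")
```

Using one shared `cv` object keeps the folds identical across models, so differences in the std column reflect the models rather than the splits.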

Anticipating the Follow-Up: Did We Introduce Variance?

Python
# After fixing bias, always re-check for variance
def full_diagnosis(model, X, y, label="Model"):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

    model.fit(X, y)
    train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

    print(f"\n{label}")
    print(f"  Train AUC:      {train_auc:.3f}")
    print(f"  CV AUC:         {scores.mean():.3f} ± {scores.std():.3f}")
    print(f"  Gap:            {train_auc - scores.mean():.3f}")

    if train_auc - scores.mean() > 0.10:
        print("  → HIGH VARIANCE (overfitting): add regularization")
    elif scores.mean() < 0.65:
        print("  → HIGH BIAS: still underfitting")
    else:
        print("  → GOOD FIT")

full_diagnosis(
    # Same settings as the Step 4 GBM
    GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.05,
        subsample=0.8, random_state=42,
    ),
    X_engineered, y_train,
    label="GBM + engineered features"
)

The Tricky Follow-Up Question

Interviewer: "The GBM with engineered features gives CV AUC of 0.78, but train AUC is 0.91. Is that a problem?"

Train AUC:  0.91
CV AUC:     0.78 ± 0.03
Gap:        0.13  ← above the 0.10 threshold → now has HIGH VARIANCE

The high-bias fix introduced some high variance.
This is expected: more complex models overfit more.

Python
# Fix the new variance problem
gbm_regularized = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=3,           # Reduced from 4
    min_samples_leaf=10,   # Added
    subsample=0.7,         # Reduced from 0.8
    learning_rate=0.03,    # Slower
    random_state=42,
)

full_diagnosis(gbm_regularized, X_engineered, y_train, label="GBM regularized")
# Target: train AUC ~0.83, CV AUC ~0.79, gap ~0.04
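
Hand-picking regularization settings works for an interview answer, but the same idea can be searched systematically. The sketch below uses `GridSearchCV` with `return_train_score=True` so the train/CV gap is visible for every setting. The synthetic data, the small grid, and the reduced `n_estimators` are assumptions made to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data with label noise (NOT the clinical dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.2, random_state=0
)

# Two capacity knobs from the regularized GBM: tree depth and leaf size
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.05, subsample=0.8, random_state=42
    ),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="roc_auc",
    return_train_score=True,  # needed to compute the train/CV gap
)
search.fit(X_demo, y_demo)

res = search.cv_results_
for params, tr, cvm in zip(res["params"], res["mean_train_score"], res["mean_test_score"]):
    print(f"{params}  train={tr:.3f}  cv={cvm:.3f}  gap={tr - cvm:.3f}")

print(f"Best by CV AUC: {search.best_params_}")
```

The setting with the best CV AUC is not always the one with the smallest gap. When two settings score about the same, preferring the smaller gap is a reasonable tie-breaker.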

Interview Summary

Q: Your clinical model reaches 61% validation accuracy against a 58% majority-class baseline. Is this bias or variance?

The key signal is the train/val gap. Training accuracy is 63%, only 2 points above validation, so the model behaves the same on data it has and hasn't seen. That rules out high variance (overfitting). The model is almost as weak on its own training data, which is the hallmark of high bias (underfitting).

My approach: first confirm with CV scores (low std = low variance), then look for root causes: too few features, a model that is too simple, missing interactions. The fix is to increase model capacity: switch from logistic regression to gradient boosting, add engineered features (clinical interactions like age × medication count), and reduce regularization. After fixing bias, re-check for variance, because more complex models can overfit the new feature space. The iterative cycle is: diagnose → fix the dominant problem → re-diagnose → fix again, until both are acceptable.
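
The decision rule behind that cycle can be written as a small pure function. This sketch reuses the 0.10 gap and 0.65 AUC thresholds from `full_diagnosis`; they are rules of thumb for this walkthrough, not universal cutoffs, and the function name is illustrative.

```python
def diagnose(train_auc: float, cv_auc: float,
             max_gap: float = 0.10, min_auc: float = 0.65) -> str:
    """Label a fit using rule-of-thumb thresholds for gap and CV score."""
    gap = train_auc - cv_auc
    if gap > max_gap:
        return "high variance"  # train far above CV: overfitting
    if cv_auc < min_auc:
        return "high bias"      # small gap but weak score: underfitting
    return "good fit"


# (label, train AUC, CV AUC) at each stage of the walkthrough
stages = [
    ("Logistic regression (original)", 0.64, 0.63),
    ("GBM + engineered features",      0.91, 0.78),
    ("GBM regularized (target)",       0.83, 0.79),
]

for name, train_auc, cv_auc in stages:
    print(f"{name:<32} -> {diagnose(train_auc, cv_auc)}")
```

Checking the gap before the absolute score mirrors the order in the answer: rule variance in or out first, then judge whether the remaining score is good enough.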


What Interviewers Want to Hear

  1. Distinguish bias from variance using the train/val gap, not just the absolute score
  2. Connect underfitting to root causes — too simple, too few features, too much regularization
  3. Know the fix order — features first (highest leverage), then model complexity, then regularization tuning
  4. Anticipate the tradeoff — fixing bias may introduce variance; show you'd monitor for it
  5. Use cross-validation — not a single val split — to distinguish true bias from split-specific variance
