Interview: ROC-AUC and Threshold Deep Dive
Interview walk-through: explain ROC-AUC to a clinical stakeholder, choose between ROC and PR curves, tune a threshold for a sepsis model, and diagnose a model with excellent AUC but poor real-world recall.
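Every scenario in this walk-through leans on the pairwise reading of AUC: the chance that a randomly chosen positive case scores above a randomly chosen negative one. As a minimal sketch of that definition, the snippet below counts correctly ordered pairs on made-up scores. The names `pairwise_auc`, `pos_scores`, and `neg_scores` and all values are hypothetical, chosen for illustration, and are not output from any model discussed here.

```python
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores, for illustration only
pos_scores = [0.9, 0.7, 0.6, 0.35]             # patients who had the outcome
neg_scores = [0.8, 0.5, 0.4, 0.3, 0.2, 0.1]    # patients who did not

auc = pairwise_auc(pos_scores, neg_scores)
print(f"Pairwise AUC: {auc:.3f}")  # 19 of 24 pairs are ordered correctly
```

For binary labels, this pairwise fraction (with ties counted as one half) is the same quantity that `sklearn.metrics.roc_auc_score` computes from the ROC curve.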
Scenario 1: Explain AUC to a Clinician
"Our model has an AUC of 0.84. What does that mean in practice?"
from textwrap import dedent

def explain_auc_to_clinician(auc: float, positive_class: str = "high-risk") -> str:
    # round() rather than int(): int(0.29 * 100) truncates to 28
    pct = round(auc * 100)
    random_pct = 50
    return dedent(f"""
        AUC = {auc:.2f} means:
        If we randomly select one {positive_class} patient and one low-risk patient
        and show both to the model, it correctly identifies which is {positive_class}
        {pct}% of the time.

        A model with no useful information would be right {random_pct}% of the time
        (the same as a coin flip between two patients).
        So our model is correct {pct - random_pct} percentage points more often than chance.

        Clinical implication: the model's risk scores meaningfully separate {positive_class}
        from low-risk patients, but which patients get flagged depends on the threshold we choose.
    """)

print(explain_auc_to_clinician(0.84, "at-risk for readmission"))

Scenario 2: ROC vs PR — Which Curve?
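At low prevalence, counting alarms makes the gap between the two curves concrete. The sketch below is plain arithmetic on a hypothetical cohort of 1,000 patients with 30 true infections. The 27 detected cases and 100 false alarms are assumed numbers chosen for illustration, not results from a real model.

```python
# Hypothetical confusion counts at 3% prevalence (illustrative, not measured)
n_patients = 1000
n_pos = 30                   # 3% of patients are truly infected
n_neg = n_patients - n_pos   # 970 uninfected patients

tp = 27                      # assumed: the model catches 27 of 30 infections
fp = 100                     # assumed: 100 uninfected patients are flagged

recall = tp / n_pos
fpr = fp / n_neg             # x-axis of the ROC curve
precision = tp / (tp + fp)   # y-axis of the PR curve

print(f"Recall:    {recall:.3f}")
print(f"FPR:       {fpr:.3f}")
print(f"Precision: {precision:.3f}")
```

An FPR near 0.10 looks harmless on a ROC plot, yet roughly four of every five alarms are false, and only the precision figure shows it.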
"You have a model for detecting rare antibiotic-resistant infections — only 3% of patients have it. Should you use ROC or precision-recall?"
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulate: 3% positive rate, 1000 patients
np.random.seed(42)
n = 1000
y_true = np.zeros(n, dtype=int)
y_true[-30:] = 1  # 3% positive; positives go last to match the score order below

# Model A: moderate discrimination
y_proba_a = np.concatenate([
    np.random.beta(1, 6, 970),  # negatives
    np.random.beta(3, 2, 30),   # positives
])

# Model B: better discrimination
y_proba_b = np.concatenate([
    np.random.beta(1, 9, 970),  # negatives
    np.random.beta(7, 1, 30),   # positives
])

for name, proba in [("Model A", y_proba_a), ("Model B", y_proba_b)]:
    auc_roc = roc_auc_score(y_true, proba)
    auc_pr = average_precision_score(y_true, proba)
    print(f"{name}: AUC-ROC={auc_roc:.3f}, AUC-PR={auc_pr:.3f}")

print("""
Answer: Use precision-recall for 3% prevalence.
Why:
- ROC looks at FPR = FP/(FP+TN). With 970 negatives, even 100 FPs gives FPR=0.10 (looks OK)
- PR ignores TN entirely and directly measures: of all alarms, how many are real?
- At 3% prevalence, a random model gets AUC-PR ≈ 0.03 (the baseline)
- A useful model needs an AUC-PR well above that random baseline
""")

Scenario 3: AUC=0.91 But Recall=0.45 — What Went Wrong?
"Our sepsis model has AUC of 0.91, but when deployed it only catches 45% of sepsis cases. The team is confused — isn't 0.91 excellent?"
from sklearn.metrics import roc_auc_score, recall_score, precision_score
import numpy as np

# Reproduce the pattern on simulated scores (exact values depend on the random draw)
# AUC is high → model ranks positives above negatives correctly
# Recall is low → at the deployed threshold (0.5), most positives are below 0.5
np.random.seed(42)  # fixed seed so the printed numbers are repeatable
y_true = np.array([0]*450 + [1]*50)  # 10% positive
y_proba = np.concatenate([
    np.random.beta(1, 8, 450),  # negatives: mostly low scores
    np.random.beta(2, 3, 50),   # positives: higher on average, but mostly under 0.5
])

auc = roc_auc_score(y_true, y_proba)
y_pred_50 = (y_proba >= 0.50).astype(int)
y_pred_20 = (y_proba >= 0.20).astype(int)

print(f"AUC-ROC: {auc:.3f} (strong ranking quality)")
print("\nAt threshold=0.50:")
print(f"  Recall:    {recall_score(y_true, y_pred_50):.3f}  ← most positives missed")
print(f"  Precision: {precision_score(y_true, y_pred_50, zero_division=0):.3f}")
print("\nAt threshold=0.20:")
print(f"  Recall:    {recall_score(y_true, y_pred_20):.3f}  ← much better")
print(f"  Precision: {precision_score(y_true, y_pred_20, zero_division=0):.3f}")

print("""
Diagnosis: The threshold is wrong, not the model.
AUC=0.91 tells you the model ranks a random positive above a random negative 91% of the time.
But many true positives have scores in the 0.2–0.4 range, which is still higher than
most negatives. That keeps the ranking correct (high AUC), while the default 0.5 cutoff
misses them (low recall).
Fix: tune the threshold on the validation set rather than defaulting to 0.5.
For a sepsis model, find the threshold achieving recall >= 0.85.
""")

Scenario 4: Choosing Between Two Models
"Model A has AUC 0.87, Model B has AUC 0.84. We must catch at least 85% of at-risk patients. Which model should we deploy?"
from sklearn.metrics import precision_recall_curve

# AUC is a global metric, but what matters is performance at the operating point.
# Assumes model_a and model_b are already-fitted classifiers and that
# X_val, y_val are a held-out validation set.
y_proba_a = model_a.predict_proba(X_val)[:, 1]
y_proba_b = model_b.predict_proba(X_val)[:, 1]

def find_recall_constrained_metrics(y_val, y_proba, target_recall=0.85) -> dict:
    """Return precision/recall at the highest threshold that reaches target_recall."""
    precisions, recalls, thresholds = precision_recall_curve(y_val, y_proba)
    # thresholds ascend; precisions/recalls carry one extra trailing element
    # (precision=1, recall=0), so skip it and walk from the highest threshold down.
    for t, p, r in zip(thresholds[::-1], precisions[-2::-1], recalls[-2::-1]):
        if r >= target_recall:
            return {"threshold": t, "precision": p, "recall": r}
    return {"threshold": None, "precision": None, "recall": None}

metrics_a = find_recall_constrained_metrics(y_val, y_proba_a, target_recall=0.85)
metrics_b = find_recall_constrained_metrics(y_val, y_proba_b, target_recall=0.85)

print("At recall >= 0.85:")
print(f"  Model A (AUC=0.87): precision={metrics_a['precision']:.3f}, threshold={metrics_a['threshold']:.3f}")
print(f"  Model B (AUC=0.84): precision={metrics_b['precision']:.3f}, threshold={metrics_b['threshold']:.3f}")

print("""
Key insight: Model A has higher AUC overall,
but what matters for deployment is precision AT the operating point (recall=0.85).
Model B might have higher precision at that specific operating point even with lower AUC,
because AUC integrates over all thresholds, not just the one you'll deploy.
Decision: compare precision at the required recall, not just AUC.
""")

Scenario 5: AUC Drop After Deployment
"Model was validated at AUC 0.89 before deployment. Six months later, AUC dropped to 0.71. What happened?"
import numpy as np

# Possible causes and investigation steps
print("=== AUC Degradation Diagnosis ===\n")
print("Possible causes:")
print("  1. Data drift: patient population has shifted")
print("     (new referring hospitals, seasonal patterns, protocol change)")
print("  2. Concept drift: relationship between features and readmission has changed")
print("     (COVID protocols, new medications, policy changes)")
print("  3. Label drift: how readmissions are coded has changed")
print("  4. Feature pipeline change: upstream data extraction bug")
print("  5. Class balance shift: readmission rate has changed in the live population")
print("     (prevalence alone does not change AUC, but it can signal a case-mix shift)")
print()
print("Investigation steps:")
print("  1. Check feature distributions now vs training time (mean, std, null rates)")
print("  2. Check label distribution: is the base rate still ~15%?")
print("  3. Compare score distributions: are probabilities still well-calibrated?")
print("  4. Run subset analysis: which patient subgroups degraded most?")
print("  5. Check data pipeline logs: any schema changes or ETL failures?")
print()

# Simple drift detection
def detect_feature_drift(X_train: np.ndarray, X_recent: np.ndarray, feature_names: list) -> None:
    """Compare feature means between training and recent production data."""
    print("Feature drift check:")
    print(f"{'Feature':<25} {'Train mean':>12} {'Recent mean':>12} {'Drift':>10}")
    print("-" * 65)
    for i, name in enumerate(feature_names):
        train_mean = X_train[:, i].mean()
        recent_mean = X_recent[:, i].mean()
        # Relative change in the mean; unstable when the training mean is near zero
        drift_pct = abs(recent_mean - train_mean) / (abs(train_mean) + 1e-9) * 100
        flag = " ⚠" if drift_pct > 20 else ""
        print(f"{name:<25} {train_mean:>12.3f} {recent_mean:>12.3f} {drift_pct:>9.1f}%{flag}")

What Interviewers Want to Hear
- AUC = ranking quality, not threshold performance: always clarify this distinction
- Choose ROC vs PR based on class imbalance: prefer PR when positives are rare (roughly under 10-15% of cases)
- Threshold is a separate decision from model selection: tune on validation, evaluate on test
- Compare models at the operating point, not just globally: AUC is an aggregate, so check precision at the target recall
- Monitor AUC post-deployment: investigate the cause of any degradation before blaming the model
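The point about comparing models at the operating point can be illustrated with a toy example. The sketch below uses two hand-built score lists (hypothetical, constructed so the effect is visible) and two small helpers, `pairwise_auc` and `precision_at_recall`, written here for illustration rather than taken from a library. Model A wins on AUC, yet Model B is more precise once recall must reach 0.85.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_recall(y_true, scores, target_recall):
    """Precision and threshold at the highest cutoff whose recall >= target_recall."""
    n_pos = sum(y_true)
    for t in sorted(set(scores), reverse=True):  # try the highest cutoff first
        flagged = [y for y, s in zip(y_true, scores) if s >= t]
        tp = sum(flagged)
        if tp / n_pos >= target_recall:
            return tp / len(flagged), t
    return None, None

# Hypothetical validation set: 4 at-risk patients, 6 not at risk
y_val = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Model A ranks three positives first but buries the fourth near the bottom
scores_a = [0.95, 0.90, 0.85, 0.15, 0.70, 0.60, 0.50, 0.40, 0.30, 0.05]
# Model B puts two negatives on top but keeps all positives close together
scores_b = [0.80, 0.75, 0.70, 0.65, 0.95, 0.90, 0.40, 0.30, 0.20, 0.10]

auc_a, auc_b = pairwise_auc(y_val, scores_a), pairwise_auc(y_val, scores_b)
prec_a, thr_a = precision_at_recall(y_val, scores_a, 0.85)
prec_b, thr_b = precision_at_recall(y_val, scores_b, 0.85)

print(f"Model A: AUC={auc_a:.3f}, precision at recall>=0.85: {prec_a:.3f} (threshold {thr_a})")
print(f"Model B: AUC={auc_b:.3f}, precision at recall>=0.85: {prec_b:.3f} (threshold {thr_b})")
```

With only four positives, reaching recall 0.85 means catching all four. Model A has to lower its cutoff to 0.15 to find the last one, flagging five negatives along the way, while Model B reaches full recall at 0.65 with two.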
Key one-liner: "AUC tells you how well the model ranks positives above negatives — it's model quality. The threshold tells you where to draw the line given your clinical cost structure — that's a deployment decision. Always separate them."
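The split described in that one-liner shows up even in a tiny hand-built example. In the sketch below (hypothetical scores, not model output), every positive outranks every negative, so the ranking is perfect, yet the default 0.5 cutoff still misses half of the positives.

```python
# Hypothetical scores for 6 negatives and 4 positives (illustration only)
neg_scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20]
pos_scores = [0.30, 0.40, 0.60, 0.80]

# Perfect ranking: the lowest positive still beats the highest negative,
# which corresponds to AUC = 1.0
perfect_ranking = min(pos_scores) > max(neg_scores)

def recall_at(threshold):
    """Share of positives flagged when score >= threshold counts as an alarm."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)

print(f"Perfect ranking: {perfect_ranking}")
print(f"Recall at 0.50:  {recall_at(0.50):.2f}")
print(f"Recall at 0.25:  {recall_at(0.25):.2f}")
```

Lowering the cutoff from 0.50 to 0.25 lifts recall from 0.50 to 1.00 without changing the model or its AUC.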