AI Systems · Advanced

Interview: ROC-AUC and Threshold Deep Dive

Interview walk-through: explain ROC-AUC to a clinical stakeholder, choose between ROC and PR curves, tune a threshold for a sepsis model, and diagnose a model with excellent AUC but poor real-world recall.

Asma Hafeez Khan · May 16, 2026 · 6 min read
Machine Learning · Interview · ROC · AUC · Threshold · Clinical AI

Scenario 1: Explain AUC to a Clinician

"Our model has an AUC of 0.84. What does that mean in practice?"

Python
def explain_auc_to_clinician(auc: float, positive_class: str = "high-risk") -> str:
    pct = round(auc * 100)   # round, not int: int(0.57 * 100) truncates to 56
    random_pct = 50

    return f"""
AUC = {auc:.2f} means:

If we randomly select one {positive_class} patient and one low-risk patient
and compare their risk scores, the {positive_class} patient gets the higher score
about {pct}% of the time.

A model with no useful information would be right {random_pct}% of the time
(the same as a coin flip between two patients).

So our model is correct {pct - random_pct} percentage points more often than chance.

Clinical implication: the model's risk scores meaningfully separate {positive_class} 
from low-risk patients — which patients get flagged depends on the threshold we choose.
"""

print(explain_auc_to_clinician(0.84, "at-risk for readmission"))
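
The pairwise reading of AUC used in that explanation can be checked directly. The following is a minimal sketch in plain Python; `pairwise_auc` is a hypothetical helper name and the seven scores are made-up illustration values, not output from any real model. Ties count as half a win, which is the usual convention.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up risk scores for illustration only
high_risk = [0.9, 0.8, 0.4]
low_risk = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(high_risk, low_risk)
print(f"Correctly ordered pairs: {auc:.3f}")  # 11 of 12 pairs
```

On the same labels and scores, `sklearn.metrics.roc_auc_score` should return the same value, since ROC-AUC equals this pairwise probability.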

Scenario 2: ROC vs PR — Which Curve?

"You have a model for detecting rare antibiotic-resistant infections — only 3% of patients have it. Should you use ROC or precision-recall?"

Python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulate: 3% positive rate, 1000 patients
np.random.seed(42)
n = 1000
y_true = np.zeros(n, dtype=int)
y_true[:30] = 1   # 3% positive

# Model A: moderate discrimination
y_proba_a = np.concatenate([
    np.random.beta(1, 6, 970),   # negatives
    np.random.beta(3, 2, 30),    # positives
])

# Model B: better discrimination
y_proba_b = np.concatenate([
    np.random.beta(1, 9, 970),
    np.random.beta(7, 1, 30),
])

for name, proba in [("Model A", y_proba_a), ("Model B", y_proba_b)]:
    auc_roc = roc_auc_score(y_true, proba)
    auc_pr  = average_precision_score(y_true, proba)
    print(f"{name}: AUC-ROC={auc_roc:.3f}, AUC-PR={auc_pr:.3f}")

print("""
Answer: Use precision-recall for 3% prevalence.

Why:
- ROC looks at FPR = FP/(FP+TN). With 970 negatives, even 100 FPs gives FPR≈0.10 (looks OK),
  yet with only 30 true positives those 100 false alarms cap precision at 30/130 ≈ 0.23
- PR ignores TN entirely: it directly measures, of all alarms, how many are real
- At 3% prevalence, a random model gets AUC-PR ≈ 0.03 (the baseline)
- Judge AUC-PR by its lift over that 0.03 baseline, not against 0.5 or 1.0
""")

Scenario 3: AUC=0.91 But Recall=0.45 — What Went Wrong?

"Our sepsis model has AUC of 0.91, but when deployed it only catches 45% of sepsis cases. The team is confused — isn't 0.91 excellent?"

Python
from sklearn.metrics import roc_auc_score, recall_score, precision_score
import numpy as np

# Reproduce the situation
# AUC is high -> model ranks positives above negatives correctly
# Recall is low -> at the deployed threshold (0.5), most positives are below 0.5

np.random.seed(42)   # fixed seed so the printed numbers are reproducible
y_true = np.array([0]*450 + [1]*50)   # 10% positive
y_proba = np.concatenate([
    np.random.beta(1, 8, 450),   # negatives: mostly low scores
    np.random.beta(2, 3, 50),    # positives: slightly higher but mostly under 0.5
])

auc = roc_auc_score(y_true, y_proba)
y_pred_50  = (y_proba >= 0.50).astype(int)
y_pred_20  = (y_proba >= 0.20).astype(int)

print(f"AUC-ROC: {auc:.3f}  — excellent ranking quality")
print(f"\nAt threshold=0.50:")
print(f"  Recall:    {recall_score(y_true, y_pred_50):.3f}  ← only 45% caught")
print(f"  Precision: {precision_score(y_true, y_pred_50, zero_division=0):.3f}")
print(f"\nAt threshold=0.20:")
print(f"  Recall:    {recall_score(y_true, y_pred_20):.3f}  ← much better")
print(f"  Precision: {precision_score(y_true, y_pred_20, zero_division=0):.3f}")

print("""
Diagnosis: The threshold is wrong, not the model.

AUC=0.91 means a randomly chosen positive outscores a randomly chosen negative 91% of the time.
But at 10% prevalence the scores tend to skew low: many true positives land in the
0.2–0.4 range, which is still higher than most negatives, so the ranking is correct
(high AUC) while the default 0.5 cutoff misses them (low recall).

Fix: tune the threshold on the validation set instead of defaulting to 0.5.
For a sepsis model, find the threshold achieving recall >= 0.85.
""")

Scenario 4: Choosing Between Two Models

"Model A has AUC 0.87, Model B has AUC 0.84. We must catch at least 85% of at-risk patients. Which model should we deploy?"

Python
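from sklearn.metrics import precision_recall_curve
import numpy as np

# AUC is a global metric, but what matters is performance at the operating point

# Assumes model_a and model_b (fitted classifiers) and X_val, y_val (validation data) already exist
y_proba_a = model_a.predict_proba(X_val)[:, 1]
y_proba_b = model_b.predict_proba(X_val)[:, 1]

def find_recall_constrained_metrics(y_val, y_proba, target_recall=0.85) -> dict:
    """Return the highest threshold whose recall still meets target_recall."""
    precisions, recalls, thresholds = precision_recall_curve(y_val, y_proba)
    # thresholds ascend, so walk them from high to low. precisions and recalls carry one
    # extra trailing point (precision=1, recall=0) with no threshold, hence the [-2::-1] slice.
    for t, p, r in zip(thresholds[::-1], precisions[-2::-1], recalls[-2::-1]):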
        if r >= target_recall:
            return {"threshold": t, "precision": p, "recall": r}
    return {"threshold": None, "precision": None, "recall": None}

metrics_a = find_recall_constrained_metrics(y_val, y_proba_a, target_recall=0.85)
metrics_b = find_recall_constrained_metrics(y_val, y_proba_b, target_recall=0.85)

print("At recall >= 0.85:")
print(f"  Model A (AUC=0.87): precision={metrics_a['precision']:.3f}, threshold={metrics_a['threshold']:.3f}")
print(f"  Model B (AUC=0.84): precision={metrics_b['precision']:.3f}, threshold={metrics_b['threshold']:.3f}")

print("""
Key insight: Model A has higher AUC overall,
but what matters for deployment is precision AT the operating point (recall=0.85).

Model B might have higher precision at that specific operating point even with lower AUC —
because AUC integrates over all thresholds, not just the one you'll deploy.

Decision: compare precision at the required recall, not just AUC.
""")

Scenario 5: AUC Drop After Deployment

"Model was validated at AUC 0.89 before deployment. Six months later, AUC dropped to 0.71. What happened?"

Python
import numpy as np

# Possible causes and investigation steps

print("=== AUC Degradation Diagnosis ===\n")

print("Possible causes:")
print("  1. Data drift — patient population has shifted")
print("     (new referring hospitals, seasonal patterns, protocol change)")
print("  2. Concept drift — relationship between features and readmission has changed")
print("     (COVID protocols, new medications, policy changes)")
print("  3. Label drift — how readmissions are coded has changed")
print("  4. Feature pipeline change — upstream data extraction bug")
print("  5. Class balance shift — readmission rate has changed in the live population")
print()

print("Investigation steps:")
print("  1. Check feature distributions now vs training time (mean, std, null rates)")
print("  2. Check label distribution — is the base rate still ~15%?")
print("  3. Compare score distributions — are probabilities still well-calibrated?")
print("  4. Run subset analysis — which patient subgroups degraded most?")
print("  5. Check data pipeline logs — any schema changes or ETL failures?")
print()

# Simple drift detection
def detect_feature_drift(X_train: np.ndarray, X_recent: np.ndarray, feature_names: list) -> None:
    """Compare feature statistics between training and recent production data."""
    print("Feature drift check:")
    print(f"{'Feature':<25}  {'Train mean':>12}  {'Recent mean':>12}  {'Drift':>10}")
    print("-" * 65)
    for i, name in enumerate(feature_names):
        train_mean  = X_train[:, i].mean()
        recent_mean = X_recent[:, i].mean()
        drift_pct   = abs(recent_mean - train_mean) / (abs(train_mean) + 1e-9) * 100
        flag = "" if drift_pct > 20 else ""
        print(f"{name:<25}  {train_mean:>12.3f}  {recent_mean:>12.3f}  {drift_pct:>9.1f}%{flag}")

What Interviewers Want to Hear

  1. AUC = ranking quality, not threshold performance — always clarify this distinction
  2. Choose ROC vs PR based on class imbalance — PR for severe imbalance (under 10-15%)
  3. Threshold is a separate decision from model selection — tune on val, evaluate on test
  4. Compare models at the operating point, not just globally — AUC is aggregate; check at target recall
  5. Monitor AUC post-deployment — explain degradation before blaming the model

Key one-liner: "AUC tells you how well the model ranks positives above negatives — it's model quality. The threshold tells you where to draw the line given your clinical cost structure — that's a deployment decision. Always separate them."
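
The "clinical cost structure" in that one-liner can be turned into a starting threshold. Assuming the model's probabilities are well calibrated and each error type has a fixed cost, expected cost is minimized by flagging a patient when p >= C_FP / (C_FP + C_FN). The sketch below uses hypothetical cost values and a hypothetical helper name; real costs would have to come from clinical stakeholders.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold minimizing expected cost, assuming calibrated probabilities.

    Flag when the expected cost of missing a case, p * cost_fn, is at least
    the expected cost of a false alarm, (1 - p) * cost_fp.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs: a missed case is treated as 9x worse than a false alarm
print(cost_optimal_threshold(cost_fp=1.0, cost_fn=9.0))   # 0.1
# Equal costs recover the familiar 0.5 default
print(cost_optimal_threshold(cost_fp=1.0, cost_fn=1.0))   # 0.5
```

This is only a starting point: if calibration is poor, tuning the threshold empirically on validation data is the safer route.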
