Interview: Choosing the Right Evaluation Metric

The Framework

Before picking a metric, answer these questions:

1. Are the classes balanced?
   No → accuracy is misleading; use AUC-ROC, AUC-PR, or F1

2. Is one error type more costly?
   FN more costly (miss rate): maximize recall, use F2
   FP more costly (false alarm rate): maximize precision, use F0.5
   Equal cost: use F1

3. Do you need a threshold-independent metric?
   Comparing models before deployment: AUC-ROC or AUC-PR
   Fixed deployment threshold: precision, recall, F1 at that threshold

4. Is the class imbalance severe (under 5% positive)?
   AUC-PR is more informative than AUC-ROC (precision and recall ignore true negatives, so a huge negative class cannot inflate the score)

5. Multi-class?
   Macro F1 (care equally about all classes)
   Weighted F1 (weight by class frequency)
   Per-class F1 (always report alongside aggregate)
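
To make question 2 concrete, here is a minimal sketch that computes F-beta directly from precision and recall using the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The helper name `f_beta` and the two operating points are invented for illustration and do not come from any real model.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall; beta > 1 weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical operating points with mirrored precision and recall
high_recall = {"precision": 0.5, "recall": 0.9}
high_precision = {"precision": 0.9, "recall": 0.5}

for beta in (0.5, 1.0, 2.0):
    a = f_beta(high_recall["precision"], high_recall["recall"], beta)
    b = f_beta(high_precision["precision"], high_precision["recall"], beta)
    print(f"beta={beta}: high-recall point={a:.3f}, high-precision point={b:.3f}")
```

F1 cannot tell the two points apart, F2 prefers the high-recall point, and F0.5 prefers the high-precision one. That is why the cost question has to be answered before the metric is chosen.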

Scenario 1: Sepsis Early Warning (ICU)

Python

from sklearn.metrics import recall_score, precision_score, fbeta_score, roc_auc_score

# Class distribution: 8% sepsis (92% non-sepsis)
# Cost asymmetry: missing sepsis = delayed treatment = high mortality
#                 false alarm = extra labs + physician review (manageable)

# Priority: maximize recall (catch as many sepsis cases as possible)

y_proba = model.predict_proba(X_test)[:, 1]

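# Find the highest threshold that still achieves recall >= 0.90.
# precision_recall_curve returns thresholds in increasing order and recall in
# non-increasing order, so the first entry always has recall = 1.0; taking the
# first match would just return the lowest threshold.
import numpy as np
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# precisions and recalls have one more entry than thresholds; ignore the last one
idx = np.where(recalls[:-1] >= 0.90)[0][-1]
t, p, r = thresholds[idx], precisions[idx], recalls[idx]
print(f"Threshold {t:.3f}: recall={r:.3f}, precision={p:.3f}")
y_pred_t = (y_proba >= t).astype(int)
print(f"F2 at this threshold: {fbeta_score(y_test, y_pred_t, beta=2):.3f}")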

# Report: AUC-ROC for model comparison, F2 and recall at deployment threshold
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
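
A recall target is one way to encode the cost asymmetry; another is to put explicit costs on each error type and pick the threshold that minimizes total cost. The sketch below is dependency-free and purely illustrative: the function names, the toy labels and scores, and the 10:1 cost ratio are all assumptions, not clinical figures.

```python
def total_cost(y_true, scores, threshold, fn_cost, fp_cost):
    """Sum of error costs when predicting positive for score >= threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 0:
            cost += fn_cost  # missed positive
        elif label == 0 and predicted == 1:
            cost += fp_cost  # false alarm
    return cost


def cheapest_threshold(y_true, scores, fn_cost, fp_cost):
    """Try every observed score as a threshold; return (threshold, cost) with the lowest cost."""
    candidates = sorted(set(scores))
    return min(
        ((t, total_cost(y_true, scores, t, fn_cost, fp_cost)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Toy labels and scores, invented for illustration
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

# Misses are 10x worse than false alarms (sepsis-like), then the reverse
print(cheapest_threshold(y_true, scores, fn_cost=10, fp_cost=1))
print(cheapest_threshold(y_true, scores, fn_cost=1, fp_cost=10))
```

When misses dominate the cost, the chosen threshold drops so that every positive is caught; when false alarms dominate, it rises. In practice the cost ratio would have to come from the clinical team.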

Scenario 2: Drug-Drug Interaction Alert

Python

# Class distribution: 3% positive interactions (97% irrelevant)
# Cost asymmetry: FP = alert fatigue (physicians stop responding to alerts)
#                 FN = missed interaction = patient harm
# Both are costly — but alert fatigue is a real risk

# Primary metric: AUC-PR (severe imbalance)
# Deployment: tune threshold for precision-recall balance acceptable to clinical team

from sklearn.metrics import average_precision_score

auc_pr = average_precision_score(y_test, y_proba)
print(f"Average Precision (AUC-PR): {auc_pr:.3f}")

# A random classifier's AUC-PR is roughly the base rate (about 0.03 here),
# so even an AUC-PR of 0.30 is about a 10x lift over random

# Deployment decision: clinical team decides acceptable false alarm rate
# e.g., "we can tolerate 1 false alarm per true alert (precision = 0.50)"
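# Recompute the curve for this model's scores, then take the lowest threshold
# that reaches precision >= 0.50 (the lowest such threshold keeps the most recall)
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
for t, p, r in zip(thresholds, precisions, recalls):
    if p >= 0.50:
        print(f"\nAt precision>=0.50: threshold={t:.3f}, precision={p:.3f}, recall={r:.3f}")
        print("Interpretation: at most 1 false alarm fires for every real interaction flagged")
        break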
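
The two pieces of arithmetic used in this scenario can be written out as small helpers. This is only a sketch: the function names are made up, and the lift calculation relies on the usual approximation that a random ranker's average precision is close to the positive base rate.

```python
def false_alarms_per_true_alert(precision: float) -> float:
    """How many false alerts fire, on average, for each true alert."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return (1.0 - precision) / precision


def lift_over_random(average_precision: float, base_rate: float) -> float:
    """Ratio of a model's average precision to the approximate random baseline."""
    if not 0.0 < base_rate <= 1.0:
        raise ValueError("base_rate must be in (0, 1]")
    return average_precision / base_rate


for precision in (0.50, 0.25, 0.10):
    ratio = false_alarms_per_true_alert(precision)
    print(f"precision={precision:.2f}: {ratio:.1f} false alarms per true alert")

print(f"Lift at AP=0.30, base rate=0.03: {lift_over_random(0.30, 0.03):.1f}x")
```

Stating precision as "false alarms per true alert" tends to be easier for a clinical team to reason about than a raw precision value.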

Scenario 3: 30-Day Readmission (Discharge Planning)

Python

# Class distribution: 15% readmitted (85% not)
# Use case: flag high-risk patients for discharge planning intervention
# Cost asymmetry: FN = missed high-risk patient, no intervention → readmission
#                 FP = unnecessary discharge planning (resource cost, not harmful)

# Primary metric: F1 or F2 at deployment threshold
# Model comparison: AUC-ROC

from sklearn.metrics import f1_score, fbeta_score, roc_auc_score, classification_report

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(f"AUC-ROC:     {roc_auc_score(y_test, y_proba):.3f}")
print(f"F1:          {f1_score(y_test, y_pred):.3f}")
print(f"F2:          {fbeta_score(y_test, y_pred, beta=2):.3f}")
print(classification_report(y_test, y_pred, target_names=["no_readmit", "readmit"]))

# Precision at top 20%: if you flag the 20% of patients with the highest risk scores,
# what fraction are actually readmitted? (This is precision@k, not a calibration check.)
import numpy as np

n_top = int(0.20 * len(y_test))
top_idx = np.argsort(y_proba)[-n_top:]
top_precision = np.asarray(y_test)[top_idx].mean()
print(f"\nPrecision in top 20% by risk score: {top_precision:.3f}")
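
When the intervention has fixed capacity (say, discharge planners can only see a set share of patients), recall within the flagged group matters as much as precision. The following is a minimal, dependency-free sketch of both numbers at a given flagging fraction; the function name and the toy data are hypothetical.

```python
def precision_recall_at_top_fraction(y_true, scores, fraction):
    """Flag the top `fraction` of cases by score; return (precision, recall) for that group."""
    n_flagged = max(1, int(fraction * len(scores)))
    # Rank cases from highest to lowest score and keep the flagged slice
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_flagged])
    total_positives = sum(y_true)
    precision = hits / n_flagged
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy data, invented for illustration: 3 readmissions among 10 patients
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

for fraction in (0.2, 0.5):
    p, r = precision_recall_at_top_fraction(y_true, scores, fraction)
    print(f"top {fraction:.0%}: precision={p:.2f}, recall={r:.2f}")
```

Widening the flagged group usually trades precision for recall, so the right fraction depends on how many interventions the team can actually deliver.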

Scenario 4: Multi-Drug Category Classification

Python

# 4-class: anticoagulant / antidiabetic / antihypertensive / antibiotic
# Class distribution: roughly balanced (25% each in training)
# All classes equally important

from sklearn.metrics import f1_score, classification_report

y_pred = model.predict(X_test)

# Macro F1: best when all classes are equally important
# Since the classes are balanced, macro F1 ≈ weighted F1, and both usually track accuracy closely

print(f"Macro F1:    {f1_score(y_test, y_pred, average='macro'):.3f}")
print(f"Accuracy:    {(y_test == y_pred).mean():.3f}")  # OK here since balanced
print()
print(classification_report(y_test, y_pred, target_names=["anticoag", "antidiab", "antihyp", "antibiotic"]))

# If one class is rare (e.g., antibiotic = 5%):
# Macro F1 still appropriate — ensures rare class isn't ignored
# Report per-class F1 separately to show rare class performance
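
To see why macro F1 protects a rare class, it helps to compute both averages by hand on an extreme case. The sketch below is dependency-free and uses made-up labels in which the rare class is never predicted; the helper names are hypothetical, and unlike scikit-learn it only considers labels that appear in `y_true`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, and the mean weighted by class support."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    return macro, weighted


# Hypothetical case: 'antibiotic' is rare and the model never predicts it
y_true = ["antihyp"] * 18 + ["antibiotic"] * 2
y_pred = ["antihyp"] * 20

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"per-class F1: {per_class_f1(y_true, y_pred)}")
print(f"macro F1={macro:.3f}, weighted F1={weighted:.3f}")
```

Weighted F1 stays high because the common class dominates, while macro F1 is pulled down sharply by the ignored class. The per-class figures show exactly which class is responsible.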

Scenario 5: LLM Safety Classifier

Python

# Binary: safe / unsafe content
# Class distribution: 99% safe, 1% unsafe
# Cost asymmetry: FN = harmful output reaches user (severe)
#                 FP = legitimate request blocked (mild inconvenience)

# Primary metric: recall for unsafe class (catch as many harmful outputs as possible)
# Secondary: precision (don't over-block legitimate content)
# Model comparison: AUC-PR (severe imbalance, AUC-ROC can be misleading)

from sklearn.metrics import recall_score, precision_score, average_precision_score, roc_auc_score

y_proba = safety_model.predict_proba(X_test)[:, 1]  # prob of "unsafe"

print(f"AUC-PR:               {average_precision_score(y_test, y_proba):.3f}")
print(f"AUC-ROC:              {roc_auc_score(y_test, y_proba):.3f}")

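# Set threshold to achieve recall >= 0.95 for the unsafe class.
# Recompute the curve for this model and take the highest qualifying threshold
# (the lowest threshold always has recall = 1.0 and would block nearly everything).
import numpy as np
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
idx = np.where(recalls[:-1] >= 0.95)[0][-1]
t, p, r = thresholds[idx], precisions[idx], recalls[idx]
y_pred_t = (y_proba >= t).astype(int)
print(f"\nThreshold for recall>=0.95: {t:.3f} (recall={r:.3f})")
print(f"  Precision: {p:.3f}  (false discovery rate: {1-p:.3f})")
print(f"  Blocking {y_pred_t.mean()*100:.1f}% of all inputs")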

# Accept: some legitimate content gets blocked (FP)
# Minimize: harmful content getting through (FN), especially for medical/financial content;
#           at recall = 0.95, roughly 5% of unsafe content still passes
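
The reason AUC-ROC can look reassuring at 1% prevalence is that the false positive rate is measured against the huge safe class. A small sketch makes the arithmetic explicit by converting a ROC operating point into precision for a given prevalence. The function names and the 0.95 / 0.05 operating point are assumptions chosen for illustration.

```python
def precision_from_rates(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a (TPR, FPR) operating point at a given positive prevalence."""
    true_alerts = prevalence * tpr
    false_alerts = (1.0 - prevalence) * fpr
    flagged = true_alerts + false_alerts
    return true_alerts / flagged if flagged else 0.0


def fraction_blocked(tpr: float, fpr: float, prevalence: float) -> float:
    """Share of all inputs that get flagged at this operating point."""
    return prevalence * tpr + (1.0 - prevalence) * fpr


# Hypothetical operating point: catches 95% of unsafe content, flags 5% of safe content
tpr, fpr = 0.95, 0.05

for prevalence in (0.50, 0.01):
    p = precision_from_rates(tpr, fpr, prevalence)
    blocked = fraction_blocked(tpr, fpr, prevalence)
    print(f"prevalence={prevalence:.2f}: precision={p:.3f}, blocked={blocked:.1%}")
```

The same ROC point that gives high precision on balanced data gives low precision when only 1% of inputs are unsafe, because most flagged items are then safe content. That gap is what AUC-PR surfaces and AUC-ROC hides.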

The Metric Pitfall: Only Reporting One Number

Python

# Interviewers watch for this mistake

# WRONG: "Our model has AUC-ROC of 0.92 so it works great"
# Missing: per-class F1, precision/recall at deployment threshold,
#          baseline comparison, calibration

# CORRECT answer structure:
def report_model_performance(y_test, y_pred, y_proba, class_names=None):
    from sklearn.metrics import (
        roc_auc_score, average_precision_score, f1_score,
        classification_report, precision_score, recall_score
    )
    import numpy as np

    # Assumes binary 0/1 labels with the positive (minority) class encoded as 1
    pos_rate = np.mean(y_test)
    print("=== Model Evaluation ===")
    print(f"Baseline (majority class):     {max(pos_rate, 1 - pos_rate):.3f} accuracy")
    print(f"AUC-ROC:                       {roc_auc_score(y_test, y_proba):.3f}")
    print(f"AUC-PR (avg precision):        {average_precision_score(y_test, y_proba):.3f}")
    print(f"F1 (minority class):           {f1_score(y_test, y_pred):.3f}")
    print(f"Precision:                     {precision_score(y_test, y_pred):.3f}")
    print(f"Recall:                        {recall_score(y_test, y_pred):.3f}")
    print()
    print(classification_report(y_test, y_pred, target_names=class_names))
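
The majority-class baseline in the report is worth a concrete illustration, since it is the quickest way to show why one number is not enough. The sketch below is dependency-free and uses invented predictions for a 5% positive class; the helper names and both "models" are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    return tp / n_pos if n_pos else 0.0


# 100 cases, 5 positives (listed first), 95 negatives
y_true = [1] * 5 + [0] * 95

# Baseline: always predict the majority class
always_negative = [0] * 100

# Hypothetical model: catches 4 of 5 positives at the price of 10 false alarms
useful_model = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85

for name, y_pred in [("majority baseline", always_negative), ("useful model", useful_model)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, recall={recall(y_true, y_pred):.2f}")
```

Judged on accuracy alone, the baseline wins even though it never finds a positive case. Reporting recall and precision next to the baseline makes that failure visible.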

Interview Summary

Q: How do you choose the right evaluation metric?

The choice flows from three questions: Are classes balanced? Is one error type more costly? Do I need a threshold-independent metric for model comparison? For imbalanced clinical data, accuracy is almost always wrong — the majority-class baseline achieves high accuracy by design. For model comparison before setting a threshold, AUC-ROC works for moderate imbalance; AUC-PR is better for severe imbalance (under 5% positive). At deployment, I pick the threshold based on clinical cost: for sepsis detection, I find the threshold achieving recall ≥ 0.90 and report F2. For alert systems where fatigue matters, I find the precision-acceptable threshold and report F0.5 or precision alone. I always report per-class metrics alongside the aggregate, and I always compare against a majority-class dummy baseline — a model that doesn't beat the dummy hasn't learned anything useful.
