Machine Learning Foundations · Lesson 53 of 70

AUC: What It Measures and What It Doesn't

The Probabilistic Definition

AUC-ROC has a precise, intuitive probabilistic interpretation that is often overlooked:

AUC-ROC = P(score(positive) > score(negative))

= The probability that a randomly chosen positive example is assigned
  a higher score than a randomly chosen negative example

AUC = 0.50: random — a positive is ranked above a negative 50% of the time
AUC = 0.80: good — 80% of randomly chosen positive-negative pairs are ranked correctly
AUC = 1.00: perfect — every positive is scored above every negative
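Written out formally, with tied scores given half credit (a common convention, and an assumption of this sketch), the definition and its empirical estimate look like this. The symbols s, x⁺, x⁻, n₊ and n₋ are notation introduced here for illustration: s is the model's score, and n₊ and n₋ are the numbers of positive and negative examples.

```latex
\mathrm{AUC} = P\big(s(x^+) > s(x^-)\big) + \tfrac{1}{2}\,P\big(s(x^+) = s(x^-)\big)

\widehat{\mathrm{AUC}} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-}
  \Big( \mathbf{1}\big[s(x_i^+) > s(x_j^-)\big] + \tfrac{1}{2}\,\mathbf{1}\big[s(x_i^+) = s(x_j^-)\big] \Big)
```

For continuous scores ties have probability zero, so the second term vanishes and the expression reduces to the definition above.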

Demonstrating the Probabilistic Interpretation

Python

import numpy as np
from sklearn.metrics import roc_auc_score

np.random.seed(42)

# Drug-drug interaction model
# Positive: actual interaction, negative: no interaction
y_test   = np.array([0]*180 + [1]*20)
y_proba  = np.random.beta(2, 5, 180).tolist() + np.random.beta(5, 2, 20).tolist()
y_proba  = np.array(y_proba)

# sklearn AUC
auc = roc_auc_score(y_test, y_proba)
print(f"sklearn AUC: {auc:.4f}")

# Manual verification: count pairs where positive > negative
def manual_auc(y_true, y_proba):
    positives = y_proba[y_true == 1]
    negatives = y_proba[y_true == 0]
    correct = 0
    ties = 0
    total = len(positives) * len(negatives)
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1
            elif p == n:
                ties += 1
    return (correct + 0.5 * ties) / total

print(f"Manual AUC:  {manual_auc(y_test, y_proba):.4f}")
# Should match sklearn: AUC is the fraction of positive-negative pairs ranked correctly,
# with tied pairs counted as half
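The nested loop above compares every positive with every negative, which gets slow on large test sets. The same quantity can be computed from ranks, because it equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. The sketch below is one way to do that, assuming SciPy is available. The function name `rank_auc`, the seed, and the synthetic scores are illustrative choices, not part of the lesson's code.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def rank_auc(y_true, y_score):
    """AUC from the rank sum of the positives (Mann-Whitney U / (n_pos * n_neg))."""
    y_true = np.asarray(y_true)
    ranks = rankdata(y_score)            # tied scores share their average rank
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    # Subtract the smallest rank sum the positives could possibly have
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

rng = np.random.default_rng(0)           # arbitrary seed for the illustration
y_demo = np.array([0] * 180 + [1] * 20)
s_demo = np.concatenate([rng.beta(2, 5, 180), rng.beta(5, 2, 20)])

print(f"Rank-based AUC: {rank_auc(y_demo, s_demo):.4f}")
print(f"sklearn AUC:    {roc_auc_score(y_demo, s_demo):.4f}")
```

Average ranks give each tied pair half credit, so this agrees with the pairwise count when scores are tied.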

Why AUC is Threshold-Independent

Python

from sklearn.metrics import accuracy_score

# AUC depends only on how the scores RANK the examples, not on any threshold.
# Both score sets below place every positive above every negative, so both have
# AUC = 1.0, even though the "compressed" scores all sit close to 0.5.

y_proba_good  = np.array([0.02, 0.05, 0.1, 0.3, 0.8, 0.92, 0.95])
y_proba_bad   = np.array([0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65])
y_true_demo   = np.array([0, 0, 0, 0, 1, 1, 1])

auc_good = roc_auc_score(y_true_demo, y_proba_good)
auc_bad  = roc_auc_score(y_true_demo, y_proba_bad)

print(f"Well-separated scores AUC: {auc_good:.3f}  (positives far above negatives)")
print(f"Compressed scores AUC:     {auc_bad:.3f}  (same ordering, smaller margins)")

# At threshold 0.5:
for name, proba in [("well-separated", y_proba_good), ("compressed", y_proba_bad)]:
    pred = (proba >= 0.5).astype(int)
    print(f"Accuracy at 0.5 threshold ({name}): {accuracy_score(y_true_demo, pred):.3f}")

# Both AUCs are 1.000, yet accuracy at 0.5 differs (1.000 vs 0.857): the negative
# scored exactly 0.5 is predicted positive under the compressed scores.
# AUC summarizes the whole ranking; accuracy at a threshold measures one operating point.
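Another way to see that only the ranking matters: apply any strictly increasing transformation to the scores. The ordering is preserved, so AUC does not move, while accuracy at a fixed threshold can change. The following is a minimal sketch with synthetic scores. The seed, sample sizes, and the choice of cubing as the transformation are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)                     # arbitrary seed
y = np.array([0] * 150 + [1] * 50)
scores = np.concatenate([rng.beta(2, 5, 150), rng.beta(5, 2, 50)])
transformed = scores ** 3                          # strictly increasing on [0, 1]

auc_raw = roc_auc_score(y, scores)
auc_cubed = roc_auc_score(y, transformed)
acc_raw = accuracy_score(y, (scores >= 0.5).astype(int))
acc_cubed = accuracy_score(y, (transformed >= 0.5).astype(int))

print(f"AUC      raw: {auc_raw:.4f}   cubed: {auc_cubed:.4f}")
print(f"Acc@0.5  raw: {acc_raw:.4f}   cubed: {acc_cubed:.4f}")
```

This is also why AUC says nothing about calibration: cubing the scores distorts the probabilities without changing the metric.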

AUC-ROC vs AUC-PR

Python

from sklearn.metrics import roc_auc_score, average_precision_score
import numpy as np

# Simulate imbalanced clinical dataset: 5% positive (rare adverse event)
np.random.seed(42)
n = 1000
y_true_imb = np.zeros(n, dtype=int)
y_true_imb[:50] = 1   # 5% positive

# Two models with different score distributions
y_proba_model_a = np.concatenate([
    np.random.beta(1, 9, 950),   # negatives: concentrated near 0
    np.random.beta(3, 2, 50),    # positives: somewhat higher
])

y_proba_model_b = np.concatenate([
    np.random.beta(2, 5, 950),   # negatives: slightly less concentrated
    np.random.beta(5, 2, 50),    # positives: clearly higher
])

auc_roc_a = roc_auc_score(y_true_imb, y_proba_model_a)
auc_pr_a  = average_precision_score(y_true_imb, y_proba_model_a)

auc_roc_b = roc_auc_score(y_true_imb, y_proba_model_b)
auc_pr_b  = average_precision_score(y_true_imb, y_proba_model_b)

print("Imbalanced dataset (5% positive):")
print(f"{'':>12}  {'AUC-ROC':>10}  {'AUC-PR':>10}")
print(f"{'Model A':>12}  {auc_roc_a:>10.3f}  {auc_pr_a:>10.3f}")
print(f"{'Model B':>12}  {auc_roc_b:>10.3f}  {auc_pr_b:>10.3f}")
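To isolate the effect of class imbalance, it can help to hold the score distributions of each class fixed and change only the class mix. In the sketch below, the helper `roc_and_pr`, the sample sizes, and the Beta parameters are hypothetical choices made for illustration. Under these assumptions AUC-ROC should stay roughly constant, because TPR and FPR are each computed within one class, while average precision should fall as positives become rarer.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def roc_and_pr(n_pos, n_neg, rng):
    """Hypothetical helper: same score distributions, caller chooses the class mix."""
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    scores = np.concatenate([rng.beta(5, 2, n_pos), rng.beta(2, 5, n_neg)])
    return roc_auc_score(y, scores), average_precision_score(y, scores)

rng = np.random.default_rng(0)                            # arbitrary seed
roc_balanced, pr_balanced = roc_and_pr(2000, 2000, rng)   # 50% positive
roc_rare, pr_rare = roc_and_pr(200, 19800, rng)           # 1% positive

print(f"{'':>14}  {'AUC-ROC':>8}  {'AUC-PR':>8}")
print(f"{'50% positive':>14}  {roc_balanced:>8.3f}  {pr_balanced:>8.3f}")
print(f"{'1% positive':>14}  {roc_rare:>8.3f}  {pr_rare:>8.3f}")
```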

Partial AUC (pAUC): When Only Part of the Curve Matters

Python

from sklearn.metrics import roc_auc_score

# In clinical applications, you often only care about part of the ROC curve
# Example: "We can only tolerate a max false positive rate of 10%"
# → Measure AUC only in the region FPR ∈ [0, 0.10]

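# sklearn supports partial AUC via the max_fpr parameter (binary targets only).
# `model` is assumed to be a fitted binary classifier, with X_test / y_test as its held-out data.
y_proba = model.predict_proba(X_test)[:, 1]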

# Full AUC
auc_full = roc_auc_score(y_test, y_proba)

# Partial AUC: only where FPR < 0.10 (high specificity region)
auc_partial = roc_auc_score(y_test, y_proba, max_fpr=0.10)

print(f"Full AUC:           {auc_full:.3f}")
print(f"Partial AUC (FPR<0.10): {auc_partial:.3f}")
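# With max_fpr set, sklearn returns the standardized partial AUC (McClish correction):
# the area over FPR in [0, 0.10] is rescaled so that 0.5 means chance and 1.0 means perfect.
# Prefer this metric when the model will only ever operate in the high-specificity region,
# such as clinical triage with a hard cap on false positives.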
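The value sklearn returns when `max_fpr` is set is the standardized partial AUC, not the raw area. The sketch below reconstructs that number by hand to show where it comes from: cut the ROC curve at `max_fpr`, take the trapezoidal area, then rescale between the area a chance-level model would get and the maximum possible area. The function name `standardized_partial_auc` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

def standardized_partial_auc(y_true, y_score, max_fpr):
    """Partial area over FPR in [0, max_fpr], rescaled so chance = 0.5 and perfect = 1.0."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # Index of the first ROC point whose FPR exceeds max_fpr
    stop = np.searchsorted(fpr, max_fpr, side="right")
    # Linearly interpolate the TPR at exactly max_fpr
    tpr_at_max = np.interp(max_fpr, fpr[stop - 1:stop + 1], tpr[stop - 1:stop + 1])
    fpr_cut = np.append(fpr[:stop], max_fpr)
    tpr_cut = np.append(tpr[:stop], tpr_at_max)
    raw_area = auc(fpr_cut, tpr_cut)       # trapezoidal area under the truncated curve
    min_area = 0.5 * max_fpr ** 2          # area under the chance diagonal on [0, max_fpr]
    max_area = max_fpr                     # area under a perfect curve on [0, max_fpr]
    return 0.5 * (1 + (raw_area - min_area) / (max_area - min_area))

rng = np.random.default_rng(0)             # arbitrary seed
y_demo = np.array([0] * 180 + [1] * 20)
s_demo = np.concatenate([rng.beta(2, 5, 180), rng.beta(5, 2, 20)])

manual_pauc = standardized_partial_auc(y_demo, s_demo, max_fpr=0.10)
sklearn_pauc = roc_auc_score(y_demo, s_demo, max_fpr=0.10)
print(f"Manual standardized pAUC:  {manual_pauc:.4f}")
print(f"sklearn max_fpr=0.10:      {sklearn_pauc:.4f}")
```

The raw area alone can never exceed `max_fpr`, which makes it hard to read. The rescaling puts it back on the familiar 0.5 to 1.0 scale.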

AUC for Multi-Class

Python

from sklearn.metrics import roc_auc_score

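# Multi-class: OvR (one vs rest) computes AUC for each class against all others.
# `model` is assumed to be a fitted multi-class classifier; y_test_multiclass holds the true labels.
# Each row of the predict_proba output must sum to 1.
y_proba_multiclass = model.predict_proba(X_test)   # shape: (n_samples, n_classes)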

# Average OvR AUC (macro: equal weight per class)
auc_macro = roc_auc_score(y_test_multiclass, y_proba_multiclass, multi_class="ovr", average="macro")
print(f"Multi-class AUC (OvR, macro): {auc_macro:.3f}")

# Weighted OvR AUC
auc_weighted = roc_auc_score(y_test_multiclass, y_proba_multiclass, multi_class="ovr", average="weighted")
print(f"Multi-class AUC (OvR, weighted): {auc_weighted:.3f}")

# OvO (one vs one) — average AUC across all class pairs
auc_ovo = roc_auc_score(y_test_multiclass, y_proba_multiclass, multi_class="ovo", average="macro")
print(f"Multi-class AUC (OvO, macro): {auc_ovo:.3f}")
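To make the OvR averaging concrete, the sketch below computes one binary AUC per class (that class against all others) and then averages them, once with equal weights and once weighted by class frequency. The synthetic labels and softmax probabilities stand in for a real model's output and are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)                      # arbitrary seed
n_samples, n_classes = 600, 3
y = rng.integers(0, n_classes, size=n_samples)

# Fake "model output": noisy logits nudged toward the true class, then softmax
logits = rng.normal(size=(n_samples, n_classes))
logits[np.arange(n_samples), y] += 1.5
proba = np.exp(logits)
proba /= proba.sum(axis=1, keepdims=True)           # rows sum to 1

# One binary AUC per class: class c versus everything else
per_class = [roc_auc_score((y == c).astype(int), proba[:, c]) for c in range(n_classes)]

manual_macro = float(np.mean(per_class))
manual_weighted = float(np.average(per_class, weights=np.bincount(y)))

sk_macro = roc_auc_score(y, proba, multi_class="ovr", average="macro")
sk_weighted = roc_auc_score(y, proba, multi_class="ovr", average="weighted")

print(f"Per-class OvR AUCs: {np.round(per_class, 3)}")
print(f"Macro:    manual {manual_macro:.4f}  sklearn {sk_macro:.4f}")
print(f"Weighted: manual {manual_weighted:.4f}  sklearn {sk_weighted:.4f}")
```

Printing the per-class values is often more informative than either average, since a single weak class can hide behind a good overall number.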

Explaining AUC to Non-Technical Stakeholders

Python

# Clinical framing (plain language instead of probability notation)

def explain_auc_clinically(auc: float, positive_label: str = "high-risk") -> str:
    """
    Translate AUC into a clinically understandable statement.
    """
    pct = round(auc * 100)  # round(), not int(): int(0.57 * 100) truncates to 56
    return (
        f"If we randomly pick one patient who is {positive_label} and one who is not, "
        f"the model correctly tells them apart "
        f"{pct}% of the time.\n"
        f"(A doctor guessing randomly would be correct 50% of the time.)"
    )

print(explain_auc_clinically(0.85, "at risk for readmission"))
print(explain_auc_clinically(0.72, "at risk for bleeding"))

AUC Thresholds for Clinical Acceptance

AUC-ROC interpretation for clinical ML (rules of thumb, not formal standards):

Under 0.60:  Barely better than random, or worse than random below 0.50; not useful
0.60 – 0.70: Weak discrimination; may be worth exploring, but not deployable
0.70 – 0.80: Moderate; useful as one input to clinical decisions
0.80 – 0.90: Good; strong discriminative ability
Over 0.90:   Excellent; check for data leakage before trusting the result

For AUC-PR (imbalanced data):
  Compare to the baseline: a random model's average precision ≈ prevalence
  If prevalence = 0.05, the baseline is about 0.05; as a rough guide, a useful model should reach 0.20 or higher
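The baseline claim is easy to check empirically. The sketch below scores every example with pure noise and compares the resulting metrics with the 0.5 and prevalence baselines. The sample size, seed, and variable names are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)                 # arbitrary seed
n = 20_000
target_prevalence = 0.05

y = (rng.random(n) < target_prevalence).astype(int)
random_scores = rng.random(n)                  # scores carry no information about y

observed_prevalence = y.mean()
baseline_roc = roc_auc_score(y, random_scores)
baseline_pr = average_precision_score(y, random_scores)

print(f"Observed prevalence:     {observed_prevalence:.3f}")
print(f"Random-model AUC-ROC:    {baseline_roc:.3f}   (baseline 0.5)")
print(f"Random-model AUC-PR:     {baseline_pr:.3f}   (baseline = prevalence)")
```

Because the AUC-PR baseline moves with prevalence, an AUC-PR value is only meaningful when reported next to the prevalence of the dataset it was measured on.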

Interview Answer Template

Q: What does AUC actually measure?

AUC-ROC has a precise probabilistic interpretation: it's the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. An AUC of 0.85 means the model correctly ranks a random positive-negative pair 85% of the time. This makes AUC threshold-independent — you're measuring the quality of the probability ranking, not the performance at any specific threshold. The key advantage: you can compare models before choosing a deployment threshold. The important limitation: for severely imbalanced datasets (under 10% positive), AUC-ROC can look deceptively good because FPR has a large denominator of true negatives. In those cases, I use AUC-PR (average precision), which only uses TP, FP, and FN — it's entirely focused on the positive class and doesn't benefit from a large true-negative pool.
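A short worked example can back up the imbalance point in an interview. The numbers below are hypothetical: 100 positives, 10,000 negatives, and an operating point with 80% TPR and 5% FPR. That point looks strong on a ROC curve, yet most of the flagged cases are false alarms.

```python
# Hypothetical operating point on an imbalanced dataset (about 1% positive)
n_pos, n_neg = 100, 10_000
tpr, fpr = 0.80, 0.05

tp = tpr * n_pos            # 80 true positives
fp = fpr * n_neg            # 500 false positives: 5% of a very large negative pool
precision = tp / (tp + fp)  # 80 / 580

print(f"TP = {tp:.0f}, FP = {fp:.0f}")
print(f"Precision = {precision:.3f}")   # about 0.138
```

ROC analysis only sees the 5% false positive rate. Precision-recall analysis sees that fewer than one in seven alerts is real.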
