Sensitivity and Specificity

The Clinical Framing

Sensitivity and specificity are the clinical equivalents of recall and (1 - false positive rate). They appear constantly in medical literature, FDA submissions, and clinical validation studies.

Sensitivity = TP / (TP + FN)  =  Recall
  "Of all patients WHO HAVE the condition, how many did we detect?"

Specificity = TN / (TN + FP)  =  1 - False Positive Rate
  "Of all patients WHO DON'T HAVE the condition, how many did we correctly clear?"

Computing Both

Python

from sklearn.metrics import confusion_matrix, recall_score
import numpy as np

# Warfarin bleeding risk model
y_true = np.array([0]*170 + [1]*30)
y_pred = np.array([0]*158 + [1]*12 + [0]*7 + [1]*23)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # same as recall
specificity = tn / (tn + fp)

print(f"Sensitivity: {sensitivity:.3f}  — {tp}/{tp+fn} high-risk patients detected")
print(f"Specificity: {specificity:.3f}  — {tn}/{tn+fp} safe patients correctly cleared")
print(f"Recall:      {recall_score(y_true, y_pred):.3f}  — same as sensitivity")

The Tradeoff

Sensitivity and specificity trade off against each other as you change the classification threshold. There is no threshold that maximizes both simultaneously.

Python

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

print(f"{'Threshold':>10}  {'Sensitivity':>12}  {'Specificity':>12}")
print("-" * 40)
for threshold in np.arange(0.1, 0.95, 0.1):
    y_pred_t = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t).ravel()
    sens = tp/(tp+fn) if (tp+fn) > 0 else 0
    spec = tn/(tn+fp) if (tn+fp) > 0 else 0
    print(f"{threshold:>10.1f}  {sens:>12.3f}  {spec:>12.3f}")

# Low threshold:  high sensitivity, low specificity (catch everything, many false alarms)
# High threshold: low sensitivity, high specificity (very selective, miss some real cases)

Sensitivity for Screening, Specificity for Confirmation

Medical testing strategy:

Phase 1 — Screening: high sensitivity test
  Goal: don't miss any true cases — cast a wide net
  Accept: more false positives (some healthy patients flagged)
  Example: mammography, HbA1c screening for diabetes

Phase 2 — Confirmation: high specificity test
  Goal: confirm only true cases — reduce false alarms
  Accept: slightly lower sensitivity (some true cases that pass screening missed)
  Example: biopsy, glucose tolerance test, genetic confirmation

This "rule out / rule in" strategy is the basis of many clinical testing protocols.
ML models used in screening should optimize for sensitivity.
ML models used in diagnosis confirmation should optimize for specificity.

Youden's J Index: Optimal Threshold

Python

# Youden's J = Sensitivity + Specificity - 1
# Maximizing J picks the threshold that balances both optimally

import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

fpr_vals, tpr_vals, thresholds = roc_curve(y_test, y_proba)

# tpr = sensitivity, 1-fpr = specificity
j_scores = tpr_vals - fpr_vals   # = sensitivity + specificity - 1
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

sensitivity_at_best = tpr_vals[best_idx]
specificity_at_best = 1 - fpr_vals[best_idx]

print(f"Optimal threshold (Youden's J): {best_threshold:.3f}")
print(f"  Sensitivity: {sensitivity_at_best:.3f}")
print(f"  Specificity: {specificity_at_best:.3f}")
print(f"  Youden's J:  {j_scores[best_idx]:.3f}")

ROC Curve: All Thresholds at Once

Python

from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# FPR = 1 - Specificity
# TPR = Sensitivity

print("ROC curve (Sensitivity vs 1-Specificity):")
print(f"{'FPR (1-Spec)':>14}  {'TPR (Sens)':>12}  {'Threshold':>10}")
print("-" * 42)
for f, t, th in zip(fpr[::5], tpr[::5], thresholds[::5]):
    print(f"{f:>14.3f}  {t:>12.3f}  {th:>10.3f}")

print(f"\nAUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# AUC = probability that model ranks a random positive above a random negative
# AUC = 0.5: no better than random
# AUC = 1.0: perfect ranking

Predictive Values vs Sensitivity/Specificity

Python

# Sensitivity and specificity are properties of the TEST (or model)
# They don't depend on prevalence

# Positive Predictive Value (PPV = precision) and Negative Predictive Value (NPV)
# depend on prevalence — the same test has different PPV in different populations

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
prevalence = (tp + fn) / (tp + fp + tn + fn)

ppv = tp / (tp + fp)   # precision — depends on prevalence
npv = tn / (tn + fn)   # depends on prevalence

print(f"Prevalence:              {prevalence:.3f}")
print(f"Sensitivity (fixed):     {tp/(tp+fn):.3f}  — property of the model")
print(f"Specificity (fixed):     {tn/(tn+fp):.3f}  — property of the model")
print(f"PPV (depends on prev):   {ppv:.3f}  — varies by population")
print(f"NPV (depends on prev):   {npv:.3f}  — varies by population")

The Prevalence Trap

Python

# A model with 90% sensitivity and 90% specificity looks great
# But PPV can be very low in low-prevalence diseases

def compute_ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

# Rare disease: 1% prevalence
ppv_rare = compute_ppv(sensitivity=0.90, specificity=0.90, prevalence=0.01)
print(f"PPV for 1% prevalence:  {ppv_rare:.3f}")
# Only 8.3% — 9 out of 10 positive test results are false alarms

# Common disease: 30% prevalence
ppv_common = compute_ppv(sensitivity=0.90, specificity=0.90, prevalence=0.30)
print(f"PPV for 30% prevalence: {ppv_common:.3f}")
# 79.4% — much more useful in practice

# This is why screening a general population for a rare disease is hard
# even with a "good" model

Interview Answer Template

Q: What is the difference between sensitivity and specificity?

Sensitivity (or recall) measures how many actual positive cases a test catches: TP / (TP + FN). Specificity measures how many actual negative cases a test correctly clears: TN / (TN + FP). They trade off — you can't maximize both simultaneously. The classic medical strategy is to use a high-sensitivity test for initial screening (cast a wide net, accept some false alarms) and a high-specificity test for confirmation (rule in only true cases, accept some missed). An important subtlety: sensitivity and specificity are properties of the model and don't change with prevalence, but positive predictive value (precision) does. A model with 90% sensitivity and 90% specificity only has 8% PPV in a population where the disease occurs 1% of the time — 9 out of 10 alarms are false. This is the prevalence trap that clinical ML teams must account for when deploying in a new patient population.

Sensitivity and Specificity

The Clinical Framing

Computing Both

The Tradeoff

Sensitivity for Screening, Specificity for Confirmation

Youden's J Index: Optimal Threshold

ROC Curve: All Thresholds at Once

Predictive Values vs Sensitivity/Specificity

The Prevalence Trap

Interview Answer Template

Enjoyed this article?

Leave a comment