Sensitivity and Specificity
Sensitivity (recall) and specificity: clinical definitions, formulas, the sensitivity-specificity tradeoff, Youden's J, and why medical tests prioritize sensitivity for screening and specificity for confirmation.
The Clinical Framing
Sensitivity and specificity are the clinical equivalents of recall and (1 - false positive rate). They appear constantly in medical literature, FDA submissions, and clinical validation studies.
Sensitivity = TP / (TP + FN) = Recall
"Of all patients WHO HAVE the condition, how many did we detect?"
Specificity = TN / (TN + FP) = 1 - False Positive Rate
"Of all patients WHO DON'T HAVE the condition, how many did we correctly clear?"
Computing Both
from sklearn.metrics import confusion_matrix, recall_score
import numpy as np
# Warfarin bleeding risk model
y_true = np.array([0]*170 + [1]*30)
y_pred = np.array([0]*158 + [1]*12 + [0]*7 + [1]*23)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn) # same as recall
specificity = tn / (tn + fp)
print(f"Sensitivity: {sensitivity:.3f} ({tp}/{tp+fn} high-risk patients detected)")
print(f"Specificity: {specificity:.3f} ({tn}/{tn+fp} safe patients correctly cleared)")
print(f"Recall: {recall_score(y_true, y_pred):.3f} (same as sensitivity)")
The Tradeoff
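Sensitivity and specificity trade off against each other as you change the classification threshold: lowering the threshold raises sensitivity and lowers specificity, and raising it does the reverse. Unless the model separates the two classes perfectly, no single threshold maximizes both.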
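As a concrete illustration that needs no fitted model, the sketch below sweeps a threshold over ten hand-made risk scores. The labels, scores, and the helper name `sens_spec` are hypothetical and chosen only to make the pattern visible; they are not real patient data.

```python
# Hypothetical risk scores for 10 patients (illustrative values only).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.65, 0.30, 0.55, 0.75, 0.90]


def sens_spec(labels, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


for threshold in (0.2, 0.4, 0.6, 0.8):
    sens, spec = sens_spec(labels, scores, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, sensitivity falls from 1.00 to 0.25 as the threshold rises from 0.2 to 0.8, while specificity climbs from 0.33 to 1.00.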
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
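# Assumes X_train, y_train, X_test, and y_test are already defined,
# with binary labels where 1 means the condition is present.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1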
print(f"{'Threshold':>10} {'Sensitivity':>12} {'Specificity':>12}")
print("-" * 40)
for threshold in np.arange(0.1, 0.95, 0.1):
    y_pred_t = (y_proba >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t, labels=[0, 1]).ravel()
    sens = tp / (tp + fn) if (tp + fn) > 0 else 0
    spec = tn / (tn + fp) if (tn + fp) > 0 else 0
    print(f"{threshold:>10.1f} {sens:>12.3f} {spec:>12.3f}")
# Low threshold: high sensitivity, low specificity (catch everything, many false alarms)
# High threshold: low sensitivity, high specificity (very selective, miss some real cases)
Sensitivity for Screening, Specificity for Confirmation
Medical testing strategy:
Phase 1 (Screening): high-sensitivity test
  Goal: don't miss any true cases; cast a wide net
  Accept: more false positives (some healthy patients flagged)
  Example: mammography, HbA1c screening for diabetes
Phase 2 (Confirmation): high-specificity test
  Goal: confirm only true cases; reduce false alarms
  Accept: slightly lower sensitivity (a few true cases flagged by screening may be missed)
  Example: biopsy, glucose tolerance test, genetic confirmation
This "rule out / rule in" strategy is the basis of many clinical testing protocols.
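To make the two-phase strategy concrete, here is a minimal sketch of how sensitivity and specificity combine when a patient is called positive only if both tests are positive. It assumes the two tests make errors independently given the patient's true status, which real tests often violate. The function name `serial_testing` and the operating points are hypothetical.

```python
def serial_testing(sens_screen, spec_screen, sens_confirm, spec_confirm):
    """Net (sensitivity, specificity) when a patient is called positive only
    if BOTH the screening test and the confirmatory test are positive.

    Assumes the two tests err independently given true disease status.
    """
    # A true case must be caught by both tests.
    net_sens = sens_screen * sens_confirm
    # A healthy patient is wrongly called positive only if both tests err.
    net_spec = 1 - (1 - spec_screen) * (1 - spec_confirm)
    return net_sens, net_spec


# Hypothetical operating points, chosen only for illustration.
net_sens, net_spec = serial_testing(
    sens_screen=0.95, spec_screen=0.80, sens_confirm=0.90, spec_confirm=0.98
)
print(f"Net sensitivity: {net_sens:.3f}")
print(f"Net specificity: {net_spec:.3f}")
```

Under these assumptions the combined pathway is more specific than either test alone and less sensitive than either. This is the cost the confirmation phase accepts.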
ML models used for screening should prioritize sensitivity.
ML models used for diagnostic confirmation should prioritize specificity.
Youden's J Index: Optimal Threshold
# Youden's J = Sensitivity + Specificity - 1
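# Maximizing J picks the threshold that weights sensitivity and specificity equally;
# it is "optimal" only if a missed case and a false alarm are considered equally costly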
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve
fpr_vals, tpr_vals, thresholds = roc_curve(y_test, y_proba)
# tpr = sensitivity, 1-fpr = specificity
j_scores = tpr_vals - fpr_vals # = sensitivity + specificity - 1
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]
sensitivity_at_best = tpr_vals[best_idx]
specificity_at_best = 1 - fpr_vals[best_idx]
print(f"Optimal threshold (Youden's J): {best_threshold:.3f}")
print(f" Sensitivity: {sensitivity_at_best:.3f}")
print(f" Specificity: {specificity_at_best:.3f}")
print(f" Youden's J: {j_scores[best_idx]:.3f}")
ROC Curve: All Thresholds at Once
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# FPR = 1 - Specificity
# TPR = Sensitivity
print("ROC curve (Sensitivity vs 1-Specificity):")
print(f"{'FPR (1-Spec)':>14} {'TPR (Sens)':>12} {'Threshold':>10}")
print("-" * 42)
# thresholds[0] is a sentinel above every score (inf in recent scikit-learn
# versions); it corresponds to the (0, 0) point of the curve
for f, t, th in zip(fpr[::5], tpr[::5], thresholds[::5]):
    print(f"{f:>14.3f} {t:>12.3f} {th:>10.3f}")
print(f"\nAUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# AUC = probability that model ranks a random positive above a random negative
# AUC = 0.5: no better than random
# AUC = 1.0: perfect ranking
Predictive Values vs Sensitivity/Specificity
# Sensitivity and specificity are properties of the TEST (or model).
# In principle they don't depend on prevalence.
# Positive Predictive Value (PPV = precision) and Negative Predictive Value (NPV)
# depend on prevalence: the same test has a different PPV in different populations.
# Reuses y_true and y_pred from the warfarin example above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
prevalence = (tp + fn) / (tp + fp + tn + fn)
ppv = tp / (tp + fp)  # precision; depends on prevalence
npv = tn / (tn + fn)  # depends on prevalence
print(f"Prevalence: {prevalence:.3f}")
print(f"Sensitivity (fixed): {tp/(tp+fn):.3f} (property of the model)")
print(f"Specificity (fixed): {tn/(tn+fp):.3f} (property of the model)")
print(f"PPV (depends on prev): {ppv:.3f} (varies by population)")
print(f"NPV (depends on prev): {npv:.3f} (varies by population)")
The Prevalence Trap
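The dependence on prevalence follows from Bayes' rule. The symbols here are introduced for this note and are not from the original text: Se is sensitivity, Sp is specificity, and p is prevalence.

```latex
\mathrm{PPV} = \frac{\mathrm{Se}\cdot p}{\mathrm{Se}\cdot p + (1-\mathrm{Sp})(1-p)},
\qquad
\mathrm{NPV} = \frac{\mathrm{Sp}\,(1-p)}{\mathrm{Sp}\,(1-p) + (1-\mathrm{Se})\,p}
```

As p shrinks, the true-positive term Se·p shrinks with it, while the false-positive term (1 − Sp)(1 − p) stays close to 1 − Sp. False positives therefore dominate the positive results even when Se and Sp are both high.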
# A model with 90% sensitivity and 90% specificity looks great,
# but PPV can be very low for low-prevalence diseases
def compute_ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence              # true positives, as a fraction of the population
    fp = (1 - specificity) * (1 - prevalence)  # false positives, as a fraction of the population
    return tp / (tp + fp)
# Rare disease: 1% prevalence
ppv_rare = compute_ppv(sensitivity=0.90, specificity=0.90, prevalence=0.01)
print(f"PPV for 1% prevalence: {ppv_rare:.3f}")
# Only 8.3%: more than 9 out of 10 positive test results are false alarms
# Common disease: 30% prevalence
ppv_common = compute_ppv(sensitivity=0.90, specificity=0.90, prevalence=0.30)
print(f"PPV for 30% prevalence: {ppv_common:.3f}")
# 79.4%: much more useful in practice
# This is why screening a general population for a rare disease is hard
# even with a "good" model
Interview Answer Template
Q: What is the difference between sensitivity and specificity?
Sensitivity (or recall) measures how many actual positive cases a test catches: TP / (TP + FN). Specificity measures how many actual negative cases a test correctly clears: TN / (TN + FP). They trade off as the decision threshold moves, so you generally can't maximize both at once. The classic medical strategy is to use a high-sensitivity test for initial screening (cast a wide net, accept some false alarms) and a high-specificity test for confirmation (rule in only true cases, accept some misses). An important subtlety: sensitivity and specificity are properties of the model and, in principle, don't change with prevalence, but positive predictive value (precision) does. A model with 90% sensitivity and 90% specificity has a PPV of only about 8% in a population where the disease occurs 1% of the time, so more than 9 out of 10 alarms are false. This is the prevalence trap that clinical ML teams must account for when deploying in a new patient population.
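For the deployment point at the end of that answer, a rough sketch of re-estimating both predictive values for a new population might look like the following. It assumes sensitivity and specificity carry over unchanged to the new site, which should be checked rather than taken for granted. The function name `predictive_values` and the prevalence figures are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence.

    Assumes sensitivity and specificity are the same in the new population.
    """
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical prevalences for three deployment sites.
for prevalence in (0.01, 0.10, 0.30):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

At low prevalence the two values move in opposite directions: PPV drops sharply while NPV stays close to 1, so a negative result remains informative even when most positives are false alarms.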