Machine Learning Foundations · Lesson 49 of 70
Medical AI: When False Negatives Are Deadly
The Asymmetry of Medical Errors
A false positive in clinical ML:
→ Unnecessary test, extra monitoring, slightly higher cost
→ Patient safety is maintained (they receive extra attention)
→ Physician can override or investigate the alert
A false negative in clinical ML:
→ Patient with a real condition receives no alert
→ No safety net — the system has signaled "everything is fine"
→ Clinical team acts on a false reassurance
→ Outcome: delayed treatment, missed diagnosis, preventable harm
This asymmetry, where one error type is catastrophically more costly than the other, is the central challenge of clinical ML model evaluation.
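One way to make the asymmetry concrete is the standard expected-cost decision rule. If a model outputs a calibrated probability p that the patient has the condition, alerting is cheaper in expectation whenever p × cost_fn exceeds (1 − p) × cost_fp, which rearranges to p > cost_fp / (cost_fp + cost_fn). The sketch below assumes calibrated probabilities and zero cost for correct decisions; the function name and the dollar figures are illustrative assumptions, not clinical guidance.

```python
def cost_optimal_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which alerting has lower expected cost than staying silent.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs: a miss is priced at 100x a false alarm.
threshold = cost_optimal_threshold(cost_fn=50_000, cost_fp=500)
print(f"Alert when predicted risk exceeds {threshold:.4f}")
```

With a 100:1 cost ratio the break-even probability is about 1%, far below the 0.5 cutoff most classifiers apply by default.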
Quantifying the Cost Difference
import numpy as np
from sklearn.metrics import confusion_matrix
# Sepsis early warning model — 500 ICU patients, 50 have sepsis
y_true = np.array([0]*450 + [1]*50)
# Model A: high recall, lower precision (catches most sepsis, some false alarms)
y_pred_a = np.array([0]*400 + [1]*50 + [0]*5 + [1]*45)
# TN=400 FP=50 FN=5 TP=45
# Model B: high precision, lower recall (fewer false alarms, misses more sepsis)
y_pred_b = np.array([0]*440 + [1]*10 + [0]*20 + [1]*30)
# TN=440 FP=10 FN=20 TP=30
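def compute_clinical_cost(y_true, y_pred, cost_fn, cost_fp):
    # ravel() on a 2x2 confusion matrix yields (TN, FP, FN, TP) in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # Correct predictions are treated as zero-cost; only errors are priced
    return fn * cost_fn + fp * cost_fp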
# Sepsis: FN = delayed treatment (~$50K ICU costs + mortality risk)
# FP = extra workup, extra labs (~$500 per false alarm)
cost_a = compute_clinical_cost(y_true, y_pred_a, cost_fn=50000, cost_fp=500)
cost_b = compute_clinical_cost(y_true, y_pred_b, cost_fn=50000, cost_fp=500)
print(f"Model A (high recall): total cost = ${cost_a:,}")
print(f"Model B (high precision): total cost = ${cost_b:,}")
# Model A: 5 FN × $50K + 50 FP × $500 = $275,000
# Model B: 20 FN × $50K + 10 FP × $500 = $1,005,000
# Model A is cheaper despite more false alarms
The False Reassurance Problem
# The most dangerous property of a false negative:
# The system has communicated "no action needed"
# The physician trusts the system and moves on
# This is especially dangerous when:
# 1. The condition deteriorates quickly (sepsis, PE, stroke)
# 2. The system is used to triage who needs manual review
# 3. The physician is overloaded and treats alerts as the primary filter
# Compare to a false positive:
# Physician reviews the alert, decides it's a false alarm, moves on
# Extra 2 minutes of physician time, no patient harm
# Design implication: clinical triage systems should default to high recall
# → Only lower recall if there's a strong clinical argument (alert fatigue with evidence)
Designing for High Recall
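The code in this section uses X_train, y_train, X_test, and y_test without defining them. To run it end to end, a synthetic stand-in can be generated as sketched below. The data is artificial and purely illustrative, not real patient data; the sample size, roughly 10% prevalence, feature counts, and random seed are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for real ICU features: 2,000 patients, ~10% positive.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.9, 0.1],  # class imbalance, about 10% positives
    random_state=0,
)

# Stratify so the rare positive class appears in both splits at the same rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(f"Train positives: {y_train.sum()} / {len(y_train)}")
print(f"Test positives:  {y_test.sum()} / {len(y_test)}")
```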
from sklearn.metrics import recall_score, precision_score
from sklearn.linear_model import LogisticRegression
import numpy as np
# X_train, y_train, X_test, y_test are assumed to be defined already (binary labels, 1 = sepsis)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Step 1: Find the threshold that achieves target recall
target_recall = 0.90
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
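# thresholds are returned in increasing order, so recall falls as we move along them.
# Keep the highest threshold that still meets the target; stopping at the first match
# would select the lowest threshold, where recall is trivially near 1.0.
recall_threshold = None
selected_precision = None
for t, p, r in zip(thresholds, precisions[:-1], recalls[:-1]):
    if r >= target_recall:
        recall_threshold = t
        selected_precision = p
    else:
        break
if recall_threshold is not None:
    print(f"Threshold for recall={target_recall}: {recall_threshold:.3f}")
    print(f"  Precision at this threshold: {selected_precision:.3f}")
    print(f"  For every {1/selected_precision:.0f} alarms, 1 is a true positive")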
# Step 2: Apply threshold and measure
y_pred_high_recall = (y_proba >= recall_threshold).astype(int)
print(f"\nActual recall: {recall_score(y_test, y_pred_high_recall):.3f}")
print(f"Actual precision: {precision_score(y_test, y_pred_high_recall):.3f}")
Alert Fatigue: The Counterargument
# Alert fatigue is a real clinical phenomenon
# If a system generates too many false positives, physicians stop responding to alerts
# This can make even the true positives go unaddressed
# The tension:
# Threshold too strict (few false positives) → more true positives missed (FN problem)
# Threshold too loose (many false positives) → physicians ignore alerts, including true ones
# Quantifying the alert fatigue threshold
# Some research suggests that at least roughly 5-10% of alerts need to be true positives
# for physicians to maintain vigilance; treat this as a rule of thumb, since it varies by setting
def check_alert_fatigue_risk(precision, threshold_pct=0.10):
    """
    If precision is below threshold_pct, alert fatigue is likely.
    """
    if precision < threshold_pct:
        print(f"WARNING: Precision {precision:.1%} is below {threshold_pct:.0%}")
        print("Alert fatigue risk: physicians may stop responding to alerts")
        print("Consider: higher threshold, risk stratification, or bundled alerts")
    else:
        print(f"Precision {precision:.1%}: alert fatigue risk is manageable")
Risk Stratification Instead of Binary Alert
# Instead of a binary alarm (high risk / low risk),
# use a tiered approach that gives more context and reduces alarm fatigue
def risk_stratify(y_proba: np.ndarray) -> np.ndarray:
    """
    Three-tier risk stratification:
      Green  = monitor normally
      Yellow = enhanced monitoring (more frequent checks)
      Red    = urgent clinical review
    """
    tiers = np.select(
        condlist=[
            y_proba >= 0.70,  # high risk
            y_proba >= 0.40,  # moderate risk
        ],
        choicelist=["RED", "YELLOW"],
        default="GREEN",
    )
    return tiers
y_proba = model.predict_proba(X_test)[:, 1]
tiers = risk_stratify(y_proba)
y_test_arr = np.asarray(y_test)  # allows boolean-mask indexing for lists, Series, or arrays
for tier in ["RED", "YELLOW", "GREEN"]:
    mask = tiers == tier
    count = mask.sum()
    true_pos = y_test_arr[mask].sum()
    print(f"{tier}: {count} patients, {true_pos} with sepsis")
# Benefits:
# - Red tier should have the highest precision (fewer false alarms); verify this per tier
# - Yellow tier catches moderate risk before it escalates
# - Clinicians can calibrate their response to the tier
What to Report for Safety-Critical Systems
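Point estimates of recall on a small positive class can be misleading, so it can help to report an interval alongside them. The sketch below uses the Wilson score interval, a standard binomial confidence interval that is not part of the original lesson. The function name is hypothetical, and the counts are taken from Model A in the sepsis example (45 of 50 sepsis cases caught).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Model A from the sepsis example: 45 true positives out of 50 sepsis cases.
low, high = wilson_interval(successes=45, n=50)
print(f"Recall = {45 / 50:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

With only 50 positive cases the interval is wide, which is worth stating explicitly when a recall target such as 0.90 is part of the acceptance criteria.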
from sklearn.metrics import (
recall_score, precision_score, f1_score, fbeta_score,
classification_report, roc_auc_score
)
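# model.predict() applies the default 0.5 cutoff; to report the high-recall operating point,
# use y_pred = (y_proba >= recall_threshold).astype(int) instead
y_pred = model.predict(X_test)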
y_proba = model.predict_proba(X_test)[:, 1]
print("=== Clinical Safety Metrics ===")
print(f"Recall (sensitivity): {recall_score(y_test, y_pred):.3f} ← PRIMARY metric")
print(f"Precision (PPV): {precision_score(y_test, y_pred):.3f}")
print(f"F2 score: {fbeta_score(y_test, y_pred, beta=2):.3f} ← recall weighted 2x")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
print(f"\nFalse negative rate: {1 - recall_score(y_test, y_pred):.3f} ← miss rate")
print(classification_report(y_test, y_pred, target_names=["no_sepsis", "sepsis"]))
Interview Answer Template
Q: Why might false negatives be more important than false positives in clinical ML?
In clinical ML, a false positive means an unnecessary alert — the physician reviews it, determines it's a false alarm, and moves on. No patient harm occurs. A false negative means a sick patient receives a clean bill of health from the system — no alert fires, no action is taken, and the condition progresses untreated. The system has provided false reassurance. For time-sensitive conditions like sepsis, missed detection means delayed treatment and significantly higher mortality. This cost asymmetry means I optimize for recall first in clinical screening models — find the threshold that achieves recall ≥ 0.90 or whatever the clinical team specifies, then see what precision that yields. The counterargument is alert fatigue: if precision is too low, physicians stop responding to all alerts, including true ones. The resolution is risk stratification — tiered alerts instead of binary, so high-confidence predictions get urgent attention and moderate-confidence ones get enhanced monitoring.
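The recall-first reasoning in this answer can be sanity-checked against the cost asymmetry by sweeping a few thresholds and pricing the errors at each one. The sketch below is a minimal illustration: the helper name, the hand-made scores, and the dollar costs are all hypothetical, and correct predictions are assumed to cost nothing.

```python
import numpy as np


def sweep_thresholds(y_true, y_proba, cost_fn, cost_fp, thresholds):
    """Return (threshold, recall, precision, total_cost) for each candidate threshold."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    rows = []
    for t in thresholds:
        y_pred = y_proba >= t
        tp = int(np.sum(y_pred & (y_true == 1)))
        fp = int(np.sum(y_pred & (y_true == 0)))
        fn = int(np.sum(~y_pred & (y_true == 1)))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        rows.append((float(t), recall, precision, fn * cost_fn + fp * cost_fp))
    return rows


# Tiny hand-made example (hypothetical scores, not real patient data).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.90])

rows = sweep_thresholds(
    y_true_demo, y_proba_demo, cost_fn=50_000, cost_fp=500,
    thresholds=[0.25, 0.50, 0.75],
)
for t, recall, precision, cost in rows:
    print(f"threshold={t:.2f}  recall={recall:.2f}  precision={precision:.2f}  cost=${cost:,}")
```

In this toy data the lowest threshold has the worst precision but by far the lowest total cost, which is the same pattern as Model A versus Model B.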