Machine Learning Foundations · Lesson 49 of 70

Medical AI: When False Negatives Are Deadly

The Asymmetry of Medical Errors

A false positive in clinical ML:
  → Unnecessary test, extra monitoring, slightly higher cost
  → Patient safety is maintained (they receive extra attention)
  → Physician can override or investigate the alert

A false negative in clinical ML:
  → Patient with a real condition receives no alert
  → No safety net — the system has signaled "everything is fine"
  → Clinical team acts on a false reassurance
  → Outcome: delayed treatment, missed diagnosis, preventable harm

This asymmetry — where one error type is catastrophically more costly than the other — is the central challenge of clinical ML model evaluation.

Quantifying the Cost Difference

Python

import numpy as np
from sklearn.metrics import confusion_matrix

# Sepsis early warning model — 500 ICU patients, 50 have sepsis
y_true = np.array([0]*450 + [1]*50)

# Model A: high recall, lower precision (catches most sepsis, some false alarms)
y_pred_a = np.array([0]*400 + [1]*50 + [0]*5 + [1]*45)
#                    TN=400   FP=50   FN=5   TP=45

# Model B: high precision, lower recall (fewer false alarms, misses more sepsis)
y_pred_b = np.array([0]*440 + [1]*10 + [0]*20 + [1]*30)
#                    TN=440   FP=10   FN=20  TP=30

def compute_clinical_cost(y_true, y_pred, cost_fn, cost_fp):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * cost_fn + fp * cost_fp

# Sepsis: FN = delayed treatment (~$50K ICU costs + mortality risk)
#         FP = extra workup, extra labs (~$500 per false alarm)
cost_a = compute_clinical_cost(y_true, y_pred_a, cost_fn=50000, cost_fp=500)
cost_b = compute_clinical_cost(y_true, y_pred_b, cost_fn=50000, cost_fp=500)

print(f"Model A (high recall): total cost = ${cost_a:,}")
print(f"Model B (high precision): total cost = ${cost_b:,}")
# Model A: 5 FN × $50K + 50 FP × $500 = $275,000
# Model B: 20 FN × $50K + 10 FP × $500 = $1,005,000
# Model A is cheaper despite more false alarms

The False Reassurance Problem

Python

# The most dangerous property of a false negative:
# The system has communicated "no action needed"
# The physician trusts the system and moves on

# This is especially dangerous when:
# 1. The condition deteriorates quickly (sepsis, PE, stroke)
# 2. The system is used to triage who needs manual review
# 3. The physician is overloaded and treats alerts as the primary filter

# Compare to a false positive:
# Physician reviews the alert, decides it's a false alarm, moves on
# Extra 2 minutes of physician time, no patient harm

# Design implication: clinical triage systems should default to high recall
# → Only lower recall if there's a strong clinical argument (alert fatigue with evidence)

Designing for High Recall

Python

from sklearn.metrics import recall_score, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
import numpy as np

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]

# Step 1: Find the threshold that achieves target recall
target_recall = 0.90

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

recall_threshold = None
for t, p, r in zip(thresholds, precisions[:-1], recalls[:-1]):
    if r >= target_recall:
        recall_threshold = t
        selected_precision = p
        break

if recall_threshold:
    print(f"Threshold for recall={target_recall}: {recall_threshold:.3f}")
    print(f"  Precision at this threshold: {selected_precision:.3f}")
    print(f"  For every {1/selected_precision:.0f} alarms, 1 is a true positive")

# Step 2: Apply threshold and measure
y_pred_high_recall = (y_proba >= recall_threshold).astype(int)
print(f"\nActual recall:     {recall_score(y_test, y_pred_high_recall):.3f}")
print(f"Actual precision:  {precision_score(y_test, y_pred_high_recall):.3f}")

Alert Fatigue: The Counterargument

Python

# Alert fatigue is a real clinical phenomenon
# If a system generates too many false positives, physicians stop responding to alerts
# This can make even the true positives go unaddressed

# The tension:
# Too few false positives → some true positives missed (FN problem)
# Too many false positives → physicians ignore all alerts including true ones

# Quantifying alert fatigue threshold
# Research suggests: more than 5-10% of alerts being true positives is needed
# for physicians to maintain vigilance

def check_alert_fatigue_risk(precision, threshold_pct=0.10):
    """
    If precision is below threshold, alert fatigue is likely.
    """
    if precision < threshold_pct:
        print(f"WARNING: Precision {precision:.1%} is below {threshold_pct:.0%}")
        print("Alert fatigue risk: physicians may stop responding to alerts")
        print("Consider: higher threshold, risk stratification, or bundled alerts")
    else:
        print(f"Precision {precision:.1%} — alert fatigue risk is manageable")

Risk Stratification Instead of Binary Alert

Python

# Instead of a binary alarm (high risk / low risk),
# use a tiered approach that gives more context and reduces alarm fatigue

def risk_stratify(y_proba: np.ndarray) -> np.ndarray:
    """
    Three-tier risk stratification:
    Green = monitor normally
    Yellow = enhanced monitoring (more frequent checks)
    Red = urgent clinical review
    """
    tiers = np.select(
        condlist=[
            y_proba >= 0.70,        # high risk
            y_proba >= 0.40,        # moderate risk
        ],
        choicelist=["RED", "YELLOW"],
        default="GREEN"
    )
    return tiers

y_proba = model.predict_proba(X_test)[:, 1]
tiers = risk_stratify(y_proba)

for tier in ["RED", "YELLOW", "GREEN"]:
    count = (tiers == tier).sum()
    true_pos = y_test[tiers == tier].sum() if hasattr(y_test, '__getitem__') else 0
    print(f"{tier}: {count} patients flagged")

# Benefits:
# - Red tier has very high precision (fewer false alarms)
# - Yellow tier catches moderate risk before it escalates
# - Clinicians can calibrate their response to the tier

What to Report for Safety-Critical Systems

Python

from sklearn.metrics import (
    recall_score, precision_score, f1_score, fbeta_score,
    classification_report, roc_auc_score
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("=== Clinical Safety Metrics ===")
print(f"Recall (sensitivity):  {recall_score(y_test, y_pred):.3f}  ← PRIMARY metric")
print(f"Precision (PPV):       {precision_score(y_test, y_pred):.3f}")
print(f"F2 score:              {fbeta_score(y_test, y_pred, beta=2):.3f}  ← recall weighted 2x")
print(f"AUC-ROC:               {roc_auc_score(y_test, y_proba):.3f}")
print(f"\nFalse negative rate:   {1 - recall_score(y_test, y_pred):.3f}  ← miss rate")
print(classification_report(y_test, y_pred, target_names=["no_sepsis", "sepsis"]))

Interview Answer Template

Q: Why might false negatives be more important than false positives in clinical ML?

In clinical ML, a false positive means an unnecessary alert — the physician reviews it, determines it's a false alarm, and moves on. No patient harm occurs. A false negative means a sick patient receives a clean bill of health from the system — no alert fires, no action is taken, and the condition progresses untreated. The system has provided false reassurance. For time-sensitive conditions like sepsis, missed detection means delayed treatment and significantly higher mortality. This cost asymmetry means I optimize for recall first in clinical screening models — find the threshold that achieves recall ≥ 0.90 or whatever the clinical team specifies, then see what precision that yields. The counterargument is alert fatigue: if precision is too low, physicians stop responding to all alerts, including true ones. The resolution is risk stratification — tiered alerts instead of binary, so high-confidence predictions get urgent attention and moderate-confidence ones get enhanced monitoring.

Sensitivity and Specificity

Next Lesson

Interview: Confusion Matrix Scenario