Machine Learning Foundations · Lesson 42 of 70

Precision and Recall Explained

The Two Questions

Precision: Of all the cases I flagged as positive, how many actually were?
           "When I raise an alarm, how often am I right?"

Recall:    Of all the actual positives, how many did I catch?
           "How many real positives did I find?"

Formulas

Precision = TP / (TP + FP)
  TP = True Positives  (correctly flagged as positive)
  FP = False Positives (incorrectly flagged as positive — false alarm)

Recall = TP / (TP + FN)
  FN = False Negatives (missed positives — should have been flagged)

F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = harmonic mean — low if either precision or recall is low
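
The formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch: the helper name `precision_recall_f1` and the counts passed to it are illustrative assumptions, not part of any library.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion counts."""
    # Guard against division by zero (nothing flagged, or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 15 true positives, 10 false alarms, 5 misses
p, r, f1 = precision_recall_f1(tp=15, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Lopsided case: perfect precision, but only 10 of 100 positives caught
p2, r2, f1_lopsided = precision_recall_f1(tp=10, fp=0, fn=90)
arithmetic_mean = (p2 + r2) / 2
print(f"f1={f1_lopsided:.3f} vs arithmetic mean={arithmetic_mean:.3f}")
```

In the lopsided case the arithmetic mean is 0.55 while F1 is about 0.18, which is why the harmonic mean is described as low if either component is low.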

Computing Them

Python

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
import numpy as np

# Sepsis early warning model
# 100 patients: 20 have sepsis (positive), 80 don't (negative)
y_true = np.array([1]*20 + [0]*80)
y_pred = np.array([1]*15 + [0]*5 + [0]*70 + [1]*10)
#                  TP=15   FN=5   TN=70    FP=10

precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")   # 15/(15+10) = 0.600
print(f"Recall:    {recall:.3f}")      # 15/(15+5)  = 0.750
print(f"F1:        {f1:.3f}")          # 2(0.600)(0.750)/(0.600+0.750) = 0.667

print("\nFull report:")
print(classification_report(y_true, y_pred, target_names=["no_sepsis", "sepsis"]))
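
To confirm the TP/FN/TN/FP counts noted in the comments, one option is to read them back from scikit-learn's `confusion_matrix`. The sketch below rebuilds the same sepsis arrays so it runs on its own; the `*_manual` variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same toy sepsis data as above: 20 positives, 80 negatives
y_true = np.array([1]*20 + [0]*80)
y_pred = np.array([1]*15 + [0]*5 + [0]*70 + [1]*10)

# For binary labels {0, 1} the matrix is [[TN, FP], [FN, TP]],
# so ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

# Recompute the metrics directly from the counts
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
print(f"precision={precision_manual:.3f} recall={recall_manual:.3f}")
```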

The Precision-Recall Tradeoff

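Raising the classification threshold generally increases precision and can only lower recall (or leave it unchanged). Lowering the threshold increases recall, usually at the cost of precision. In practice you can't maximize both simultaneously, so choosing a threshold means choosing which kind of error to tolerate.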

Python

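import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (about 10% positives) so the example runs end to end;
# swap in your own X and y
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]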

# Compute precision and recall at each threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Print the tradeoff across thresholds
print(f"{'Threshold':>10}  {'Precision':>10}  {'Recall':>8}  {'F1':>8}")
print("-" * 42)
for t, p, r in zip(thresholds[::10], precisions[::10], recalls[::10]):
    f1 = 2*p*r/(p+r+1e-9)
    print(f"{t:>10.3f}  {p:>10.3f}  {r:>8.3f}  {f1:>8.3f}")

# Low threshold → high recall, low precision (catch everything, many false alarms)
# High threshold → high precision, low recall (only flag when very confident, miss some)
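
The same tradeoff can be seen on a dataset small enough to verify by hand. The ten labels and scores below are made up for illustration; only NumPy is used.

```python
import numpy as np

# Hypothetical labels and model scores, sorted by score for readability
y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

results = {}
for t in (0.2, 0.5, 0.8):
    y_pred = y_score >= t                       # flag everything at or above t
    tp = int(np.sum(y_pred & (y_true == 1)))    # flagged and actually positive
    fp = int(np.sum(y_pred & (y_true == 0)))    # flagged but actually negative
    fn = int(np.sum(~y_pred & (y_true == 1)))   # positive but not flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    results[t] = (precision, recall)
    print(f"threshold={t:.1f}: precision={precision:.2f}, recall={recall:.2f}")
```

On this data, precision rises from 0.50 to 0.60 to 1.00 as the threshold goes from 0.2 to 0.5 to 0.8, while recall falls from 1.00 to 0.75 to 0.50.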

When to Prioritize Precision

Prioritize precision when false positives are costly:

Example: Drug adverse event alert (pop-up in EHR)
  False positive: alert fires when patient is not at risk
    → physician alert fatigue, distrust of the system, wasted time
  False negative: missed adverse event
    → patient harm (but physician may catch it manually)

→ Prefer precision: only fire when you're confident
→ Use a higher threshold, accept lower recall

Other examples:
  - LLM safety classifier false positive: blocking a legitimate request
  - Spam filter: deleting a real email
  - Clinical trial enrollment: including ineligible patients

High-Precision Configuration

Python

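import numpy as np
from sklearn.metrics import precision_score, recall_score

# Reuses model, X_test, and y_test from the tradeoff example
y_proba = model.predict_proba(X_test)[:, 1]

# Raise threshold until precision reaches 0.90
for threshold in np.arange(0.3, 0.95, 0.05):
    y_pred_t = (y_proba >= threshold).astype(int)
    if y_pred_t.sum() == 0:
        # Nothing is flagged any more, so the target was not reached
        print("No positive predictions left; precision target not reached")
        break
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.2f}: precision={p:.3f}, recall={r:.3f}")
    if p >= 0.90:
        print(f"→ Threshold {threshold:.2f} achieves precision ≥ 0.90")
        break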
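
Instead of stepping through a fixed grid, the threshold can also be read off `precision_recall_curve`, which evaluates every distinct score. The sketch below is one possible approach, not the only one; the helper name `lowest_threshold_for_precision` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def lowest_threshold_for_precision(y_true, y_score, target):
    """Return (threshold, precision, recall) for the lowest threshold
    whose precision meets the target, or None if no threshold does."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # precisions and recalls carry one extra trailing entry (precision=1,
    # recall=0) that has no threshold, so drop it to align the arrays
    ok = precisions[:-1] >= target
    if not ok.any():
        return None
    # thresholds are ascending, so the first match is the lowest qualifying
    # threshold, and therefore the one that keeps recall highest
    i = int(np.argmax(ok))
    return float(thresholds[i]), float(precisions[i]), float(recalls[i])

# Hypothetical labels and scores
y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

picked = lowest_threshold_for_precision(y_true, y_score, target=0.90)
print(picked)
```

Picking the lowest qualifying threshold is a deliberate choice here: among all thresholds that meet the precision target, it gives up the least recall.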

When to Prioritize Recall

Prioritize recall when false negatives are costly:

Example: Sepsis early warning
  False negative: sepsis patient not flagged → delayed treatment → higher mortality
  False positive: non-sepsis patient flagged → extra labs, physician review (manageable)

→ Prefer recall: catch as many positives as possible
→ Use a lower threshold, accept more false alarms

Other examples:
  - Cancer screening: missing a cancer is worse than an unnecessary biopsy
  - Drug-drug interaction detection: missing an interaction is worse than a false alert
  - LLM harmful content detection: letting harmful content through vs blocking legitimate content
  - Readmission model (discharge planning): missing a high-risk patient is much worse than flagging a low-risk one

High-Recall Configuration

Python

# Lower threshold until recall reaches 0.90
for threshold in np.arange(0.7, 0.05, -0.05):
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.2f}: precision={p:.3f}, recall={r:.3f}")
    if r >= 0.90:
        print(f"→ Threshold {threshold:.2f} achieves recall ≥ 0.90")
        break
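
When the costs of the two error types can be estimated, the threshold can be chosen by minimizing total cost rather than by targeting a fixed recall. The following is a rough sketch under stated assumptions: the helper `cheapest_threshold`, the cost values, and the toy data are all hypothetical, and real clinical costs would need to come from domain experts.

```python
import numpy as np

def cheapest_threshold(y_true, y_score, cost_fp, cost_fn):
    """Try each distinct score as a threshold and return (threshold, cost)
    for the one with the lowest total cost (ties go to the lowest threshold)."""
    best = None
    for t in np.unique(y_score):                    # ascending candidates
        y_pred = y_score >= t
        fp = int(np.sum(y_pred & (y_true == 0)))    # false alarms
        fn = int(np.sum(~y_pred & (y_true == 1)))   # missed positives
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (float(t), cost)
    return best

# Hypothetical labels and scores
y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

# Misses cost 10x a false alarm (recall-oriented, as in the sepsis example)
recall_oriented = cheapest_threshold(y_true, y_score, cost_fp=1, cost_fn=10)
# False alarms cost 5x a miss (precision-oriented, as in the alert-fatigue example)
precision_oriented = cheapest_threshold(y_true, y_score, cost_fp=5, cost_fn=1)
print(recall_oriented, precision_oriented)
```

On this toy data the recall-oriented costs select a threshold of 0.45 and the precision-oriented costs select 0.85, matching the direction described above.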

Macro vs Micro vs Weighted Averaging (Multi-Class)

Python

from sklearn.metrics import f1_score

# Multi-class: drug category prediction (anticoagulant, antidiabetic, antihypertensive, antibiotic)
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0]

# Macro: average F1 across classes, each class weighted equally
# Good when you care about all classes equally (including rare ones)
f1_macro    = f1_score(y_true, y_pred, average="macro")

# Weighted: average of per-class F1, weighted by class support (frequency)
# Reflects performance on the actual class mix, but frequent classes dominate the score
f1_weighted = f1_score(y_true, y_pred, average="weighted")

# Micro: compute TP/FP/FN globally across all classes
# For single-label multi-class problems this equals accuracy
f1_micro    = f1_score(y_true, y_pred, average="micro")

print(f"Macro F1:    {f1_macro:.3f}")
print(f"Weighted F1: {f1_weighted:.3f}")
print(f"Micro F1:    {f1_micro:.3f}")

# For imbalanced multi-class with rare but important classes: use macro F1
# (ensures rare class performance matters as much as common classes)
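
To see where the three averages come from, it can help to start from the per-class F1 scores and combine them by hand. This sketch repeats the same drug-category labels so it runs on its own; the `*_manual` names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0]

# average=None returns one F1 per class, ordered by label
per_class = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)            # number of true samples per class

macro_manual = per_class.mean()                           # every class counts equally
weighted_manual = np.average(per_class, weights=support)  # frequent classes count more
micro = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("Per-class F1:", np.round(per_class, 3))
print(f"Macro (manual):    {macro_manual:.3f}")
print(f"Weighted (manual): {weighted_manual:.3f}")
print(f"Micro: {micro:.3f}  Accuracy: {accuracy:.3f}")
```

Class 0 has the most samples and the highest F1 here, so the weighted average (about 0.703) comes out above the macro average (about 0.664).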

Precision-Recall Curve vs ROC Curve

ROC curve (TPR vs FPR):
  - Works well when classes are roughly balanced
  - FPR = FP / (FP + TN), so a large pool of true negatives keeps FPR low even when false positives outnumber true positives
  - Can therefore look optimistic for imbalanced datasets

Precision-Recall curve:
  - Focuses on the minority class
  - Not affected by true negatives
  - Better for severe imbalance (e.g., 99% negative)
  - Average precision (a common summary of the area under the PR curve) is usually the more informative single number for imbalanced data
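
The point about true negatives can be illustrated with simple arithmetic. In the sketch below the confusion counts are invented: TP, FP, and FN are held fixed while only the number of true negatives grows.

```python
def rates(tp, fp, fn, tn):
    """Return (precision, recall, fpr) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)      # identical to TPR on the ROC curve
    fpr = fp / (fp + tn)         # the only quantity that involves TN
    return precision, recall, fpr

# Hypothetical counts: same detections and errors, different numbers of negatives
tp, fp, fn = 80, 100, 20
results = {}
for tn in (900, 99_900):
    results[tn] = rates(tp, fp, fn, tn)
    precision, recall, fpr = results[tn]
    print(f"TN={tn:>6}: precision={precision:.3f}, recall={recall:.3f}, FPR={fpr:.4f}")
```

FPR drops from 0.1 to 0.001 as negatives are added, which makes the ROC point look far better, while precision stays at about 0.444 because more than half of the alarms are still false.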

Python

from sklearn.metrics import average_precision_score, roc_auc_score

y_proba = model.predict_proba(X_test)[:, 1]

auc_roc = roc_auc_score(y_test, y_proba)
auc_pr  = average_precision_score(y_test, y_proba)

print(f"AUC-ROC:            {auc_roc:.3f}  (use for balanced data)")
print(f"Average Precision:  {auc_pr:.3f}  (use for imbalanced data)")

Interview Answer Template

Q: What's the difference between precision and recall, and when do you use each?

Precision is the fraction of positive predictions that are actually positive — "when I raise an alarm, how often am I right?" Recall is the fraction of actual positives that were identified — "how many real positives did I catch?" They trade off: a lower threshold catches more positives (higher recall) but also more false alarms (lower precision). Which to optimize depends on cost asymmetry. For sepsis detection, missing a case is catastrophic — prioritize recall, accept more false alarms. For a drug alert pop-up system, too many false alarms cause alert fatigue — prioritize precision, accept that some cases won't be flagged. In practice, I report both plus F1 (their harmonic mean), choose the threshold based on the clinical cost of each error type, and use the precision-recall curve to visualize the full tradeoff before committing to a threshold.
