Machine Learning Foundations · Lesson 42 of 70
Precision and Recall Explained
The Two Questions
Precision: Of all the cases I flagged as positive, how many actually were?
"When I raise an alarm, how often am I right?"
Recall: Of all the actual positives, how many did I catch?
"How many real positives did I find?"
Formulas
Precision = TP / (TP + FP)
TP = True Positives (correctly flagged as positive)
FP = False Positives (incorrectly flagged as positive — false alarm)
Recall = TP / (TP + FN)
FN = False Negatives (missed positives — should have been flagged)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= harmonic mean, low if either precision or recall is low
Computing Them
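As a quick sanity check before using a library, the three formulas can be evaluated directly from confusion-matrix counts. The sketch below is illustrative only: the helper name `precision_recall_f1` is hypothetical, and the counts (TP=15, FP=10, FN=5) are the ones used in the sepsis example that follows.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=15, fp=10, fn=5)
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1: {f1:.3f}")

# The harmonic mean punishes imbalance: precision 1.0 with recall 0.1
# has an arithmetic mean of 0.55 but an F1 of only about 0.18.
lopsided_f1 = 2 * 1.0 * 0.1 / (1.0 + 0.1)
print(f"Lopsided F1: {lopsided_f1:.3f}")
```

The zero-denominator guards return 0.0 by convention, which mirrors the `zero_division=0` option used with scikit-learn later in this lesson.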
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
import numpy as np
# Sepsis early warning model
# 100 patients: 20 have sepsis (positive), 80 don't (negative)
y_true = np.array([1]*20 + [0]*80)
y_pred = np.array([1]*15 + [0]*5 + [0]*70 + [1]*10)
# TP=15 FN=5 TN=70 FP=10
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.3f}") # 15/(15+10) = 0.600
print(f"Recall: {recall:.3f}") # 15/(15+5) = 0.750
print(f"F1: {f1:.3f}") # harmonic mean
print("\nFull report:")
print(classification_report(y_true, y_pred, target_names=["no_sepsis", "sepsis"]))
The Precision-Recall Tradeoff
Raising the classification threshold usually increases precision and can only lower (or hold) recall. Lowering it increases recall, usually at the cost of precision. With imperfect scores you can't maximize both at once.
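The tradeoff can be seen on a handful of examples without training a model. The following is a minimal sketch in plain Python: the labels and scores are made up for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical labels and model scores for 10 examples (illustrative only)
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Evaluate three thresholds, from permissive to strict
results = {t: precision_recall_at(y_true, y_score, t) for t in (0.2, 0.5, 0.9)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, precision rises from 0.50 to 1.00 while recall falls from 1.00 to 0.25 as the threshold moves from 0.2 to 0.9.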
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
# Assumes X_train, y_train, X_test, y_test come from an earlier train/test split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Compute precision and recall at each threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Print the tradeoff across thresholds
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>8} {'F1':>8}")
print("-" * 42)
for t, p, r in zip(thresholds[::10], precisions[::10], recalls[::10]):
    f1 = 2*p*r/(p+r+1e-9)
    print(f"{t:>10.3f} {p:>10.3f} {r:>8.3f} {f1:>8.3f}")
# Low threshold → high recall, low precision (catch everything, many false alarms)
# High threshold → high precision, low recall (only flag when very confident, miss some)
When to Prioritize Precision
Prioritize precision when false positives are costly:
Example: Drug adverse event alert (pop-up in EHR)
False positive: alert fires when patient is not at risk
→ physician alert fatigue, distrust of the system, wasted time
False negative: missed adverse event
→ patient harm (but physician may catch it manually)
→ Prefer precision: only fire when you're confident
→ Use a higher threshold, accept lower recall
Other examples:
- LLM safety classifier: blocking a legitimate request
- Spam filter: deleting a real email
- Clinical trial enrollment: including ineligible patients
High-Precision Configuration
import numpy as np
from sklearn.metrics import precision_score, recall_score
y_proba = model.predict_proba(X_test)[:, 1]
# Raise threshold until precision reaches 0.90
for threshold in np.arange(0.3, 0.95, 0.05):
    y_pred_t = (y_proba >= threshold).astype(int)
    if y_pred_t.sum() == 0:
        break  # nothing is flagged at this threshold or above
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.2f}: precision={p:.3f}, recall={r:.3f}")
    if p >= 0.90:
        print(f"→ Threshold {threshold:.2f} achieves precision ≥ 0.90")
        break
When to Prioritize Recall
Prioritize recall when false negatives are costly:
Example: Sepsis early warning
False negative: sepsis patient not flagged → delayed treatment → higher mortality
False positive: non-sepsis patient flagged → extra labs, physician review (manageable)
→ Prefer recall: catch as many positives as possible
→ Use a lower threshold, accept more false alarms
Other examples:
- Cancer screening: missing a cancer is worse than an unnecessary biopsy
- Drug-drug interaction detection: missing an interaction is worse than a false alert
- LLM harmful content detection: letting harmful content through vs blocking legitimate content
- Readmission model (discharge planning): missing a high-risk patient is much worse than flagging a low-risk one
High-Recall Configuration
# Lower threshold until recall reaches 0.90
for threshold in np.arange(0.7, 0.05, -0.05):
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.2f}: precision={p:.3f}, recall={r:.3f}")
    if r >= 0.90:
        print(f"→ Threshold {threshold:.2f} achieves recall ≥ 0.90")
        break
Macro vs Micro vs Weighted Averaging (Multi-Class)
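To make the three averaging modes concrete, the sketch below computes them by hand in plain Python, using the same labels as the scikit-learn example in this section. The helper name `f1_from_counts` is an assumption introduced for illustration; it uses the identity F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall.

```python
# Same labels as the scikit-learn example in this section
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0]
classes = sorted(set(y_true))

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

per_class_f1, support = {}, {}
tp_sum = fp_sum = fn_sum = 0
for c in classes:
    # One-vs-rest counts for class c
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    per_class_f1[c] = f1_from_counts(tp, fp, fn)
    support[c] = tp + fn  # number of true examples of class c
    tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

macro = sum(per_class_f1.values()) / len(classes)             # every class counts equally
weighted = sum(per_class_f1[c] * support[c] for c in classes) / len(y_true)
micro = f1_from_counts(tp_sum, fp_sum, fn_sum)                # pooled counts
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Macro: {macro:.3f}  Weighted: {weighted:.3f}  Micro: {micro:.3f}  Accuracy: {accuracy:.3f}")
```

Because each example has exactly one label, every misclassification is one false positive for the predicted class and one false negative for the true class, so micro F1 coincides with accuracy here.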
from sklearn.metrics import f1_score
# Multi-class: drug category prediction (anticoagulant, antidiabetic, antihypertensive, antibiotic)
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0]
# Macro: average F1 across classes, each class weighted equally
# Good when you care about all classes equally (including rare ones)
f1_macro = f1_score(y_true, y_pred, average="macro")
# Weighted: average F1 weighted by class support (frequency)
# Reflects the class distribution, so frequent classes dominate the score
f1_weighted = f1_score(y_true, y_pred, average="weighted")
# Micro: compute TP/FP/FN globally across all classes
# (for single-label multi-class problems this equals accuracy)
f1_micro = f1_score(y_true, y_pred, average="micro")
print(f"Macro F1: {f1_macro:.3f}")
print(f"Weighted F1: {f1_weighted:.3f}")
print(f"Micro F1: {f1_micro:.3f}")
# For imbalanced multi-class with rare but important classes: use macro F1
# (ensures rare class performance matters as much as common classes)
Precision-Recall Curve vs ROC Curve
ROC curve (TPR vs FPR):
- Works well when classes are balanced
- FPR = FP / (FP + TN) stays small when true negatives dominate, even if false positives are numerous
- Can look optimistic for imbalanced datasets
Precision-Recall curve:
- Focuses on the minority class
- Not affected by true negatives
- Better for severe imbalance (e.g., 99% negative)
- Average precision (a common estimate of AUC-PR) is usually the more informative summary metric for imbalanced data
from sklearn.metrics import average_precision_score, roc_auc_score
y_proba = model.predict_proba(X_test)[:, 1]
auc_roc = roc_auc_score(y_test, y_proba)
auc_pr = average_precision_score(y_test, y_proba)
print(f"AUC-ROC: {auc_roc:.3f} (use for balanced data)")
print(f"Average Precision: {auc_pr:.3f} (use for imbalanced data)")
Interview Answer Template
Q: What's the difference between precision and recall, and when do you use each?
Precision is the fraction of positive predictions that are actually positive — "when I raise an alarm, how often am I right?" Recall is the fraction of actual positives that were identified — "how many real positives did I catch?" They trade off: a lower threshold catches more positives (higher recall) but also more false alarms (lower precision). Which to optimize depends on cost asymmetry. For sepsis detection, missing a case is catastrophic — prioritize recall, accept more false alarms. For a drug alert pop-up system, too many false alarms cause alert fatigue — prioritize precision, accept that some cases won't be flagged. In practice, I report both plus F1 (their harmonic mean), choose the threshold based on the clinical cost of each error type, and use the precision-recall curve to visualize the full tradeoff before committing to a threshold.
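The cost-based threshold choice mentioned in the answer can be sketched in a few lines. Everything here is an assumption for illustration: the labels and scores are made up, the cost values are hypothetical (real ones would come from clinical stakeholders), and `total_cost` and `pick_threshold` are not library functions.

```python
# Hypothetical labels and model scores (illustrative only)
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

def total_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Total error cost when every score >= threshold is flagged positive."""
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

def pick_threshold(y_true, y_score, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, y_score, t, cost_fp, cost_fn))

candidates = (0.2, 0.5, 0.9)

# Misses are 10x worse than false alarms (sepsis-style): favors recall
recall_first = pick_threshold(y_true, y_score, candidates, cost_fp=1, cost_fn=10)

# False alarms are 10x worse than misses (alert-fatigue-style): favors precision
precision_first = pick_threshold(y_true, y_score, candidates, cost_fp=10, cost_fn=1)

print(f"Miss-averse threshold: {recall_first}, alarm-averse threshold: {precision_first}")
```

With these hypothetical costs, the miss-averse setting picks the lowest candidate threshold and the alarm-averse setting picks the highest, which matches the sepsis and drug-alert reasoning in the answer.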