Machine Learning Foundations · Lesson 41 of 70
Why Accuracy Alone is Not Enough
The Accuracy Trap
Accuracy is the fraction of correct predictions. It sounds perfect. It isn't.
from sklearn.metrics import accuracy_score
import numpy as np
# Clinical dataset: 30-day readmission for diabetic patients
# 90% are NOT readmitted
y_true = np.array([0] * 900 + [1] * 100) # 900 negative, 100 positive
# A model that predicts "not readmitted" for everyone
y_always_negative = np.zeros(1000, dtype=int)
acc = accuracy_score(y_true, y_always_negative)
print(f"Accuracy of always-negative model: {acc:.2%}") # 90.00%
# This model has 90% accuracy — and is completely useless
# It never identifies a single patient at risk of readmission
Why Imbalanced Classes Break Accuracy
Dataset: 90% class 0 (no readmission), 10% class 1 (readmission)
Always-predict-0 model:
Accuracy = 900/1000 = 90% ← sounds great
Recall for class 1 = 0/100 = 0% ← catches zero at-risk patients
A good model:
Accuracy = 830/1000 = 83% ← lower accuracy
Recall for class 1 = 75/100 = 75% ← catches 75% of at-risk patients
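The "worse" model is dramatically more useful.
import numpy as np
from sklearn.metrics import classification_report
y_true = np.array([0]*900 + [1]*100)
y_baseline = np.zeros(1000, dtype=int)
# 900 negatives: 755 predicted 0 (TN), 145 predicted 1 (FP)
# 100 positives: 25 predicted 0 (FN), 75 predicted 1 (TP)
# This gives accuracy 830/1000 = 83% and recall 75/100 = 75%, matching the numbers above
y_good_model = np.concatenate([np.zeros(755), np.ones(145), np.zeros(25), np.ones(75)])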
print("Always-negative model:")
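# zero_division=0 avoids a warning: precision for "readmit" is undefined when it is never predicted
print(classification_report(y_true, y_baseline, target_names=["no_readmit", "readmit"], zero_division=0))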
print("Better model:")
print(classification_report(y_true, y_good_model.astype(int), target_names=["no_readmit", "readmit"]))
The Baseline Comparison
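The comparison in this section refers to `your_model`, `X`, and `y` without defining them. One possible setup is sketched here so the snippet can be run end to end. It assumes synthetic data from `make_classification` and a logistic regression; both are illustrative stand-ins, not the lesson's clinical dataset or a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an imbalanced dataset (assumption: about 90% negatives).
# This is NOT the clinical readmission data described in the lesson.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    weights=[0.9, 0.1],  # approximate class proportions
    random_state=0,
)

# Hypothetical choice of estimator; any scikit-learn classifier could be used.
your_model = LogisticRegression(max_iter=1000)

print(f"Positive rate: {y.mean():.1%}")
```

With these three names defined, the baseline comparison can be run as written.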
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
# Always compare your model to the majority-class baseline
dummy = DummyClassifier(strategy="most_frequent")
# your_model, X, y are placeholders for your estimator, feature matrix, and labels
model_scores = cross_val_score(your_model, X, y, cv=5, scoring="accuracy")
dummy_scores = cross_val_score(dummy, X, y, cv=5, scoring="accuracy")
print(f"Your model accuracy: {model_scores.mean():.2%}")
print(f"Baseline accuracy: {dummy_scores.mean():.2%}")
print(f"Improvement over baseline: {(model_scores.mean() - dummy_scores.mean()):.2%}")
# If improvement is small, your model has barely learned anything
When Accuracy IS Useful
Accuracy is a reasonable metric when:
- Classes are roughly balanced (each class within about 10 percentage points of an even split)
- All mistakes cost the same (false positive = false negative in severity)
- You're doing multi-class classification with roughly equal class frequencies
Accuracy is misleading when:
- Significant class imbalance exists
- One type of error is much more costly than the other
- You need to understand model behavior on the minority class
Common Imbalanced Clinical Scenarios
| Scenario | Typical Imbalance | Why Accuracy Fails |
|---|---|---|
| 30-day readmission | 10–20% positive | 80–90% accuracy from "never readmitted" baseline |
| Rare drug adverse event | 1–5% positive | 95–99% accuracy from "no event" baseline |
| Sepsis early warning | 5–15% positive | High accuracy from always-negative baseline |
| Drug-drug interaction flag | 2–8% positive | Almost all interactions are irrelevant, so "never flag" scores 92–98% |
| LLM safety classifier | 1–3% harmful | 97–99% accuracy from "safe" baseline |
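The accuracy figures in the table follow directly from the class proportions: a model that always predicts the majority class scores an accuracy equal to the majority-class frequency. A small sketch of that arithmetic, using the positive-rate ranges from the table (the helper name `majority_baseline_accuracy` is made up for illustration):

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of a model that always predicts the more frequent class."""
    return max(positive_rate, 1.0 - positive_rate)


# Positive-rate ranges copied from the table above.
scenarios = {
    "30-day readmission": (0.10, 0.20),
    "Rare drug adverse event": (0.01, 0.05),
    "Sepsis early warning": (0.05, 0.15),
    "Drug-drug interaction flag": (0.02, 0.08),
    "LLM safety classifier": (0.01, 0.03),
}

for name, (low, high) in scenarios.items():
    # The higher the positive rate, the lower the always-negative accuracy.
    worst = majority_baseline_accuracy(high)
    best = majority_baseline_accuracy(low)
    print(f"{name:<28} baseline accuracy: {worst:.0%} to {best:.0%}")
```

Any reported accuracy should be read against this floor rather than against 50%.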
What to Report Instead
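from sklearn.metrics import (
    accuracy_score, roc_auc_score, average_precision_score,
    f1_score, classification_report
)
# model, X_test, y_test are placeholders for a fitted binary classifier and held-out data
y_proba = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
y_pred = model.predict(X_test)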
print(f"Accuracy:            {accuracy_score(y_test, y_pred):.3f} ← supplement with others")
print(f"AUC-ROC:             {roc_auc_score(y_test, y_proba):.3f} ← threshold-independent")
print(f"AUC-PR (avg prec):   {average_precision_score(y_test, y_proba):.3f} ← for imbalanced data")
# f1_score reports the positive class (label 1) by default, assumed here to be the minority class
print(f"F1 (minority class): {f1_score(y_test, y_pred):.3f} ← harmonic mean of precision and recall")
print()
print(classification_report(y_test, y_pred))
The Key Insight: Cost Asymmetry
In medicine, missing a positive case (false negative) is often much worse
than a false alarm (false positive).
Example: sepsis classifier
False negative: patient with sepsis not flagged → delayed treatment → higher mortality
False positive: patient without sepsis flagged → extra workup → wasted resources
→ Optimize for recall (catch as many positives as possible),
accept lower precision (some false alarms are OK)
→ Track recall (with a ranking metric such as AUC-ROC); accuracy says little about missed cases
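One common way to act on this asymmetry is to lower the decision threshold until recall reaches a required level, then accept whatever precision results. The sketch below illustrates the idea with `precision_recall_curve`. The scores are synthetic, and the 90% recall target is a hypothetical requirement, not a clinical recommendation.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic scores (assumption): 950 negatives, 50 positives,
# with positives tending to receive higher scores.
y_true = np.array([0] * 950 + [1] * 50)
y_score = np.concatenate([
    rng.beta(2, 5, size=950),  # negatives: scores skewed toward 0
    rng.beta(5, 2, size=50),   # positives: scores skewed toward 1
])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

target_recall = 0.90  # hypothetical requirement
# precision and recall have one more entry than thresholds; drop the final point.
meets_target = recall[:-1] >= target_recall
# Recall does not increase as the threshold rises, so the highest threshold
# that still meets the target gives the fewest false alarms.
chosen = thresholds[meets_target].max()

y_pred = (y_score >= chosen).astype(int)
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

print(f"Chosen threshold: {chosen:.3f}")
print(f"Recall:    {tp / (tp + fn):.2f}")
print(f"Precision: {tp / (tp + fp):.2f}")
```

In practice the threshold would be chosen on a validation set and then checked on held-out data, since a threshold tuned and evaluated on the same samples looks better than it is.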
LLM safety classifier:
False negative: harmful output passes filter → user harm, reputational damage
False positive: safe output blocked → mild inconvenience
→ Again: prioritize recall, report F1 separately for each class
Interview Answer Template
Q: Why isn't accuracy always a good metric?
Accuracy is misleading whenever class imbalance exists. A model that always predicts the majority class gets accuracy equal to the majority class frequency — 90% accuracy on a dataset with 90% negatives — while catching zero positive cases. The baseline to beat is always the majority-class dummy: if your model barely improves on that, it hasn't learned anything useful about the minority class. Instead, use metrics that evaluate minority-class performance: AUC-ROC (threshold-independent ranking quality), average precision / precision-recall AUC (especially for severe imbalance), F1-score (harmonic mean of precision and recall), or recall directly if false negatives are more costly than false positives. In clinical ML, accuracy almost never tells the right story — a sepsis model with 95% accuracy that misses 80% of sepsis cases is worse than useless.
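The closing sepsis example can be checked with simple arithmetic. The counts below are hypothetical, chosen only so that the model reaches 95% accuracy while missing 80% of sepsis cases; the 5% prevalence is an assumption.

```python
# Hypothetical cohort (assumption): 10,000 patients, 5% with sepsis.
n_sepsis, n_non_sepsis = 500, 9_500

tp, fn = 100, 400    # model misses 80% of sepsis cases
tn, fp = 9_400, 100  # few false alarms on patients without sepsis

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

# Always-negative baseline on the same cohort.
baseline_accuracy = n_non_sepsis / (n_sepsis + n_non_sepsis)

print(f"Model accuracy:    {accuracy:.1%}")
print(f"Model recall:      {recall:.1%}")
print(f"Model precision:   {precision:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
```

Under these assumed counts the model's accuracy is no higher than the always-negative baseline, which is why the accuracy figure alone says nothing about whether sepsis cases are being caught.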