Machine Learning Foundations · Lesson 41 of 70

Why Accuracy Alone is Not Enough

The Accuracy Trap

Accuracy is the fraction of correct predictions. It sounds perfect. It isn't.

Python

from sklearn.metrics import accuracy_score
import numpy as np

# Clinical dataset: 30-day readmission for diabetic patients
# 90% are NOT readmitted
y_true = np.array([0] * 900 + [1] * 100)   # 900 negative, 100 positive

# A model that predicts "not readmitted" for everyone
y_always_negative = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_always_negative)
print(f"Accuracy of always-negative model: {acc:.2%}")   # 90.00%

# This model has 90% accuracy — and is completely useless
# It never identifies a single patient at risk of readmission

Why Imbalanced Classes Break Accuracy

Dataset: 90% class 0 (no readmission), 10% class 1 (readmission)

Always-predict-0 model:
  Accuracy = 900/1000 = 90%  ← sounds great
  Recall for class 1 = 0/100 = 0%  ← catches zero at-risk patients

A good model:
  Accuracy = 830/1000 = 83%  ← lower accuracy
  Recall for class 1 = 75/100 = 75%  ← catches 75% of at-risk patients

The "worse" model is dramatically more useful.

Python

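import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0]*900 + [1]*100)
y_baseline = np.zeros(1000, dtype=int)

# "Good" model with the counts behind the figures above:
#   accuracy = (755 + 75) / 1000 = 83%, recall = 75 / 100 = 75%
y_good_model = np.concatenate([
    np.zeros(755), np.ones(145),   # 900 true negatives: 755 correct, 145 false alarms
    np.zeros(25),  np.ones(75),    # 100 true positives: 25 missed, 75 caught
]).astype(int)

print("Always-negative model:")
# zero_division=0 avoids a warning, since this model never predicts "readmit"
print(classification_report(y_true, y_baseline, target_names=["no_readmit", "readmit"], zero_division=0))

print("Better model:")
print(classification_report(y_true, y_good_model, target_names=["no_readmit", "readmit"]))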
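
The "good model" figures quoted earlier (83% accuracy, 75% recall) can be checked by hand from a confusion matrix. The sketch below rebuilds predictions with the counts those figures imply: 75 of 100 positives caught, and 830 correct overall, which leaves 755 true negatives and 145 false alarms. The array layout is illustrative, not real patient data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.concatenate([
    np.zeros(755), np.ones(145),   # negatives: 755 correct, 145 false alarms
    np.zeros(25), np.ones(75),     # positives: 25 missed, 75 caught
]).astype(int)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"Precision: {precision:.2%}")
```

Precision comes out low because 145 of the 220 flagged patients are false alarms. That is the price of the higher recall.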

The Baseline Comparison

Python

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always compare your model to the majority-class baseline
# (your_model, X, and y are placeholders for your own estimator and data)
dummy = DummyClassifier(strategy="most_frequent")
model_scores = cross_val_score(your_model, X, y, cv=5, scoring="accuracy")
dummy_scores = cross_val_score(dummy,      X, y, cv=5, scoring="accuracy")

print(f"Your model accuracy:    {model_scores.mean():.2%}")
print(f"Baseline accuracy:      {dummy_scores.mean():.2%}")
print(f"Improvement over baseline: {(model_scores.mean() - dummy_scores.mean()):.2%}")

# If improvement is small, your model has barely learned anything
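
For a version that runs end to end, here is a minimal sketch on synthetic data. The dataset (`make_classification` with a 90/10 class split), the choice of `LogisticRegression`, and the `class_weight="balanced"` setting are all assumptions made for illustration. Recall is scored next to accuracy, because accuracy alone can hide what the model does on the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data (assumption): roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

summary = {}
for name, estimator in candidates.items():
    acc = cross_val_score(estimator, X, y, cv=5, scoring="accuracy").mean()
    rec = cross_val_score(estimator, X, y, cv=5, scoring="recall").mean()
    summary[name] = (acc, rec)
    print(f"{name:<10} accuracy={acc:.2%}  recall={rec:.2%}")
```

The baseline's recall is zero by construction. A model can beat it on recall while scoring similar or even lower accuracy.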

When Accuracy IS Useful

Accuracy is a reasonable metric when:
  - Classes are roughly balanced (each class within about 10 percentage points of an even split)
  - All mistakes cost the same (false positive = false negative in severity)
  - You're doing multi-class classification with roughly equal class frequencies

Accuracy is misleading when:
  - Significant class imbalance exists
  - One type of error is much more costly than the other
  - You need to understand model behavior on the minority class
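
A quick way to tell which list applies is to compute the majority-class share, which is also the accuracy an always-predict-majority model would get. The helper below is a hypothetical convenience function, not part of scikit-learn, and both label arrays are made up for illustration.

```python
import numpy as np

def majority_class_share(y):
    """Fraction of samples in the most frequent class.

    Equals the accuracy of a model that always predicts that class.
    Hypothetical helper for illustration only.
    """
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

balanced = np.array([0] * 520 + [1] * 480)     # near-even split
imbalanced = np.array([0] * 900 + [1] * 100)   # 90/10 split

print(f"Balanced data:   baseline accuracy = {majority_class_share(balanced):.0%}")
print(f"Imbalanced data: baseline accuracy = {majority_class_share(imbalanced):.0%}")
```

When the share is close to an even split, a high accuracy score means something. When it is 90%, a 90% accuracy means nothing on its own.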

Common Imbalanced Clinical Scenarios

| Scenario | Typical Imbalance | Why Accuracy Fails |
|---|---|---|
| 30-day readmission | 10–20% positive | 80–90% accuracy from "never readmitted" baseline |
| Rare drug adverse event | 1–5% positive | 95–99% accuracy from "no event" baseline |
| Sepsis early warning | 5–15% positive | High accuracy from always-negative |
| Drug-drug interaction flag | 2–8% positive | Almost all interactions are irrelevant |
| LLM safety classifier | 1–3% harmful | 97–99% accuracy from "safe" baseline |

What to Report Instead

Python

from sklearn.metrics import (
    accuracy_score, roc_auc_score, average_precision_score,
    f1_score, classification_report
)

# model, X_test, y_test: placeholders for a fitted binary classifier and held-out data
y_proba = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1
y_pred  = model.predict(X_test)

# f1_score scores label 1 by default, so this assumes the minority class is coded as 1
print(f"Accuracy:            {accuracy_score(y_test, y_pred):.3f}  ← supplement with others")
print(f"AUC-ROC:             {roc_auc_score(y_test, y_proba):.3f}  ← threshold-independent")
print(f"AUC-PR (avg prec):   {average_precision_score(y_test, y_proba):.3f}  ← for imbalanced data")
print(f"F1 (minority class): {f1_score(y_test, y_pred):.3f}  ← harmonic mean of precision and recall")
print()
print(classification_report(y_test, y_pred))
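
To see why these metrics expose what accuracy hides, it helps to score the always-negative model from the first example with all of them. This is a minimal sketch: the constant score array stands in for a model that assigns every patient the same risk, which is an assumption about how such a model would output probabilities.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, roc_auc_score, average_precision_score, f1_score
)

y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000, dtype=int)   # always predicts "not readmitted"
y_score = np.zeros(1000)             # same risk score for everyone

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "auc_roc": roc_auc_score(y_true, y_score),
    "avg_precision": average_precision_score(y_true, y_score),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}

for name, value in metrics.items():
    print(f"{name:<14} {value:.3f}")
```

Only accuracy looks good (0.900). AUC-ROC sits at chance level (0.500), average precision equals the positive rate (0.100), and F1 is zero.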

The Key Insight: Cost Asymmetry

In medicine, missing a positive case (false negative) is often much worse
than a false alarm (false positive).

Example: sepsis classifier
  False negative: patient with sepsis not flagged → delayed treatment → higher mortality
  False positive: healthy patient flagged → extra workup → wasted resources

  → Optimize for recall (catch as many positives as possible),
    accept lower precision (some false alarms are OK)
  → Recall (with AUC-ROC for ranking quality) is the metric to watch; accuracy says little here
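
One common way to trade precision for recall is to lower the decision threshold on the model's scores. The sketch below uses synthetic scores drawn from two normal distributions, an assumption made purely for illustration, and shows how precision and recall move as the threshold drops.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 900 + [1] * 100)

# Synthetic scores (assumption): positives tend to score higher than negatives
scores = np.concatenate([
    rng.normal(0.3, 0.15, 900),   # negatives
    rng.normal(0.6, 0.15, 100),   # positives
])

results = {}
for threshold in (0.5, 0.4, 0.3):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can only stay the same or rise as the threshold falls, because every case flagged at a higher threshold is still flagged at a lower one. Precision usually pays for it.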

LLM safety classifier:
  False negative: harmful output passes filter → user harm, reputational damage
  False positive: safe output blocked → mild inconvenience
  → Again: prioritize recall, report F1 separately for each class
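
Cost asymmetry can be made concrete by pricing each error type. The sketch below reuses the error counts from the readmission example (the always-negative model misses all 100 positives; the 83%-accuracy model misses 25 and raises 145 false alarms). The 10:1 cost ratio is an assumed, illustrative number, not a clinical estimate.

```python
def total_cost(fn, fp, cost_fn=10.0, cost_fp=1.0):
    """Total error cost given counts of false negatives and false positives.

    cost_fn and cost_fp are illustrative assumptions (a miss costs 10x a false alarm).
    """
    return fn * cost_fn + fp * cost_fp

always_negative_cost = total_cost(fn=100, fp=0)   # 90% accuracy, 0% recall
better_model_cost = total_cost(fn=25, fp=145)     # 83% accuracy, 75% recall

print(f"Always-negative model cost: {always_negative_cost:.0f}")
print(f"Better model cost:          {better_model_cost:.0f}")
```

Under these assumed costs the lower-accuracy model is far cheaper (395 versus 1000). The ranking would change if false alarms were priced much higher, which is why the cost ratio needs to be stated rather than left implicit.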

Interview Answer Template

Q: Why isn't accuracy always a good metric?

Accuracy is misleading whenever class imbalance exists. A model that always predicts the majority class gets accuracy equal to the majority class frequency — 90% accuracy on a dataset with 90% negatives — while catching zero positive cases. The baseline to beat is always the majority-class dummy: if your model barely improves on that, it hasn't learned anything useful about the minority class. Instead, use metrics that evaluate minority-class performance: AUC-ROC (threshold-independent ranking quality), average precision / precision-recall AUC (especially for severe imbalance), F1-score (harmonic mean of precision and recall), or recall directly if false negatives are more costly than false positives. In clinical ML, accuracy almost never tells the right story — a sepsis model with 95% accuracy that misses 80% of sepsis cases is worse than useless.

Next Lesson

Precision and Recall Explained