Reading a Confusion Matrix
Step-by-step guide to reading confusion matrices: binary and multi-class, row vs column orientation, normalization, identifying systematic errors, and what each quadrant reveals about model behavior.
The Standard Layout
Conventions vary between tools and papers, but sklearn uses this layout:
                  | Predicted Negative | Predicted Positive |
Actual Negative   |         TN         |         FP         |
Actual Positive   |         FN         |         TP         |
Rows = actual class
Columns = predicted class
Diagonal = correct predictions
Off-diagonal = errors
Reading a Binary Matrix Step by Step
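Before reading individual cells, it helps to confirm the orientation, since it differs between tools. The sketch below uses a tiny hand-made label set (hypothetical data, not the warfarin example) to check that sklearn puts actual classes on rows, and that passing the arguments in the wrong order transposes the matrix, silently exchanging FP and FN.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 3 actual negatives, 2 actual positives
y_true_toy = np.array([0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 1, 1, 1, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)      # rows = actual, columns = predicted
cm_swapped = confusion_matrix(y_pred_toy, y_true_toy)  # arguments reversed by mistake

# ravel() flattens row by row, so the order is TN, FP, FN, TP
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()
print(cm_toy)
print(f"TN={tn_toy} FP={fp_toy} FN={fn_toy} TP={tp_toy}")
print("Swapped arguments give the transpose:", np.array_equal(cm_swapped, cm_toy.T))
```

If a matrix comes from another tool or a paper, the same check (one known false positive, one known false negative) should show which axis holds the actual class.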
from sklearn.metrics import confusion_matrix
import numpy as np
# Warfarin bleeding risk model
# Positive = high bleeding risk (needs dose reduction)
# Negative = normal risk
y_true = np.array([0]*180 + [1]*20) # 200 patients, 20 high-risk
y_pred = np.array([0]*168 + [1]*12 + [0]*6 + [1]*14)
# TN=168 FP=12 FN=6 TP=14
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[168 12]
# [ 6 14]]
# Step 1: Read the diagonal (correct predictions)
print(f"Correctly classified: TN={tn} (safe patients cleared) + TP={tp} (high-risk flagged) = {tn+tp}")
# Step 2: Read the off-diagonal (errors)
print(f"\nErrors:")
print(f" FP={fp}: {fp} safe patients flagged as high-risk (unnecessary caution)")
print(f" FN={fn}: {fn} high-risk patients missed (most dangerous error)")
# Step 3: Total counts
total = tn + fp + fn + tp
print(f"\nTotals:")
print(f" All actual negatives: {tn+fp} ({(tn+fp)/total:.0%})")
print(f" All actual positives: {fn+tp} ({(fn+tp)/total:.0%})")
print(f" All predicted negative: {tn+fn} ({(tn+fn)/total:.0%})")
print(f" All predicted positive: {fp+tp} ({(fp+tp)/total:.0%})")
What Each Quadrant Means Clinically
Warfarin bleeding risk model:
TN (168): Patient is low-risk, model says low-risk
  → Safe to maintain current dose
  → Correct and no action needed
TP (14): Patient is high-risk, model says high-risk
  → Dose reduction recommended
  → Correct: clinical intervention triggered
FP (12): Patient is low-risk, model says high-risk
  → Unnecessary dose reduction recommended
  → Outcome: sub-therapeutic anticoagulation, possible clotting event
  → Costly, but at least an action was taken
FN (6): Patient is high-risk, model says low-risk
  → No intervention: dose maintained
  → Outcome: patient at risk of bleeding event
  → Most dangerous: no safety net for a patient who needs protection
Normalization Modes
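The normalization modes are plain divisions of the raw counts. As a sketch of the arithmetic, the block below takes the warfarin counts from above (typed in by hand so it runs on its own) and divides by row sums, column sums, and the grand total. These are the values that `normalize="true"`, `"pred"`, and `"all"` are expected to return.

```python
import numpy as np

# Warfarin example counts, laid out as [[TN, FP], [FN, TP]]
cm_counts = np.array([[168, 12],
                      [6, 14]])

by_actual = cm_counts / cm_counts.sum(axis=1, keepdims=True)     # each row sums to 1
by_predicted = cm_counts / cm_counts.sum(axis=0, keepdims=True)  # each column sums to 1
by_total = cm_counts / cm_counts.sum()                           # all cells sum to 1

print("By actual class:\n", by_actual.round(3))
print("By predicted class:\n", by_predicted.round(3))
print("By total:\n", by_total.round(3))
```

In the column-normalized matrix the bottom-right cell is 14/26, about 0.538: roughly 54% of patients flagged as high-risk truly are, even though 70% of high-risk patients are caught.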
from sklearn.metrics import confusion_matrix
# Raw counts: useful for understanding scale
cm_raw = confusion_matrix(y_true, y_pred)
print("Raw:\n", cm_raw)
# Normalized by actual class (rows sum to 1.0)
# Tells you: of all actual positives, what fraction was caught?
cm_true = confusion_matrix(y_true, y_pred, normalize="true")
print("Normalized by actual (rows):\n", cm_true.round(3))
# [[0.933, 0.067]   ← 93.3% of negatives correctly cleared; 6.7% falsely flagged
#  [0.300, 0.700]]  ← 70.0% of high-risk patients caught; 30.0% missed
# Normalized by predicted class (columns sum to 1.0)
# Tells you: of all patients flagged as positive, what fraction truly were?
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")
print("Normalized by predicted (cols):\n", cm_pred.round(3))
# Normalized by total: every cell is a fraction of all predictions
cm_all = confusion_matrix(y_true, y_pred, normalize="all")
print("Normalized by total:\n", cm_all.round(3))
Reading a Multi-Class Matrix
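With more than two classes, the first thing to pin down is which class each row and column index refers to. By default sklearn orders the classes by sorted label value, and the `labels` argument fixes a different order. The sketch below uses five hypothetical string labels (illustration only) to show how the same predictions produce differently arranged matrices.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical string labels, for illustration only
y_true_s = ["antibiotic", "anticoagulant", "antibiotic", "antidiabetic", "anticoagulant"]
y_pred_s = ["antibiotic", "anticoagulant", "antidiabetic", "antidiabetic", "antibiotic"]

# Default order is sorted: antibiotic, anticoagulant, antidiabetic
default_cm = confusion_matrix(y_true_s, y_pred_s)

# Pin the order explicitly so row/column i always means the same class
order = ["anticoagulant", "antidiabetic", "antibiotic"]
pinned_cm = confusion_matrix(y_true_s, y_pred_s, labels=order)

print("Default (sorted) order:\n", default_cm)
print("Pinned order", order, ":\n", pinned_cm)
```

The diagonal total is the same in both, but the off-diagonal cells move, so a matrix is only readable alongside its label order.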
# Drug category classification: 4 classes
classes = ["anticoagulant", "antidiabetic", "antihypertensive", "antibiotic"]
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 1, 2]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2, 3, 2, 1, 2]
cm = confusion_matrix(y_true, y_pred)
print("Rows = Actual, Columns = Predicted")
header = " " * 14 + " ".join(f"{c[:6]:>8}" for c in classes)
print(header)
for row, name in zip(cm, classes):
    vals = " ".join(f"{v:>8}" for v in row)
    print(f"{name[:12]:>12}: {vals}")
# How to read:
# Row 0 (anticoagulant): cm[0,0]=correct anticoag, cm[0,1]=anticoag predicted as antidiabetic, ...
# Row 2 (antihypertensive): cm[2,3] → antihypertensives predicted as antibiotics
# Row 3 (antibiotic): cm[3,2] → antibiotics predicted as antihypertensives
# Confusion in both directions is suspicious: investigate feature overlap between these two classes
Identifying Systematic Errors
import numpy as np
def analyze_confusion_matrix(cm: np.ndarray, class_names: list) -> None:
    n = len(class_names)
    print("=== Confusion Matrix Analysis ===\n")

    # Per-class accuracy (diagonal / row sum)
    print("Per-class accuracy (recall):")
    for i in range(n):
        row_total = cm[i, :].sum()
        acc = cm[i, i] / row_total if row_total > 0 else 0
        print(f" {class_names[i]:<20}: {acc:.2%} ({cm[i,i]}/{row_total})")

    # Most confused pairs (off-diagonal)
    print("\nTop confused pairs (off-diagonal errors):")
    errors = []
    for i in range(n):
        for j in range(n):
            if i != j and cm[i, j] > 0:
                errors.append((class_names[i], class_names[j], cm[i, j]))
    for actual, predicted, count in sorted(errors, key=lambda x: -x[2])[:5]:
        print(f" {actual} → {predicted}: {count} times")

analyze_confusion_matrix(confusion_matrix(y_true, y_pred), classes)
Threshold Effect on the Matrix
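As a self-contained sketch of how the four counts move with the threshold, the block below uses ten hand-made scores instead of a trained model. The data, the `counts_at` helper, and the 5:1 cost ratio are all hypothetical, chosen only to make the trade-off visible.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for 10 patients (illustration only)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.25, 0.35, 0.45, 0.65, 0.30, 0.55, 0.75, 0.90])

# Hypothetical costs: a missed high-risk patient counts 5x a false alarm
COST_FN, COST_FP = 5, 1

def counts_at(threshold):
    """Return (tn, fp, fn, tp) when scores >= threshold are called positive."""
    pred = y_score_demo >= threshold
    actual = y_true_demo == 1
    tp = int(np.sum(pred & actual))
    fp = int(np.sum(pred & ~actual))
    fn = int(np.sum(~pred & actual))
    tn = int(np.sum(~pred & ~actual))
    return tn, fp, fn, tp

results = {}
print(f"{'Threshold':>10} {'TN':>4} {'FP':>4} {'FN':>4} {'TP':>4} {'Cost':>6}")
for threshold in [0.2, 0.4, 0.6, 0.8]:
    tn_d, fp_d, fn_d, tp_d = counts_at(threshold)
    cost = COST_FN * fn_d + COST_FP * fp_d
    results[threshold] = (tn_d, fp_d, fn_d, tp_d, cost)
    print(f"{threshold:>10.1f} {tn_d:>4} {fp_d:>4} {fn_d:>4} {tp_d:>4} {cost:>6}")

# Threshold with the lowest total cost under the assumed cost ratio
best = min(results, key=lambda t: results[t][4])
print(f"Lowest-cost threshold: {best}")
```

With these made-up numbers the lowest threshold wins because false negatives are weighted heavily. A different cost ratio would move the choice, which is why the costs need to come from the clinical setting and not from the model.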
import numpy as np
from sklearn.metrics import confusion_matrix
# Assumes `model` is a fitted binary classifier and X_test, y_test are a held-out split
y_proba = model.predict_proba(X_test)[:, 1]
# Lower threshold → more positive predictions → TP and FP increase, FN decreases
print(f"{'Threshold':>10} {'TN':>6} {'FP':>6} {'FN':>6} {'TP':>6}")
print("-" * 40)
for threshold in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    y_pred_t = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t).ravel()
    print(f"{threshold:>10.1f} {tn:>6} {fp:>6} {fn:>6} {tp:>6}")
# As threshold drops: FN shrinks (fewer missed positives), FP grows (more false alarms)
# Pick the threshold based on which error is more costly
Interview Answer Template
Q: Walk me through how to read a confusion matrix.
A confusion matrix is a grid of actual vs predicted labels. For binary classification in sklearn's layout (rows = actual, columns = predicted): the top-left is true negatives (correctly predicted no event), top-right is false positives (false alarms), bottom-left is false negatives (missed events), and bottom-right is true positives (correctly flagged events). The diagonal is always what the model got right; the off-diagonal shows the errors. The key question is: which off-diagonal cell is larger, and what does that mean clinically? In a sepsis model, FN matters most, because each one is a missed sepsis patient. In an alert system, FP matters, because false alarms cause alert fatigue. I normalize by actual class (rows) to get recall per class, and by predicted class (columns) to get precision per class. For multi-class, I look at the off-diagonal for systematic confusion patterns between specific classes, which often reveals whether the model lacks the features needed to distinguish them.
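The row and column normalization mentioned in the answer can be sketched in a few lines. The block below rebuilds the 4-class matrix from the drug-category labels used earlier with plain NumPy (a choice made here so the example runs on its own), then reads recall off the rows and precision off the columns. It assumes every class appears at least once among both the actual and predicted labels, so no division by zero occurs.

```python
import numpy as np

classes_mc = ["anticoagulant", "antidiabetic", "antihypertensive", "antibiotic"]
y_true_mc = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 1, 2])
y_pred_mc = np.array([0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2, 3, 2, 1, 2])

# Build the matrix by hand: rows = actual, columns = predicted
cm_mc = np.zeros((4, 4), dtype=int)
np.add.at(cm_mc, (y_true_mc, y_pred_mc), 1)

recall_mc = np.diag(cm_mc) / cm_mc.sum(axis=1)     # diagonal / row totals
precision_mc = np.diag(cm_mc) / cm_mc.sum(axis=0)  # diagonal / column totals

print(cm_mc)
for name, p, r in zip(classes_mc, precision_mc, recall_mc):
    print(f"{name:<18} precision={p:.2f} recall={r:.2f}")
```

Here antibiotic has the lowest recall (0.50) because half of the actual antibiotics were predicted as antihypertensive, and the same two errors pull antihypertensive precision down to 0.60.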