AI Systems | Intermediate

Reading a Confusion Matrix

Step-by-step guide to reading confusion matrices: binary and multi-class, row vs column orientation, normalization, identifying systematic errors, and what each quadrant reveals about model behavior.

Asma Hafeez Khan | May 16, 2026 | 5 min read
Tags: Machine Learning, Confusion Matrix, Evaluation, Classification, Interview

The Standard Layout

Conventions vary between tools and papers, but sklearn puts actual classes on rows and predicted classes on columns, with labels in sorted order (negative = 0 first):

                    Predicted Negative   Predicted Positive
Actual Negative  |        TN           |        FP          |
Actual Positive  |        FN           |        TP          |

Rows = actual class
Columns = predicted class
Diagonal = correct predictions
Off-diagonal = errors
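
To make the orientation concrete, here is a minimal dependency-free sketch that builds the count grid by hand, with rows indexed by actual class and columns by predicted class. The helper name `count_matrix` and the toy labels are illustrative assumptions, not part of sklearn. Transposing the result gives the opposite convention (rows = predicted) that some tools and papers use.

```python
def count_matrix(y_true, y_pred, labels):
    """Return a grid where grid[i][j] counts samples of actual class
    labels[i] that were predicted as labels[j] (rows = actual)."""
    index = {label: k for k, label in enumerate(labels)}
    grid = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        grid[index[actual]][index[predicted]] += 1
    return grid


# Toy labels chosen only for illustration (0 = negative, 1 = positive).
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 1, 1, 0, 1, 0]

rows_actual = count_matrix(toy_true, toy_pred, labels=[0, 1])
# Transposing swaps the axes: rows become predicted, columns become actual.
rows_predicted = [list(col) for col in zip(*rows_actual)]

print(rows_actual)     # laid out as [[TN, FP], [FN, TP]]
print(rows_predicted)  # laid out as [[TN, FN], [FP, TP]]
```

If a matrix from an unfamiliar tool looks wrong, check which axis holds the actual labels before interpreting FP and FN, because the two cells trade places when the axes are swapped.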

Reading a Binary Matrix Step by Step

Python
from sklearn.metrics import confusion_matrix
import numpy as np

# Warfarin bleeding risk model
# Positive = high bleeding risk (needs dose reduction)
# Negative = normal risk

y_true = np.array([0]*180 + [1]*20)    # 200 patients, 20 high-risk
y_pred = np.array([0]*168 + [1]*12 + [0]*6 + [1]*14)
#                  TN=168   FP=12   FN=6    TP=14

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[168  12]
#  [  6  14]]

# Step 1: Read the diagonal (correct predictions)
print(f"Correctly classified: TN={tn} (safe patients cleared) + TP={tp} (high-risk flagged) = {tn+tp}")

# Step 2: Read the off-diagonal (errors)
print(f"\nErrors:")
print(f"  FP={fp}: {fp} safe patients flagged as high-risk (unnecessary caution)")
print(f"  FN={fn}: {fn} high-risk patients missed (most dangerous error)")

# Step 3: Total counts
total = tn + fp + fn + tp
print(f"\nTotals:")
print(f"  All actual negatives: {tn+fp} ({(tn+fp)/total:.0%})")
print(f"  All actual positives: {fn+tp} ({(fn+tp)/total:.0%})")
print(f"  All predicted negative: {tn+fn} ({(tn+fn)/total:.0%})")
print(f"  All predicted positive: {fp+tp} ({(fp+tp)/total:.0%})")
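
The four cells are also the inputs to the usual summary metrics. As a short sketch, using the cell counts from the warfarin example and the standard definitions of recall, precision, specificity, and accuracy:

```python
# Cell counts taken from the warfarin example above.
tn, fp, fn, tp = 168, 12, 6, 14

recall = tp / (tp + fn)        # share of high-risk patients that were flagged
precision = tp / (tp + fp)     # share of flagged patients that were truly high-risk
specificity = tn / (tn + fp)   # share of normal-risk patients that were cleared
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Recall:      {recall:.3f}")       # 0.700
print(f"Precision:   {precision:.3f}")    # 0.538
print(f"Specificity: {specificity:.3f}")  # 0.933
print(f"Accuracy:    {accuracy:.3f}")     # 0.910
```

Accuracy of 91% looks strong, but a model that always predicted "normal risk" would already score 90% on this data (180 of 200), which is why the individual cells matter more than the single number.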

What Each Quadrant Means Clinically

Warfarin bleeding risk model:

  TN (168): Patient is low-risk, model says low-risk
            → Safe to maintain current dose
            → Correct and no action needed

  TP (14):  Patient is high-risk, model says high-risk
            → Dose reduction recommended
            → Correct — clinical intervention triggered

  FP (12):  Patient is low-risk, model says high-risk
            → Unnecessary dose reduction recommended
            → Outcome: sub-therapeutic anticoagulation, possible clotting event
            → Costly, but the error at least triggers an action that can be reviewed

  FN (6):   Patient is high-risk, model says low-risk
            → No intervention — dose maintained
            → Outcome: patient at risk of bleeding event
            → Most dangerous: no safety net for a patient who needs protection

Normalization Modes

Python
from sklearn.metrics import confusion_matrix

# Raw counts — useful for understanding scale
cm_raw = confusion_matrix(y_true, y_pred)
print("Raw:\n", cm_raw)

# Normalized by actual class (rows sum to 1.0)
# Tells you: of all actual positives, what fraction was caught?
cm_true = confusion_matrix(y_true, y_pred, normalize="true")
print("Normalized by actual (rows):\n", cm_true.round(3))
# [[0.933, 0.067]  → 93.3% of negatives correctly cleared; 6.7% falsely flagged
#  [0.300, 0.700]] → 70.0% of high-risk patients caught; 30.0% missed

# Normalized by predicted class (columns sum to 1.0)
# Tells you: of all patients flagged as positive, what fraction truly were?
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")
print("Normalized by predicted (cols):\n", cm_pred.round(3))

# Normalized by total — every cell is fraction of all predictions
cm_all = confusion_matrix(y_true, y_pred, normalize="all")
print("Normalized by total:\n", cm_all.round(3))
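
For intuition about what the three `normalize` options compute, the sketch below reproduces them with plain division on the warfarin counts. It is a hand-rolled illustration, not sklearn's implementation: each mode only changes which total a cell is divided by.

```python
# Counts from the warfarin example: rows = actual, columns = predicted.
cm = [[168, 12],
      [6, 14]]

# normalize="true": divide each cell by its row total (actual-class count).
by_actual = [[cell / sum(row) for cell in row] for row in cm]

# normalize="pred": divide each cell by its column total (predicted-class count).
col_totals = [sum(col) for col in zip(*cm)]
by_predicted = [[cell / col_totals[j] for j, cell in enumerate(row)] for row in cm]

# normalize="all": divide each cell by the grand total.
grand_total = sum(sum(row) for row in cm)
by_total = [[cell / grand_total for cell in row] for row in cm]

for name, grid in [("actual", by_actual), ("predicted", by_predicted), ("total", by_total)]:
    print(name, [[round(v, 3) for v in row] for row in grid])
# actual    [[0.933, 0.067], [0.3, 0.7]]
# predicted [[0.966, 0.462], [0.034, 0.538]]
# total     [[0.84, 0.06], [0.03, 0.07]]
```

The bottom-right cell of the row-normalized grid is recall for the positive class (0.700), and the bottom-right cell of the column-normalized grid is precision (0.538).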

Reading a Multi-Class Matrix

Python
from sklearn.metrics import confusion_matrix

# Drug category classification: 4 classes
classes = ["anticoagulant", "antidiabetic", "antihypertensive", "antibiotic"]

y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 1, 2]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2, 3, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred)

print("Rows = Actual, Columns = Predicted")
# Pad the header by 14 characters so it lines up with the "name: " prefix below
header = " " * 14 + "  ".join(f"{c[:6]:>8}" for c in classes)
print(header)
for row, name in zip(cm, classes):
    vals = "  ".join(f"{v:>8}" for v in row)
    print(f"{name[:12]:>12}: {vals}")

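# How to read:
# Row 0 (anticoagulant): cm[0,0]=3 correct, cm[0,1]=1 anticoagulant predicted as antidiabetic
# Row 2 (antihypertensive): cm[2,3]=1 antihypertensive predicted as antibiotic
# Row 3 (antibiotic): cm[3,2]=2 antibiotics predicted as antihypertensive
# Errors in both directions between the same two classes are suspicious:
# investigate feature overlap between antihypertensives and antibiotics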
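
A multi-class matrix can also be collapsed into a binary one per class (one-vs-rest), which lets you read TP, FP, FN, and TN for each class exactly as in the binary case. The sketch below uses the 4x4 matrix that the 16 toy labels in this section produce; the helper name `one_vs_rest` is an illustrative assumption.

```python
classes = ["anticoagulant", "antidiabetic", "antihypertensive", "antibiotic"]

# Matrix produced by the 16 toy labels (rows = actual, columns = predicted).
cm = [[3, 1, 0, 0],
      [0, 4, 0, 0],
      [0, 0, 3, 1],
      [0, 0, 2, 2]]


def one_vs_rest(cm, k):
    """Collapse a multi-class matrix into (tn, fp, fn, tp) for class k."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                 # rest of row k: class k was missed
    fp = sum(row[k] for row in cm) - tp  # rest of column k: wrongly called k
    tn = total - tp - fn - fp
    return tn, fp, fn, tp


for k, name in enumerate(classes):
    tn, fp, fn, tp = one_vs_rest(cm, k)
    print(f"{name:<17} TN={tn:>2} FP={fp} FN={fn} TP={tp}")
```

For a given class, false negatives sit in the rest of its row and false positives sit in the rest of its column.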

Identifying Systematic Errors

Python
import numpy as np
from sklearn.metrics import confusion_matrix

def analyze_confusion_matrix(cm: np.ndarray, class_names: list) -> None:
    n = len(class_names)

    print("=== Confusion Matrix Analysis ===\n")

    # Per-class recall (diagonal / row sum): of all samples that truly belong
    # to a class, the fraction the model labeled correctly
    print("Per-class recall:")
    for i in range(n):
        row_total = cm[i, :].sum()
        recall = cm[i, i] / row_total if row_total > 0 else 0.0
        print(f"  {class_names[i]:<20}: {recall:.2%} ({cm[i,i]}/{row_total})")

    # Most confused pairs (off-diagonal)
    print("\nTop confused pairs (off-diagonal errors):")
    errors = []
    for i in range(n):
        for j in range(n):
            if i != j and cm[i, j] > 0:
                errors.append((class_names[i], class_names[j], cm[i, j]))

    for actual, predicted, count in sorted(errors, key=lambda x: -x[2])[:5]:
        print(f"  {actual} → {predicted}: {count} times")

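# y_true, y_pred, and classes come from the multi-class example above;
# confusion_matrix already returns a NumPy array
analyze_confusion_matrix(confusion_matrix(y_true, y_pred), classes)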
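
A one-directional list can hide the case where two classes are mistaken for each other in both directions. As a complementary sketch, the hypothetical helper `mutual_confusion` below adds cm[i][j] and cm[j][i] for each unordered pair, using the 4x4 matrix from the drug-category example.

```python
from itertools import combinations

classes = ["anticoagulant", "antidiabetic", "antihypertensive", "antibiotic"]

# Matrix from the drug-category example (rows = actual, columns = predicted).
cm = [[3, 1, 0, 0],
      [0, 4, 0, 0],
      [0, 0, 3, 1],
      [0, 0, 2, 2]]


def mutual_confusion(cm, class_names):
    """Rank unordered class pairs by total errors in both directions."""
    pairs = []
    for i, j in combinations(range(len(class_names)), 2):
        i_as_j, j_as_i = cm[i][j], cm[j][i]
        if i_as_j + j_as_i > 0:
            pairs.append((class_names[i], class_names[j], i_as_j, j_as_i))
    return sorted(pairs, key=lambda p: -(p[2] + p[3]))


ranked = mutual_confusion(cm, classes)
for a, b, a_as_b, b_as_a in ranked:
    print(f"{a} <-> {b}: {a_as_b + b_as_a} errors "
          f"({a} -> {b}: {a_as_b}, {b} -> {a}: {b_as_a})")
```

Here the antihypertensive and antibiotic pair ranks first with three errors split across both directions, while anticoagulant and antidiabetic are confused once and only one way.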

Threshold Effect on the Matrix

Python
import numpy as np
from sklearn.metrics import confusion_matrix

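# Assumes `model` is a fitted binary classifier with predict_proba, and that
# X_test / y_test are held-out features and labels (not defined in this snippet)
y_proba = model.predict_proba(X_test)[:, 1]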

# Lower threshold → more positive predictions → TP and FP increase, FN decreases
print(f"{'Threshold':>10}  {'TN':>6}  {'FP':>6}  {'FN':>6}  {'TP':>6}")
print("-" * 40)
for threshold in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    y_pred_t = (y_proba >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class appears in the data
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t, labels=[0, 1]).ravel()
    print(f"{threshold:>10.1f}  {tn:>6}  {fp:>6}  {fn:>6}  {tp:>6}")

# As threshold drops: FN shrinks (fewer missed positives), FP grows (more false alarms)
# Pick threshold based on which error is more costly
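
One way to make "which error is more costly" explicit is to weight the two error cells and pick the threshold with the lowest total. The sketch below is dependency-free and rests on loud assumptions: the scores and labels are made-up toy values (not output from any real model), and the 5:1 cost ratio is hypothetical. Real cost weights would have to come from clinical judgment.

```python
# Illustrative scores and labels only; not output from any real model.
toy_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# Assumed relative costs: a missed high-risk patient counts 5x a false alarm.
COST_FN = 5.0
COST_FP = 1.0


def cells_at(threshold, y_true, y_proba):
    """Count (tn, fp, fn, tp) when scores >= threshold are predicted positive."""
    tn = fp = fn = tp = 0
    for actual, score in zip(y_true, y_proba):
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def total_cost(threshold):
    """Weighted error cost at a threshold, using the assumed cost weights."""
    tn, fp, fn, tp = cells_at(threshold, toy_true, toy_proba)
    return COST_FN * fn + COST_FP * fp


thresholds = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
for t in thresholds:
    print(t, cells_at(t, toy_true, toy_proba), total_cost(t))

best = min(thresholds, key=total_cost)
print("Lowest-cost threshold:", best)  # 0.4 for this toy data
```

With these toy values, 0.4 is the lowest threshold-cost point because it catches all three positives at the price of two false alarms. Raising the threshold to 0.5 removes one false alarm but misses a positive, which the assumed weights penalize more heavily.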

Interview Answer Template

Q: Walk me through how to read a confusion matrix.

A: A confusion matrix is a grid of actual versus predicted labels. In sklearn's layout (rows = actual, columns = predicted) for binary classification, the top-left cell is true negatives (correctly predicted no event), the top-right is false positives (false alarms), the bottom-left is false negatives (missed events), and the bottom-right is true positives (correctly flagged events). The diagonal holds what the model got right; the off-diagonal cells hold the errors. The key question is which off-diagonal cell is larger and what that error costs in context. In a sepsis model, false negatives matter most because a missed sepsis patient goes untreated. In an alerting system, false positives matter because they cause alert fatigue. I normalize by actual class (rows) to read recall per class off the diagonal, and by predicted class (columns) to read precision per class. For multi-class problems, I look for off-diagonal cells where the same pair of classes is confused repeatedly, which often indicates that the features do not separate those classes well.
