Systematic ML Debugging

The ML Error Taxonomy

Layer 1 — Data errors
  Missing values not handled
  Feature-label mismatch (wrong row order after merge)
  Label leakage (future data in features)
  Encoding errors (test data contains categories OneHotEncoder never saw in training)
  Wrong target column
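
As an illustration of the encoding error above, here is a minimal sketch of how scikit-learn's OneHotEncoder reacts to a category it never saw during fitting. The toy arrays are made up for this example. Whether silently ignoring unknown categories is acceptable depends on the use case.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): category "C" appears only at prediction time
X_train = np.array([["A"], ["B"], ["A"]])
X_test = np.array([["B"], ["C"]])

# With the default handle_unknown="error", transform raises on the unseen category
strict = OneHotEncoder().fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Strict encoder failed: {exc}")

# With handle_unknown="ignore", an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = lenient.transform(X_test).toarray()
print(encoded)
```

The all-zero row keeps the pipeline running, but it is worth logging how often it happens, since a rising rate of unknown categories is itself a drift signal.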

Layer 2 — Pipeline errors
  Scaler fitted on all data (leakage)
  Imputer not inside CV fold
  Different preprocessing in training vs inference
  Dependency version mismatch
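
The first two pipeline errors can be made concrete with a small sketch on synthetic data (the arrays and the 160-row "training fold" are assumptions for illustration). A scaler fitted on all rows learns statistics that include validation rows. Wrapping the scaler in a Pipeline lets cross_val_score refit it on each training fold instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# Leaky pattern: statistics computed on all rows, including future validation rows
leaky_scaler = StandardScaler().fit(X)
# What a training fold would actually see (first 160 rows as a stand-in)
fold_scaler = StandardScaler().fit(X[:160])
print("Mean from all rows:  ", leaky_scaler.mean_.round(3))
print("Mean from train rows:", fold_scaler.mean_.round(3))

# Safe pattern: the scaler is a pipeline step, so it is refit inside every CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"Leak-free CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

For a plain scaler the numerical effect of the leak is often small. The same pattern matters much more for target encoders, feature selectors, and imputers that use label or neighbor information.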

Layer 3 — Model errors
  Wrong model class for the task (regression output for classification)
  Severely imbalanced classes with no correction
  Regularization too strong → all predictions = majority class
  Learning rate too large → NaN loss

Layer 4 — Evaluation errors
  Evaluating on training data
  Wrong metric for the task (accuracy on imbalanced data)
  Not comparing to baseline
  Threshold at 0.5 when prevalence is 5%
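
A small sketch of why accuracy misleads on imbalanced data, using a made-up label vector with 5% positives. A classifier that always predicts the majority class looks strong on accuracy and useless on every metric that accounts for the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical dataset: 5 positives, 95 negatives, no informative features
y = np.array([1] * 5 + [0] * 95)
X = np.zeros((100, 1))

# Always predicts the majority class (0)
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

acc = accuracy_score(y, y_pred)           # fraction of correct predictions
rec = recall_score(y, y_pred)             # fraction of positives that were found
bal = balanced_accuracy_score(y, y_pred)  # mean of per-class recall

print(f"Accuracy:          {acc:.2f}")
print(f"Recall (positive): {rec:.2f}")
print(f"Balanced accuracy: {bal:.2f}")
```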

Layer 5 — Production errors
  Data drift
  Schema change
  Pipeline serialization mismatch
  Delayed ground truth not accounted for
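
Schema changes are among the easier production errors to catch early. The following is a minimal sketch of a schema check, not a full validation library: `EXPECTED_SCHEMA` and `validate_schema` are hypothetical names, and the check only covers column presence and NumPy dtype kind.

```python
import pandas as pd

# Hypothetical schema: column name -> expected dtype kind ("i" integer, "f" float)
EXPECTED_SCHEMA = {"age": "i", "serum_creatinine": "f", "num_medications": "i"}

def validate_schema(df, schema):
    """Return a list of schema problems; an empty list means the frame passes."""
    problems = []
    missing = [col for col in schema if col not in df.columns]
    extra = [col for col in df.columns if col not in schema]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    for col, kind in schema.items():
        if col in df.columns and df[col].dtype.kind != kind:
            problems.append(
                f"{col}: expected dtype kind '{kind}', got '{df[col].dtype.kind}'"
            )
    return problems

good = pd.DataFrame({"age": [72], "serum_creatinine": [3.4], "num_medications": [14]})
bad = pd.DataFrame({"age": ["72"], "serum_creatinine": [3.4]})  # string age, missing column

good_problems = validate_schema(good, EXPECTED_SCHEMA)
bad_problems = validate_schema(bad, EXPECTED_SCHEMA)
print(good_problems)
print(bad_problems)
```

Running a check like this at the prediction endpoint turns a silent schema change into an explicit, loggable rejection.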

The Debugging Ladder

Work from bottom to top — fix lower layers before investigating higher ones.

Python

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np

def run_debugging_ladder(X_train, y_train, feature_names):
    """
    Run the debugging ladder: each rung must pass before moving up.
    """
    print("=== ML Debugging Ladder ===\n")

    # Rung 0: Raw data checks
    print("Rung 0: Data Sanity Checks")
    assert X_train.shape[0] == len(y_train), "Feature-label length mismatch"
    nan_pct = np.isnan(X_train).sum() / X_train.size
    print(f"  NaN fraction: {nan_pct:.3%}")
    print(f"  Class distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
    print(f"  Feature range: [{X_train.min():.2f}, {X_train.max():.2f}]")
    print()

    # Rung 1: Baseline
    print("Rung 1: Baseline (Dummy Classifier)")
    dummy = DummyClassifier(strategy="most_frequent")
    scores = cross_val_score(dummy, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"  Dummy AUC: {scores.mean():.3f} (target: your model should be higher)")
    print()

    # Rung 2: Simple model (verify pipeline)
    print("Rung 2: Simple Model (Logistic Regression)")
    pipeline_simple = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler",  StandardScaler()),
        ("model",   LogisticRegression(max_iter=1000)),
    ])
    scores_lr = cross_val_score(pipeline_simple, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"  Logistic Regression AUC: {scores_lr.mean():.3f} ± {scores_lr.std():.3f}")
    if scores_lr.mean() < scores.mean() + 0.02:
        print("  WARNING: barely above dummy — check for data or label issues")
    print()

    # Rung 3: Non-linear model
    print("Rung 3: Non-Linear Model (Random Forest)")
    pipeline_rf = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("model", RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
    ])
    scores_rf = cross_val_score(pipeline_rf, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"  Random Forest AUC: {scores_rf.mean():.3f} ± {scores_rf.std():.3f}")
    improvement = scores_rf.mean() - scores_lr.mean()
    print(f"  Improvement over LR: {improvement:.3f}")
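    if improvement < 0:
        # Tree ensembles are insensitive to feature scale, so scaling is not the cause
        print("  RF worse than LR → signal may be mostly linear, or max_depth=5 is too restrictive")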
    print()

    return {
        "dummy_auc": scores.mean(),
        "lr_auc":    scores_lr.mean(),
        "rf_auc":    scores_rf.mean(),
    }

results = run_debugging_ladder(X_train, y_train, feature_names)

Error Analysis: What Is the Model Getting Wrong?

Python

from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd

def analyze_errors(model, X_val, y_val, feature_names, X_val_df=None):
    """
    Identify which samples the model gets wrong and why.
    """
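    y_val = np.asarray(y_val)  # positional array avoids pandas index misalignment
    y_pred = model.predict(X_val)
    y_proba = model.predict_proba(X_val)[:, 1]

    # Where are the errors?
    fn_mask = (y_val == 1) & (y_pred == 0)  # false negatives
    fp_mask = (y_val == 0) & (y_pred == 1)  # false positives

    # Rates are relative to the actual positives/negatives, not to all samples
    fn_rate = fn_mask.sum() / max((y_val == 1).sum(), 1)
    fp_rate = fp_mask.sum() / max((y_val == 0).sum(), 1)
    print(f"False Negatives: {fn_mask.sum()} ({fn_rate:.1%} of all positives)")
    print(f"False Positives: {fp_mask.sum()} ({fp_rate:.1%} of all negatives)")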

    if X_val_df is not None:
        # What's different about the errors?
        print("\nFeature comparison (correct vs FN):")
        correct_pos = X_val_df[(y_val == 1) & ~fn_mask]  # true positives
        wrong_pos   = X_val_df[fn_mask]                  # false negatives

        for col in X_val_df.columns:
            correct_mean = correct_pos[col].mean()
            wrong_mean   = wrong_pos[col].mean()
            diff = wrong_mean - correct_mean
            if abs(diff) > 0.5 * correct_pos[col].std():
                print(f"  {col}: correct_pos mean={correct_mean:.2f}, FN mean={wrong_mean:.2f}")

    # Predicted probability of the positive class for each error type
    print(f"\nMean P(y=1) for FN: {y_proba[fn_mask].mean():.3f}")
    print(f"Mean P(y=1) for FP: {y_proba[fp_mask].mean():.3f}")
    # Near the 0.5 cutoff → model is uncertain (might be fixable with more data)
    # FN near 0 or FP near 1 → confidently wrong, a systematic mistake (needs investigation)
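
The false-negative and false-positive counts can be cross-checked against scikit-learn's `confusion_matrix`. This is a minimal sketch on made-up labels. For binary labels `[0, 1]`, flattening the matrix yields the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

fn_rate = fn / (fn + tp)  # share of actual positives that were missed
fp_rate = fp / (fp + tn)  # share of actual negatives that were flagged

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"FN rate: {fn_rate:.1%} of positives, FP rate: {fp_rate:.1%} of negatives")
```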

Reproducing Production Bugs Locally

Python

# When a production bug is reported:
# 1. Get the exact input that caused the issue
# 2. Run it through the saved pipeline locally
# 3. Compare to the production output

import joblib
import pandas as pd

# Load the exact same pipeline as production
pipeline = joblib.load("readmission_pipeline_v2.3.joblib")

# Input from the bug report (exact values)
bug_report_input = {
    "age": 72,
    "weight_kg": 88,
    "serum_creatinine": 3.4,
    "hba1c": None,          # was missing in the report
    "num_medications": 14,
    "prior_admissions": 4,
}

# One-row DataFrame with the same column names as the training data
X_debug = pd.DataFrame([bug_report_input])

# Run through pipeline
try:
    prob = pipeline.predict_proba(X_debug)[0, 1]
    print(f"Prediction: {prob:.3f}")
except Exception as e:
    print(f"Pipeline error: {e}")
    # This error is the bug — trace the specific preprocessing step that failed
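
To trace which preprocessing step failed, one option is to apply the fitted steps one at a time. The sketch below uses a small stand-in pipeline trained on made-up data. `trace_pipeline` is a hypothetical helper, and it assumes a plain scikit-learn Pipeline whose last step is the estimator.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def trace_pipeline(pipeline, X):
    """Apply each fitted preprocessing step in turn; return the name of the first that fails."""
    Xt = X
    for name, step in pipeline.steps[:-1]:  # the last step is the estimator
        try:
            Xt = step.transform(Xt)
        except Exception as exc:
            print(f"Step '{name}' failed: {type(exc).__name__}: {exc}")
            return name
        print(f"Step '{name}' ok, output shape {np.shape(Xt)}")
    return None

# Stand-in pipeline and training data (hypothetical values)
train = pd.DataFrame({"age": [50, 60, 70, 80], "hba1c": [5.5, 6.0, np.nan, 8.1]})
labels = [0, 0, 1, 1]
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(train, labels)

good_row = pd.DataFrame([{"age": 72, "hba1c": np.nan}])  # missing value, handled by imputer
bad_row = pd.DataFrame([{"age": 72, "hba1c": "n/a"}])    # string where a number is expected

good_result = trace_pipeline(pipe, good_row)
bad_result = trace_pipeline(pipe, bad_row)
```

Knowing the failing step narrows the search to one transformer and the columns it touches.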

Debugging Checklist

DEVELOPMENT CHECKLIST:

Data layer:
□ No NaN/Inf in features
□ Feature and label arrays have same length
□ No target leakage (check with depth-1 tree)
□ Train/val/test splits don't overlap
□ Preprocessing fitted only on training folds (use Pipeline)

Model layer:
□ Model beats dummy classifier by a meaningful margin
□ Training loss decreases (model is learning)
□ Train/val AUC gap is acceptable (under 0.10)
□ Threshold tuned on val set, not default 0.5

Evaluation layer:
□ Using the right metric (AUC-PR for imbalanced, not just accuracy)
□ CV scores used for model comparison, not test set
□ Test set evaluated exactly once
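
The depth-1 tree leakage check from the data-layer list can be sketched as follows. This variation fits one stump per feature so the suspect column is named directly. `single_feature_auc`, the synthetic data, and the 0.95 cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def single_feature_auc(X, y, feature_names, cv=5):
    """Cross-validated AUC of a depth-1 tree trained on each feature alone."""
    aucs = {}
    for j, name in enumerate(feature_names):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores = cross_val_score(stump, X[:, [j]], y, cv=cv, scoring="roc_auc")
        aucs[name] = scores.mean()
    return aucs

# Synthetic data: one pure-noise feature, one near-copy of the label
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = np.column_stack([
    rng.normal(size=400),                  # noise
    y + rng.normal(scale=0.01, size=400),  # leaked label
])

aucs = single_feature_auc(X, y, ["noise", "leaky"])
suspects = [name for name, auc in aucs.items() if auc > 0.95]  # illustrative cutoff

for name, auc in aucs.items():
    print(f"  {name}: AUC = {auc:.3f}")
print(f"Leakage suspects: {suspects}")
```

A single feature that nearly separates the classes on its own is not proof of leakage, but it is a strong reason to check how and when that feature is recorded.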

PRODUCTION CHECKLIST:

□ Fitted pipeline serialized and loaded (not recreated)
□ Schema validation at prediction endpoint
□ Prediction distribution monitored vs training baseline
□ Feature drift monitored for key features (PSI/KS)
□ Rolling AUC monitored as ground truth arrives
□ OOD detection for extreme input values
□ Alert thresholds defined and tested
□ Retrain trigger defined (AUC drop > 0.05, PSI > 0.25)
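
The PSI/KS drift check can be sketched as below. `population_stability_index` is a hypothetical helper using quantile bins from the training sample; bin count and smoothing constant are assumptions. The KS test comes from `scipy.stats.ks_2samp`. The 0.25 cutoff is the retrain trigger from the checklist.

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, binned on quantiles of the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior quantile edges; values beyond them fall into the outermost bins
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_bins = np.searchsorted(inner_edges, reference, side="right")
    cur_bins = np.searchsorted(inner_edges, current, side="right")
    ref_frac = np.bincount(ref_bins, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_bins, minlength=n_bins) / len(current)
    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic feature: training sample, a stable production sample, a shifted one
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
stable_feature = rng.normal(0.0, 1.0, size=5000)
shifted_feature = rng.normal(1.0, 1.0, size=5000)

psi_stable = population_stability_index(train_feature, stable_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)
ks_shifted = ks_2samp(train_feature, shifted_feature)

print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}  retrain trigger hit: {psi_shifted > 0.25}")
print(f"KS (shifted):  statistic={ks_shifted.statistic:.3f}, p={ks_shifted.pvalue:.2e}")
```

With large samples the KS p-value becomes significant for even tiny shifts, so the KS statistic or PSI is usually the better quantity to alert on.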

Interview Answer Template

Q: Walk me through how you'd systematically debug an underperforming ML model.

I follow a layered approach: fix the bottom layer before investigating higher ones. I start with a data sanity check: NaN/Inf counts, feature-label alignment, class distribution, and a leakage check (a depth-1 tree reaching a suspiciously high AUC is a red flag). Then I establish a baseline with a dummy classifier, which the model must beat by a meaningful margin. Next, I run a simple linear model to verify that the pipeline is correct (imputer, scaler, cross-validation setup). If the linear model barely beats the dummy, the problem is in the data or the pipeline, not model capacity. Only then do I try a non-linear model. For each failure mode, I have a specific fix: NaN → add an imputer; all predictions identical → check class imbalance or over-regularization; NaN loss → check learning rate and input scale; suspiciously high scores → check for target leakage. For production failures, I add checks specific to deployment: schema validation, feature drift detection, and pipeline serialization mismatch. The rule is always: data first, pipeline second, model third.
