Systematic ML Debugging
A reproducible, step-by-step framework for debugging ML models: error taxonomy, the debugging ladder, tools for each layer, and a checklist for both development and production failures.
The ML Error Taxonomy
Layer 1 — Data errors
Missing values not handled
Feature-label mismatch (wrong row order after merge)
Label leakage (future data in features)
Encoding errors (test data contains categories the OneHotEncoder never saw during training)
Wrong target column
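One way to screen for the label leakage listed above is to fit a depth-1 decision tree on each feature in isolation. A single feature that almost separates the classes on its own deserves a closer look. The sketch below is illustrative only: the helper `single_feature_auc`, the synthetic data, and the 0.95 cutoff are assumptions made for this example, not fixed rules.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_auc(X, y, feature_names, cv=5):
    """Cross-validated AUC of a depth-1 tree trained on each feature alone."""
    results = {}
    for j, name in enumerate(feature_names):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores = cross_val_score(stump, X[:, [j]], y, cv=cv, scoring="roc_auc")
        results[name] = scores.mean()
    return results


# Synthetic data (illustrative only): "leaky" is a noisy copy of the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([
    rng.normal(size=500),                   # pure noise, unrelated to y
    y + rng.normal(scale=0.05, size=500),   # almost identical to the label
])

aucs = single_feature_auc(X, y, ["noise", "leaky"])
for name, auc in sorted(aucs.items(), key=lambda kv: -kv[1]):
    flag = "  <-- suspicious, inspect how this feature is built" if auc > 0.95 else ""
    print(f"{name}: AUC={auc:.3f}{flag}")
```

A high single-feature AUC is a prompt to check how and when the feature is computed, not proof of leakage; some legitimate features are simply very predictive.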
Layer 2 — Pipeline errors
Scaler fitted on all data (leakage)
Imputer not inside CV fold
Different preprocessing in training vs inference
Dependency version mismatch
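The first two pipeline errors share one fix: put the preprocessing inside a `Pipeline` so that each cross-validation fold fits its own imputer and scaler. The following is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration) that contrasts the leaky pattern with the safe one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan   # inject roughly 10% missing values

# Leaky: medians, means, and variances are computed on ALL rows,
# so every validation fold has already influenced the preprocessing.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_leaky = StandardScaler().fit_transform(X_imputed)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5, scoring="roc_auc"
)

# Safe: cross_val_score clones the pipeline and refits every step
# on the training part of each fold only.
safe_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X, y, cv=5, scoring="roc_auc")

print(f"Leaky preprocessing AUC: {leaky_scores.mean():.3f}")
print(f"Pipeline-in-CV AUC:      {safe_scores.mean():.3f}")
```

For simple scaling and median imputation the two numbers are often close, so a small gap here does not mean the leak is harmless in general. Steps that use the label or select features can inflate the leaky score far more.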
Layer 3 — Model errors
Wrong model class for the task (regression output for classification)
Severely imbalanced classes with no correction
Regularization too strong → all predictions = majority class
Learning rate too large → NaN loss
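The "all predictions = majority class" symptom is cheap to detect: count the distinct classes the fitted model actually predicts. The sketch below uses synthetic data and a hypothetical helper `predicted_classes`; the specific values of `C` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (illustrative only): about 10% positives,
# with positives shifted by 1.5 in every feature.
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.10).astype(int)
X = rng.normal(size=(n, 3)) + 1.5 * y[:, None]


def predicted_classes(model, X, y):
    """Fit the model and return the set of classes it predicts on X."""
    return set(np.unique(model.fit(X, y).predict(X)).tolist())


# Very small C means very strong L2 regularization: coefficients shrink
# toward zero and only the intercept (the class prior) remains.
over_regularized = LogisticRegression(C=1e-6, max_iter=1000)

# Weaker regularization plus class weighting as an imbalance correction.
corrected = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)

collapsed = predicted_classes(over_regularized, X, y)
recovered = predicted_classes(corrected, X, y)
print(f"Over-regularized model predicts classes: {collapsed}")
print(f"Corrected model predicts classes:        {recovered}")
```

If the set contains a single class, check regularization strength and class imbalance before trying a more complex model.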
Layer 4 — Evaluation errors
Evaluating on training data
Wrong metric for the task (accuracy on imbalanced data)
Not comparing to baseline
Threshold at 0.5 when prevalence is 5%
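Two of the evaluation errors above can be shown with one small experiment at roughly 5% prevalence: accuracy rewards predicting "negative" for everyone, and the default 0.5 threshold can miss most positives even when the scores rank them well. This is a sketch under stated assumptions: the scores are simulated rather than produced by a real model, and maximizing F1 is only one possible criterion for choosing a threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve

# Simulated validation set (illustrative only): about 5% positives.
rng = np.random.default_rng(0)
n = 2000
y_val = (rng.random(n) < 0.05).astype(int)
# Hypothetical scores: positives score higher, but mostly stay below 0.5.
y_proba = np.clip(0.10 + 0.25 * y_val + rng.normal(scale=0.08, size=n), 0, 1)

# Accuracy of a model that never predicts the positive class
always_negative_acc = accuracy_score(y_val, np.zeros(n, dtype=int))

# Choose the threshold that maximizes F1 on the validation set
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])   # the final precision/recall pair has no threshold
best_threshold = thresholds[best]

f1_default = f1_score(y_val, (y_proba >= 0.5).astype(int), zero_division=0)
f1_tuned = f1_score(y_val, (y_proba >= best_threshold).astype(int), zero_division=0)

print(f"Accuracy of always-negative: {always_negative_acc:.3f}")
print(f"F1 at threshold 0.5:         {f1_default:.3f}")
print(f"F1 at threshold {best_threshold:.2f}:        {f1_tuned:.3f}")
```

In practice the threshold would be chosen on validation data using the real costs of false positives and false negatives, then confirmed once on the test set.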
Layer 5 — Production errors
Data drift
Schema change
Pipeline serialization mismatch
Delayed ground truth not accounted for

The Debugging Ladder
Work from bottom to top — fix lower layers before investigating higher ones.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
def run_debugging_ladder(X_train, y_train, feature_names):
    """
    Run the debugging ladder: each rung must pass before moving up.
    """
    print("=== ML Debugging Ladder ===\n")

    # Rung 0: Raw data checks
    print("Rung 0: Data Sanity Checks")
    assert X_train.shape[0] == len(y_train), "Feature-label length mismatch"
    nan_pct = np.isnan(X_train).sum() / X_train.size
    print(f" NaN fraction: {nan_pct:.3%}")
    print(f" Class distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
    # nanmin/nanmax so a single NaN does not turn the whole range into NaN
    print(f" Feature range: [{np.nanmin(X_train):.2f}, {np.nanmax(X_train):.2f}]")
    print()

    # Rung 1: Baseline
    print("Rung 1: Baseline (Dummy Classifier)")
    dummy = DummyClassifier(strategy="most_frequent")
    scores = cross_val_score(dummy, X_train, y_train, cv=5, scoring="roc_auc")
    print(f" Dummy AUC: {scores.mean():.3f} (target: your model should be higher)")
    print()

    # Rung 2: Simple model (verify pipeline)
    print("Rung 2: Simple Model (Logistic Regression)")
    pipeline_simple = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    scores_lr = cross_val_score(pipeline_simple, X_train, y_train, cv=5, scoring="roc_auc")
    print(f" Logistic Regression AUC: {scores_lr.mean():.3f} ± {scores_lr.std():.3f}")
    # Compare mean to mean (not to the first dummy fold only)
    if scores_lr.mean() < scores.mean() + 0.02:
        print(" WARNING: barely above dummy — check for data or label issues")
    print()

    # Rung 3: Non-linear model
    print("Rung 3: Non-Linear Model (Random Forest)")
    pipeline_rf = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("model", RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
    ])
    scores_rf = cross_val_score(pipeline_rf, X_train, y_train, cv=5, scoring="roc_auc")
    print(f" Random Forest AUC: {scores_rf.mean():.3f} ± {scores_rf.std():.3f}")
    improvement = scores_rf.mean() - scores_lr.mean()
    print(f" Improvement over LR: {improvement:.3f}")
    if improvement < 0:
        # Trees are insensitive to feature scale, so scaling is not the cause here
        print(" RF worse than LR → signal may be mostly linear, or the forest is over-constrained (max_depth=5)")
    print()

    return {
        "dummy_auc": scores.mean(),
        "lr_auc": scores_lr.mean(),
        "rf_auc": scores_rf.mean(),
    }


results = run_debugging_ladder(X_train, y_train, feature_names)

Error Analysis: What Is the Model Getting Wrong?
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
def analyze_errors(model, X_val, y_val, feature_names, X_val_df=None):
    """
    Identify which samples the model gets wrong and why.
    """
    y_val = np.asarray(y_val)  # positional masks, independent of any pandas index
    y_pred = model.predict(X_val)
    y_proba = model.predict_proba(X_val)[:, 1]

    # Where are the errors?
    fn_mask = (y_val == 1) & (y_pred == 0)  # false negatives
    fp_mask = (y_val == 0) & (y_pred == 1)  # false positives
    n_pos = max((y_val == 1).sum(), 1)
    n_neg = max((y_val == 0).sum(), 1)
    # Rates are relative to the class size, not to the whole validation set
    print(f"False Negatives: {fn_mask.sum()} ({fn_mask.sum() / n_pos:.1%} of all positives)")
    print(f"False Positives: {fp_mask.sum()} ({fp_mask.sum() / n_neg:.1%} of all negatives)")

    if X_val_df is not None:
        # What's different about the errors?
        print("\nFeature comparison (correct vs FN):")
        correct_pos = X_val_df[(y_val == 1) & (y_pred == 1)]
        wrong_pos = X_val_df[fn_mask]
        for col in X_val_df.columns:
            correct_mean = correct_pos[col].mean()
            wrong_mean = wrong_pos[col].mean()
            diff = wrong_mean - correct_mean
            if abs(diff) > 0.5 * correct_pos[col].std():
                print(f" {col}: correct_pos mean={correct_mean:.2f}, FN mean={wrong_mean:.2f}")

    # Confidence of errors (y_proba is the predicted probability of the positive class)
    print(f"\nMean predicted P(positive) for FN: {y_proba[fn_mask].mean():.3f}")
    print(f"Mean predicted P(positive) for FP: {y_proba[fp_mask].mean():.3f}")
    # Errors with probabilities near the threshold → model is uncertain (might be fixable with more data)
    # Errors with probabilities far from the threshold → systematic mistake (needs investigation)

Reproducing Production Bugs Locally
# When a production bug is reported:
# 1. Get the exact input that caused the issue
# 2. Run it through the saved pipeline locally
# 3. Compare to the production output
import joblib
import numpy as np
import pandas as pd

# Load the exact same pipeline as production
pipeline = joblib.load("readmission_pipeline_v2.3.joblib")

# Input from the bug report (exact values)
bug_report_input = {
    "age": 72,
    "weight_kg": 88,
    "serum_creatinine": 3.4,
    "hba1c": None,  # was missing in the report
    "num_medications": 14,
    "prior_admissions": 4,
}
X_debug = pd.DataFrame([bug_report_input])

# Run through pipeline
try:
    prob = pipeline.predict_proba(X_debug)[0, 1]
    print(f"Prediction: {prob:.3f}")
except Exception as e:
    print(f"Pipeline error: {e}")
    # This error is the bug: trace the specific preprocessing step that failed

Debugging Checklist
DEVELOPMENT CHECKLIST:
Data layer:
□ No NaN/Inf in features
□ Feature and label arrays have same length
□ No target leakage (check with depth-1 tree)
□ Train/val/test splits don't overlap
□ Preprocessing fitted only on training folds (use Pipeline)
Model layer:
□ Model beats dummy classifier by a meaningful margin
□ Training loss decreases (model is learning)
□ Train/val AUC gap is acceptable (under 0.10)
□ Threshold tuned on val set, not default 0.5
Evaluation layer:
□ Using the right metric (AUC-PR for imbalanced, not just accuracy)
□ CV scores used for model comparison, not test set
□ Test set evaluated exactly once
PRODUCTION CHECKLIST:
□ Fitted pipeline serialized and loaded (not recreated)
□ Schema validation at prediction endpoint
□ Prediction distribution monitored vs training baseline
□ Feature drift monitored for key features (PSI/KS)
□ Rolling AUC monitored as ground truth arrives
□ OOD detection for extreme input values
□ Alert thresholds defined and tested
□ Retrain trigger defined (AUC drop > 0.05, PSI > 0.25)

Interview Answer Template
Q: Walk me through how you'd systematically debug an underperforming ML model.
I follow a layered approach: fix the bottom layer before investigating higher ones. I start with a data sanity check: NaN/Inf counts, feature-label alignment, class distribution, and a leakage check (a depth-1 tree reaching a suspiciously high AUC is a red flag). Then I establish a baseline with a dummy classifier, which the model must beat by a meaningful margin. Next, I run a simple linear model to verify that the pipeline (imputer, scaler, cross-validation setup) is correct. If the linear model barely beats the dummy, the problem is in the data or the pipeline, not model capacity. Only then do I try a non-linear model. For each failure mode, I have a specific fix: NaN → add an imputer; identical predictions for every sample → check class imbalance or over-regularization; NaN loss → check learning rate and input scale; suspiciously high scores → check for target leakage. For production failures, I add deployment-specific checks: schema validation, feature drift detection, and pipeline serialization consistency. The rule is always: data first, pipeline second, model third.
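The feature drift check mentioned in the answer can be made concrete with the Population Stability Index (PSI), the statistic the production checklist uses for its retrain trigger. The sketch below is one common way to compute it, not a canonical implementation: the function name, quantile binning, ten bins, and the `eps` floor are all assumptions, and the samples are synthetic.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training-time sample and a production sample of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the training sample; the two outer
    # bins are open-ended so out-of-range production values are still counted.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=n_bins
    )
    act_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=n_bins
    )

    # Floor the fractions so empty bins do not produce log(0)
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Synthetic samples (illustrative only)
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_stable = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_shifted = rng.normal(loc=1.0, scale=1.0, size=5000)

psi_stable = population_stability_index(train_feature, prod_stable)
psi_shifted = population_stability_index(train_feature, prod_shifted)
print(f"PSI, same distribution: {psi_stable:.3f}")
print(f"PSI, mean shifted by 1: {psi_shifted:.3f}")
```

With the checklist's trigger of PSI > 0.25, the shifted sample would raise an alert and the stable one would not. The value depends on the binning scheme, so thresholds should be calibrated on the monitored features rather than taken as universal.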