Interview: ML Debugging Scenario

Interview walk-through: diagnose a production model that was working but suddenly dropped from AUC 0.87 to 0.61 — covering systematic debugging, root cause identification, and remediation.

Asma Hafeez Khan | May 16, 2026 | 5 min read
Tags: Machine Learning, Interview, Debugging, Production, MLOps, Clinical AI

The Scenario

A readmission prediction model has been running in production for 4 months with an AUC of about 0.87. Three weeks ago, AUC dropped to 0.61. The clinical team noticed only because readmissions on wards using the model started to increase. No model or code changes were deployed around the time of the drop. Diagnose the issue.


Step 1: Establish the Timeline

Python
import pandas as pd
from datetime import datetime, timedelta

# First: gather evidence before theorizing
print("=== Establishing Timeline ===\n")

# Collect:
# 1. When exactly did the AUC drop?
# 2. What changed in the data pipeline around that time?
# 3. Did any upstream systems change?

timeline_questions = [
    ("2026-02-10", "Baseline AUC = 0.87 (validated)"),
    ("2026-02-15", "New EHR software version deployed at Site B"),
    ("2026-02-20", "Readmission model reports show AUC decay beginning"),
    ("2026-03-01", "Full AUC drop detected: 0.61"),
    ("2026-03-01", "No code changes were deployed to the model service"),
]

print("Key events:")
for date, event in timeline_questions:
    print(f"  {date}: {event}")

print("\nHypothesis: EHR software change at Site B → schema/encoding change in features")
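To answer "when exactly did the AUC drop?" and to test the Site B hypothesis, it helps to recompute AUC per week and per site from logged predictions. The following is a minimal sketch, assuming each logged record is a dict with `label`, `score`, and a grouping field such as `site` or `week`; those field names and the helper `auc_by_group` are hypothetical, not part of the original pipeline.

```python
from collections import defaultdict

def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(records, key):
    """Compute AUC separately for each value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    result = {}
    for group, rows in groups.items():
        labels = [r["label"] for r in rows]
        scores = [r["score"] for r in rows]
        # AUC is undefined unless the group contains both classes
        if 0 < sum(labels) < len(labels):
            result[group] = auc(labels, scores)
    return result

# Tiny hand-made example (illustrative values only)
records = [
    {"site": "A", "label": 1, "score": 0.9},
    {"site": "A", "label": 0, "score": 0.1},
    {"site": "A", "label": 1, "score": 0.8},
    {"site": "A", "label": 0, "score": 0.3},
    {"site": "B", "label": 1, "score": 0.2},
    {"site": "B", "label": 0, "score": 0.6},
    {"site": "B", "label": 1, "score": 0.7},
    {"site": "B", "label": 0, "score": 0.4},
]
per_site = auc_by_group(records, key="site")
print(per_site)
```

If the drop is confined to one site, or starts in one specific week, that narrows the search to whatever changed there and then.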

Step 2: Check the Prediction Distribution First

Python
import numpy as np
import pandas as pd

# Before debugging the model, check what it's actually outputting
def analyze_prediction_distribution(predictions_before: np.ndarray,
                                    predictions_after: np.ndarray) -> None:
    print("Prediction distribution comparison:")
    print(f"{'Metric':<20}  {'Before (Feb)':>14}  {'After (Mar)':>12}")
    print("-" * 50)
    print(f"{'Mean':<20}  {predictions_before.mean():>14.4f}  {predictions_after.mean():>12.4f}")
    print(f"{'Std':<20}  {predictions_before.std():>14.4f}  {predictions_after.std():>12.4f}")
    print(f"{'% pred > 0.5':<20}  {(predictions_before>0.5).mean():>14.3%}  {(predictions_after>0.5).mean():>12.3%}")
    print(f"{'% pred < 0.1':<20}  {(predictions_before<0.1).mean():>14.3%}  {(predictions_after<0.1).mean():>12.3%}")

# If predictions are all compressed near 0 or 0.5: likely a scaling/encoding issue
# If predictions look normal but AUC is low: concept drift or label issue
# feb_predictions / mar_predictions: 1-D arrays of logged model scores for each period
analyze_prediction_distribution(feb_predictions, mar_predictions)
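Summary statistics can be condensed into a single drift number. One common choice is the Population Stability Index (PSI), which compares the binned score distributions of two periods. The sketch below assumes scores are probabilities in [0, 1] and uses ten equal-width bins. The function name, bin count, and `eps` floor are illustrative choices, not something the original pipeline prescribes.

```python
import math

def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two samples of scores in [0, 1], using equal-width bins."""
    def bin_shares(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # put s == 1.0 in the last bin
            counts[idx] += 1
        total = len(scores)
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / total, eps) for c in counts]

    base = bin_shares(baseline)
    cur = bin_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))

# Illustrative data: a spread-out baseline vs. scores compressed around 0.5
baseline_scores = [i / 100 for i in range(100)]
compressed_scores = [0.45 + i / 1000 for i in range(100)]

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, compressed_scores)
print(f"PSI (same distribution): {psi_same:.3f}")
print(f"PSI (compressed scores): {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a major shift, but the threshold should be tuned against the model's own history.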

Step 3: Feature Distribution Check

Python
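from scipy.stats import ks_2samp
import numpy as np
import pandas as pd

def compare_feature_distributions(X_feb: pd.DataFrame, X_mar: pd.DataFrame) -> pd.DataFrame:
    """Compare feature distributions between two time periods."""
    results = []
    for col in X_feb.columns:
        numeric = (pd.api.types.is_numeric_dtype(X_feb[col])
                   and pd.api.types.is_numeric_dtype(X_mar[col]))
        if numeric:
            ks_stat, pval = ks_2samp(X_feb[col].dropna(), X_mar[col].dropna())
        else:
            # The KS test assumes ordered numeric data; for categorical columns
            # rely on the null-rate comparison instead
            ks_stat, pval = np.nan, np.nan
        null_rate_feb = X_feb[col].isnull().mean()
        null_rate_mar = X_mar[col].isnull().mean()
        null_rate_delta = null_rate_mar - null_rate_feb
        results.append({
            "feature":        col,
            "ks_statistic":   ks_stat,
            "p_value":        pval,
            "null_rate_feb":  null_rate_feb,
            "null_rate_mar":  null_rate_mar,
            "null_rate_delta": null_rate_delta,
            # Flag a shift in values (KS) or a jump in missingness
            "drifted":        bool(pval < 0.01) or abs(null_rate_delta) > 0.10,
        })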

    df = pd.DataFrame(results).sort_values("ks_statistic", ascending=False)
    drifted = df[df["drifted"]]
    print(f"Features with significant drift: {len(drifted)}/{len(df)}")
    print(drifted[["feature", "ks_statistic", "null_rate_delta"]].head(10).to_string(index=False))
    return df

drift_report = compare_feature_distributions(X_feb, X_mar)

# FINDING: "discharge_to" feature null_rate jumped from 0.02 to 0.31 in March
# → upstream system stopped populating this field correctly
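Null rates and KS statistics can miss a categorical field whose values are still populated but have changed vocabulary. A complementary check compares the set of category values seen in each period. This is a minimal sketch: `category_drift` is a hypothetical helper, and the sample values are illustrative rather than taken from real data.

```python
from collections import Counter

def category_drift(before, after):
    """Compare the category values seen in two periods of one feature."""
    seen_before = {v for v in before if v is not None}
    counts_after = Counter(v for v in after if v is not None)

    new_values = sorted(set(counts_after) - seen_before)
    vanished_values = sorted(seen_before - set(counts_after))

    n_after = sum(counts_after.values())
    unseen = sum(counts_after[v] for v in new_values)
    unseen_share = unseen / n_after if n_after else 0.0

    return {
        "new_values": new_values,            # values never seen in the baseline
        "vanished_values": vanished_values,  # baseline values that disappeared
        "unseen_share": unseen_share,        # share of non-null rows with new values
    }

# Illustrative values only
feb_values = ["home", "SNF", "rehab", "home", "home_with_help"]
mar_values = ["home", "01", "02", "SNF", "01", None]

report = category_drift(feb_values, mar_values)
print(report)
```

Any non-empty `new_values` list for a feature that feeds a fitted encoder is worth investigating, because the encoder has never seen those values.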

Step 4: Root Cause Found — Trace the Pipeline

Python
# The EHR software update at Site B changed the discharge destination coding
# Old format: "SNF", "home", "rehab", "home_with_help"
# New format: "01", "02", "03", "04" (numeric codes)

# The OneHotEncoder trained on text categories now receives numeric strings
# → With handle_unknown="ignore", every unseen code encodes as an all-zero row (no error, no NaN)
# → Model's discharge_to features are all zero (silently)

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Simulate the bug
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(np.array([["home"], ["SNF"], ["rehab"], ["home_with_help"]]))

# Old format (correct)
old_input = np.array([["home"], ["SNF"]])
print("Old format encoding:")
print(ohe.transform(old_input))   # → correct one-hot vectors

# New format (buggy — all zeros because "01" is unknown)
new_input = np.array([["01"], ["02"]])
print("\nNew format encoding (handle_unknown='ignore' silently zeros out):")
print(ohe.transform(new_input))   # → all zeros — the bug!

print("\nRoot cause: handle_unknown='ignore' silently zeroed out discharge_to features")
print("The model lost one of its most important predictive features — silently")
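Because the encoder raises no error, the failure has to be detected from its output. One option is to measure how many rows of the one-hot block are entirely zero: with no dropped category, every known value produces exactly one 1, so an all-zero row means the value was unknown. The sketch below works on plain lists of rows (for example, the result of calling `.tolist()` on the encoder output). `all_zero_row_rate`, `check_encoding`, and the 5% threshold are hypothetical. Note the assumption: if the encoder is configured with a `drop` option, an all-zero row can also represent the dropped category, and this check no longer applies as written.

```python
def all_zero_row_rate(encoded_rows):
    """Fraction of one-hot rows that contain no 1 at all."""
    rows = list(encoded_rows)
    if not rows:
        return 0.0
    zero_rows = sum(1 for row in rows if not any(row))
    return zero_rows / len(rows)

def check_encoding(encoded_rows, max_rate=0.05):
    """Raise instead of silently scoring rows whose category was unknown."""
    rate = all_zero_row_rate(encoded_rows)
    if rate > max_rate:
        raise RuntimeError(
            f"{rate:.0%} of rows have an all-zero one-hot block; "
            "the upstream category vocabulary may have changed"
        )
    return rate

# Illustrative batch: two known categories, two unknown ones
encoded_batch = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
batch_rate = all_zero_row_rate(encoded_batch)
print(f"All-zero rows: {batch_rate:.0%}")
```

Running `check_encoding` on each scoring batch would have turned this incident from a three-week silent degradation into an immediate error.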

Step 5: Validate the Hypothesis

Python
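# Retrain the model without discharge_to: does performance fall toward 0.61?
# Note: this measures how much signal the feature carries. Production is the harsher
# case, because the deployed model was trained WITH the feature and now receives zeros.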
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import numpy as np

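# Assumes X_train holds already-encoded numeric columns and feature_names lists them;
# the prefix match also drops one-hot columns such as discharge_to_home
features_with    = list(feature_names)   # all features including discharge_to
features_without = [f for f in feature_names if not f.startswith("discharge_to")]   # remove it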

model_full = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)
model_no_dc = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)

auc_full  = cross_val_score(model_full,  X_train[features_with],    y_train, cv=5, scoring="roc_auc")
auc_no_dc = cross_val_score(model_no_dc, X_train[features_without], y_train, cv=5, scoring="roc_auc")

print(f"With discharge_to:    AUC = {auc_full.mean():.3f}")
print(f"Without discharge_to: AUC = {auc_no_dc.mean():.3f}")

# If without ≈ 0.61: confirmed, losing discharge_to caused the drop
print("\nHypothesis CONFIRMED" if abs(auc_no_dc.mean() - 0.61) < 0.05 else "Hypothesis needs revision")
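Retraining without the feature is not quite what happened in production, where the already-trained model kept its weights and simply received zeros. A closer reproduction is an inference-time ablation: keep the model frozen, zero out the feature, and re-score the same labelled data. The sketch below illustrates the idea on synthetic data with a hand-written scoring function standing in for the deployed model. The weights, probabilities, and names (`frozen_model`, `is_snf`, `other`) are all made up for illustration, so the resulting AUC values say nothing about the real system.

```python
import random

def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def frozen_model(is_snf, other):
    """Hypothetical stand-in for the deployed model: fixed weights, no retraining."""
    return 2.0 * is_snf + 0.3 * other

# Synthetic patients: discharge to SNF raises readmission risk; `other` is pure noise
rng = random.Random(0)
rows = []
for _ in range(400):
    is_snf = 1 if rng.random() < 0.3 else 0
    other = rng.gauss(0.0, 1.0)
    readmit_prob = 0.6 if is_snf else 0.1
    label = 1 if rng.random() < readmit_prob else 0
    rows.append((is_snf, other, label))

labels = [label for _, _, label in rows]
scores_intact = [frozen_model(is_snf, other) for is_snf, other, _ in rows]
# Ablation: the same frozen model, but the feature arrives as zero for every row
scores_zeroed = [frozen_model(0, other) for _, other, _ in rows]

auc_intact = auc(labels, scores_intact)
auc_zeroed = auc(labels, scores_zeroed)
print(f"Feature intact: AUC = {auc_intact:.3f}")
print(f"Feature zeroed: AUC = {auc_zeroed:.3f}")
```

With the real model, the equivalent test is to score the February data twice, once as logged and once with the `discharge_to` one-hot columns set to zero, and check whether the second AUC lands near 0.61.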

Step 6: Remediation

Python
# Short-term fix: validate and handle the encoding mismatch

CODE_MAP = {"01": "home", "02": "SNF", "03": "rehab", "04": "home_with_help"}

def normalize_discharge_to(discharge_value: str) -> str:
    """Normalize discharge_to across EHR versions."""
    if discharge_value in CODE_MAP:
        return CODE_MAP[discharge_value]
    return discharge_value   # return as-is if already in text format

# Add validation to the prediction endpoint
def validate_and_normalize(patient_features: dict) -> dict:
    value = patient_features.get("discharge_to")
    # Leave missing values missing: str(None) would create a bogus "None" category
    if value is not None:
        patient_features["discharge_to"] = normalize_discharge_to(str(value))
    return patient_features

# Long-term fixes:
# 1. Add schema validation at the ingestion layer (fail loudly, not silently)
# 2. Add monitoring: alert if any feature null_rate increases by > 0.10
# 3. Add prediction distribution monitoring: alert if mean prediction shifts by > 2 std
# 4. Replace handle_unknown='ignore' with a sentinel value that the model explicitly handles

print("Short-term fix: add code normalization to preprocessing")
print("Long-term fix: schema validation + prediction monitoring")
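The long-term fixes can be sketched in a few lines. The example below shows a strict validator that rejects unexpected `discharge_to` values, plus two monitoring checks that use the thresholds from the list above (a null-rate increase above 0.10, and a mean prediction shift above 2 standard deviations). All names here (`SchemaError`, `validate_discharge_to`, `null_rate_alert`, `prediction_shift_alert`) are hypothetical, and the allowed set is simply the four text categories from this scenario.

```python
from statistics import mean, stdev

# Text categories the encoder was fitted on in this scenario
ALLOWED_DISCHARGE = {"home", "SNF", "rehab", "home_with_help"}

class SchemaError(ValueError):
    """Raised when an incoming record violates the expected feature schema."""

def validate_discharge_to(record: dict) -> dict:
    """Fail loudly on any value the encoder has never seen."""
    value = record.get("discharge_to")
    if value is not None and value not in ALLOWED_DISCHARGE:
        raise SchemaError(f"unexpected discharge_to value: {value!r}")
    return record

def null_rate_alert(rate_before: float, rate_after: float,
                    max_increase: float = 0.10) -> bool:
    """True if a feature's null rate rose by more than max_increase."""
    return (rate_after - rate_before) > max_increase

def prediction_shift_alert(baseline_scores, current_scores, n_std: float = 2.0) -> bool:
    """True if the current mean score is more than n_std baseline
    standard deviations away from the baseline mean."""
    base_mean = mean(baseline_scores)
    base_std = stdev(baseline_scores)
    return abs(mean(current_scores) - base_mean) > n_std * base_std

# The null-rate jump from Step 3 would have triggered the alert
print(null_rate_alert(0.02, 0.31))

# A raw numeric code is rejected instead of being silently zeroed out
try:
    validate_discharge_to({"discharge_to": "01"})
except SchemaError as err:
    print(f"Rejected: {err}")
```

In practice the validator would run after the short-term normalization step, so known numeric codes are mapped first and only unrecognized values are rejected.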

What Interviewers Want to Hear

  1. Timeline first, hypothesis second — don't guess; gather evidence before diagnosing
  2. Check predictions before features — is the model making different predictions, or is the ground truth different?
  3. Feature drift is the most common production failure — always check distribution shifts
  4. Identify the silent failure mode — handle_unknown='ignore' is a classic silent bug
  5. Validate the hypothesis — confirm by ablation (remove the feature, reproduce the AUC drop)
  6. Two-part fix — short-term (stop the bleeding) and long-term (prevent recurrence)

One-line answer: "I'd check the prediction distribution first (is the model outputting differently?), then run feature drift analysis (which features shifted?), then trace the cause upstream. In this scenario: EHR software change → new encoding format → OneHotEncoder silently zeros out discharge_to features → model loses a key predictor. Fix: normalize the encoding at the preprocessing layer and add schema validation to fail loudly, not silently."
