Statistics & Math for AI/ML Interviews · Lesson 30 of 30
Sampling Interview Traps
Trap 1: Random Split When Order Matters
Interview question: "Your model predicts hospital readmission 30 days after discharge.
How do you split your dataset?"
Wrong answer: random 80/20 split
Problem: temporal leakage
If patient A has a January admission in train and a February admission in test,
the model has already seen that patient's later history, so the two splits are not independent
Worse: if the features for a January discharge include events recorded after
that discharge, you're leaking the future into training
Correct answer: time-based split
Train: discharges before date X
Test: discharges after date X (leave a 30-day gap before X so every training outcome is observed before the test period starts)
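A minimal sketch of the time-based split, assuming a pandas DataFrame with a hypothetical `discharge_date` column and reading the 30-day buffer as a gap between the last training discharge and date X. The column names, the cutoff, and the toy rows are illustrative assumptions, not part of the original question.

```python
import pandas as pd


def time_based_split(df, cutoff, buffer_days=30, date_col="discharge_date"):
    """Split on discharge date, leaving a follow-up gap before the cutoff.

    Assumption: training rows must be discharged at least `buffer_days`
    before `cutoff`, so their 30-day outcome is known before the test period.
    """
    cutoff = pd.Timestamp(cutoff)
    train_end = cutoff - pd.Timedelta(days=buffer_days)
    train = df[df[date_col] < train_end]
    test = df[df[date_col] >= cutoff]
    return train, test


# Toy data (made up for illustration)
demo = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "discharge_date": pd.to_datetime([
        "2023-01-10", "2023-02-20", "2023-05-15",
        "2023-06-10", "2023-07-01", "2023-08-12",
    ]),
    "readmitted_30d": [0, 1, 0, 1, 0, 0],
})

train_df, test_df = time_based_split(demo, cutoff="2023-07-01")
```

Rows discharged inside the gap (here the 2023-06-10 discharge) are dropped from both sets rather than assigned to either side.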
Also consider: patient-level split (same patient should not appear in
both train and test; the model will overfit to individual patient history)
Trap 2: Leaking the Label
Interview question: "You're predicting ICU mortality. Your feature set includes
'total days on vasopressors'. AUC = 0.98. Is this a good model?"
Red flag: days_on_vasopressors is a label-adjacent feature
Patients who die in the ICU tend to have more days on vasopressors
But this feature is only known AFTER the ICU stay is completed
If you're predicting at admission, this information isn't available yet
Correct diagnosis: temporal leakage
The feature reflects what happens DURING the ICU stay, so it is not known at admission
Fix: only include features available at the prediction time (admission)
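One way to enforce this fix is to record, for each feature, when it becomes available and filter on that. The sketch below is illustrative only: the feature names and the `admission` / `end_of_stay` labels are hypothetical, not taken from a real dataset.

```python
# Hypothetical catalogue: feature name -> point in time when its value is known
FEATURE_AVAILABLE_AT = {
    "age": "admission",
    "admission_lactate": "admission",
    "first_gcs_score": "admission",
    "total_days_on_vasopressors": "end_of_stay",
    "icu_length_of_stay": "end_of_stay",
}


def usable_features(available_at, prediction_time="admission"):
    """Keep only features whose value is known at the prediction time."""
    return [name for name, when in available_at.items() if when == prediction_time]


admission_features = usable_features(FEATURE_AVAILABLE_AT)
print(admission_features)
```

Keeping this mapping next to the feature definitions makes the "is it known at prediction time?" question explicit for every new feature.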
Other leakage patterns:
Including any variable that is downstream of the target
Including the target itself in a renamed form
Including features computed from the test period in training normalisation
Trap 3: Test Set Normalisation Leakage
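# WRONG: fit scaler on all data, then split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # uses test data statistics!
X_train, X_test = train_test_split(X_scaled, ...)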
# CORRECT: fit scaler ONLY on training data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, ...)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw) # fit on train only
X_test = scaler.transform(X_test_raw) # transform test with train statistics
# Same for imputers, encoders, and feature selectors
# ALL preprocessing must be fit on training data only
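The same fit-on-train rule applies to an imputer. A minimal sketch, assuming synthetic data with roughly 10% missing values; the arrays and the missingness rate are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data with missing values (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train_raw)  # column means learned from train only
X_test = imputer.transform(X_test_raw)        # test gaps filled with train means
```

After fitting, `imputer.statistics_` holds the per-column training means, which is a quick way to confirm that no test rows contributed to them.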
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Pipeline prevents leakage automatically
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
# fit() is called only on training data inside cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5)  # correct: no leakage
Trap 4: K-Fold Without Patient-Level Grouping
# WRONG for clinical data with repeated measurements per patient
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train_idx, val_idx in kf.split(X):
    # Same patient may appear in both train and val!
    # Model memorises patient-specific patterns → inflated performance
    pass
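A quick way to see the problem is to count how many patients land in both the train and validation indices of each fold. The sketch below uses made-up data (20 patients with 3 interleaved visits each) and a hypothetical helper `shared_patients`; it compares plain `KFold` with `GroupKFold`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold


def shared_patients(splitter, X, groups):
    """Per fold, count patients present in both the train and validation indices."""
    counts = []
    for train_idx, val_idx in splitter.split(X, groups=groups):
        overlap = set(groups[train_idx]) & set(groups[val_idx])
        counts.append(len(overlap))
    return counts


# Made-up data: 20 patients, 3 visits each, visits interleaved across rows
patient_ids = np.tile(np.arange(20), 3)
X = np.zeros((len(patient_ids), 2))

kfold_overlap = shared_patients(KFold(n_splits=5), X, patient_ids)
group_overlap = shared_patients(GroupKFold(n_splits=5), X, patient_ids)
print("KFold shared patients per fold:     ", kfold_overlap)
print("GroupKFold shared patients per fold:", group_overlap)
```

`KFold` accepts the `groups` argument but ignores it, which is why the same helper works for both splitters.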
# CORRECT: group-aware cross-validation
from sklearn.model_selection import GroupKFold
patient_ids = df["patient_id"].values
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=patient_ids):
    # Each patient appears in exactly one fold
    # Performance estimate is for new patients, not existing ones
    pass
Trap 5: Evaluating on the Validation Set You Tuned On
Common mistake in competition ML (and industry):
1. Train on train set
2. Tune hyperparameters on validation set
3. Report validation AUC as the model's performance
Problem: validation AUC is optimistically biased after hyperparameter tuning
Each tuning round selects the hyperparameters that happen to work best
on this particular validation set → overfitting to validation set
Correct approach:
1. Train on train set
2. Tune hyperparameters on validation set
3. Select best hyperparameters
4. Retrain on train + validation combined
5. Evaluate ONCE on held-out test set
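The five steps above can be sketched as follows. This is an illustrative outline on synthetic data: the dataset, the split sizes, and the grid of `C` values are assumptions, and logistic regression stands in for whatever model is being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Hold out the test set first, then carve validation out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

# Steps 1-3: train on train, score each candidate on validation, pick the best
candidate_C = [0.01, 0.1, 1.0, 10.0]
val_auc = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_auc[C] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
best_C = max(val_auc, key=val_auc.get)

# Step 4: retrain on train + validation with the selected hyperparameter
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)

# Step 5: a single evaluation on the untouched test set
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"best C = {best_C}, test AUC = {test_auc:.3f}")
```

The test rows are never used inside the tuning loop, so `test_auc` is the number to report, not the best entry in `val_auc`.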
Or: use nested cross-validation
Outer loop: train/test split (5-fold)
Inner loop: hyperparameter search (3-fold on training data only)
Trap 6: Ignoring Class Imbalance in Evaluation
# With 95% class 0 and 5% class 1:
# A model that predicts class 0 for everything achieves 95% accuracy
# This is not useful
# Better metrics for imbalanced data:
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score,  # threshold-independent
    f1_score, precision_score, recall_score,  # threshold-dependent
    balanced_accuracy_score,
    classification_report,
)
y_pred_all_zeros = np.zeros(len(y_test))
print(f"Naive accuracy: {(y_pred_all_zeros == y_test).mean():.3f}") # 0.950
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_all_zeros):.3f}") # 0.500
print(f"AUC-PR: {average_precision_score(y_test, y_pred_all_zeros):.3f}") # 0.050
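# AUC-PR (area under the precision-recall curve) is usually the more informative
# metric for very imbalanced classes: its no-skill baseline is the positive rate
# (0.05 here), whereas AUC-ROC can look strong while precision on the rare class is still poor
The Right Answer Framework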
When asked about splitting strategy in an interview:
1. Ask: "Is there a temporal dimension to the data?"
Yes → time-based split, not random
2. Ask: "Are there repeated measurements per subject?"
Yes → group-based split (GroupKFold, GroupShuffleSplit)
3. Ask: "What is the class balance?"
Imbalanced → stratified split, use AUC-PR over accuracy
4. Ask: "What preprocessing will I apply?"
Any → it must be fit on training data only, applied to test
5. Ask: "How will I do hyperparameter search?"
Any → hold out a separate test set not used in any tuning decision
Interview Answer
"The common sampling traps I watch for: temporal leakage (random splits when data has time ordering — use chronological splits instead); patient-level grouping (same patient in train and test inflates performance — use GroupKFold); preprocessing leakage (fitting normalisation on all data before splitting — always fit on training data only, then transform test); and evaluation set contamination (tuning hyperparameters on the test set — always have a held-out test set used exactly once). For clinical data with class imbalance, I always use stratified splits and report AUC-ROC or AUC-PR rather than accuracy — a model predicting all zeros achieves 95% accuracy on a 5% positive rate dataset but is completely useless."