Sampling Interview Traps — Statistics & Math for AI/ML Interviews | Learnixo

Trap 1: Random Split When Order Matters

Interview question: "Your model predicts hospital readmission 30 days after discharge.
How do you split your dataset?"

Wrong answer: random 80/20 split

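Problem: temporal leakage
  A random split mixes discharges from every period, so the model trains on
  records that are later than some of the records it is tested on
  The score then measures interpolation within a period the model has already
  seen, not prediction for future patients, which is how it will be deployed
  Worse: the same patient can land in both splits (discharged in January in
  train, again in February in test), so the later record leaks information
  about the earlier outcome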

Correct answer: time-based split
  Train: discharges before date X
  Test:  discharges after date X
  Buffer: leave a 30-day gap before X so that every training label's
          follow-up window closes before the test period begins

  Also consider: patient-level split (the same patient should not appear in
  both train and test, otherwise the model can memorise individual patient history)
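
A minimal sketch of the time-based split with a follow-up buffer, using pandas. The column names (`patient_id`, `discharge_date`, `readmitted_30d`), the cutoff date, and the toy rows are all illustrative assumptions, not part of any real dataset. The buffer is placed between the last training discharge and the cutoff, which is one reasonable reading of the rule above.

```python
import pandas as pd

# Hypothetical discharge records (column names and values are assumptions)
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 1, 4, 2],
    "discharge_date": pd.to_datetime([
        "2023-01-10", "2023-02-15", "2023-03-20",
        "2023-05-25", "2023-06-05", "2023-07-01",
    ]),
    "readmitted_30d": [0, 1, 0, 1, 0, 0],
})

cutoff = pd.Timestamp("2023-06-01")   # "date X"
follow_up = pd.Timedelta(days=30)     # length of the outcome window

# Train only on discharges whose 30-day outcome window closed before the cutoff
train = df[df["discharge_date"] < cutoff - follow_up]

# Test on discharges at or after the cutoff
test = df[df["discharge_date"] >= cutoff]

# Patient-level guard: drop test rows for patients already seen in training
test = test[~test["patient_id"].isin(train["patient_id"])]

print(len(train), "train rows;", len(test), "test rows")
```

The discharge on 2023-05-25 falls inside the buffer and is used in neither split. Dropping test patients who appear in training is the strictest choice; whether it is appropriate depends on whether the deployed model will see returning patients.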

Trap 2: Leaking the Label

Interview question: "You're predicting ICU mortality. Your feature set includes
'total days on vasopressors'. AUC = 0.98. Is this a good model?"

Red flag: days_on_vasopressors is a label-adjacent feature
  Patients who die in the ICU tend to have more days on vasopressors
  But this feature is only known AFTER the ICU stay is completed
  If you're predicting at admission, this information isn't available yet

Correct diagnosis: temporal leakage (a feature from the future)
  The feature summarises what happens DURING the ICU stay
  Known at admission? No
  Fix: only include features available at the prediction time (admission)

Other leakage patterns:
  Including any variable that is downstream of the target
  Including the target itself in a renamed form
  Computing normalisation statistics on data that includes the test period
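
One way to make the "available at prediction time" rule enforceable is to record, for each feature, when it becomes known, and filter on that. The sketch below is illustrative only: the catalogue, the feature names other than `days_on_vasopressors`, and the helper `features_known_at` are hypothetical.

```python
# Hypothetical feature catalogue: feature name -> stage at which it is known
FEATURE_AVAILABLE_AT = {
    "age": "admission",
    "admission_lactate": "admission",
    "days_on_vasopressors": "discharge",   # only known once the stay is over
    "icu_length_of_stay": "discharge",
}


def features_known_at(stage, catalogue):
    """Return the features whose values exist at the given prediction stage."""
    return [name for name, known_at in catalogue.items() if known_at == stage]


admission_features = features_known_at("admission", FEATURE_AVAILABLE_AT)
print(admission_features)  # ['age', 'admission_lactate']
```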

Trap 3: Test Set Normalisation Leakage

Python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X (features) and y (labels) are assumed to be loaded already

# WRONG: fit scaler on all data, then split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # uses test data statistics!
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)

# CORRECT: split first, then fit scaler ONLY on training data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # fit on train only
X_test  = scaler.transform(X_test_raw)       # transform test with train statistics

# Same for imputers, encoders, and feature selectors:
# ALL preprocessing must be fit on training data only

# A Pipeline refits every step on the training folds only, which
# prevents preprocessing leakage during cross-validation
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)  # scaler is fit inside each training fold

Trap 4: K-Fold Without Patient-Level Grouping

Python

# WRONG for clinical data with repeated measurements per patient
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for train_idx, val_idx in kf.split(X):
    # Same patient may appear in both train and val!
    # Model memorises patient-specific patterns → inflated performance
    pass

# CORRECT: group-aware cross-validation
from sklearn.model_selection import GroupKFold

patient_ids = df["patient_id"].values
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=patient_ids):
    # Each patient appears in exactly one fold
    # Performance estimate is for new patients, not existing ones
    pass
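
A quick check that the grouping does what is claimed: the sketch below builds synthetic data (20 hypothetical patients with 5 rows each, all values random) and counts how many patients are shared between train and validation in each GroupKFold split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 patients, 5 measurements each
patient_ids = np.repeat(np.arange(20), 5)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

overlaps = []
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=patient_ids):
    # Patients that appear on both sides of this split
    shared = set(patient_ids[train_idx]) & set(patient_ids[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # [0, 0, 0, 0, 0]
```

Running the same check with plain KFold on data like this is a simple way to show an interviewer how much patient overlap a naive split produces.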

Trap 5: Evaluating on the Validation Set You Tuned On

Common mistake in competition ML (and industry):
  1. Train on train set
  2. Tune hyperparameters on validation set
  3. Report validation AUC as the model's performance

Problem: validation AUC is optimistically biased after hyperparameter tuning
  Each tuning round selects the hyperparameters that happen to work best
  on this particular validation set → overfitting to validation set

Correct approach:
  1. Train on train set
  2. Tune hyperparameters on validation set
  3. Select best hyperparameters
  4. Retrain on train + validation combined
  5. Evaluate ONCE on held-out test set
  
  Or: use nested cross-validation
  Outer loop: train/test split (5-fold)
  Inner loop: hyperparameter search (3-fold on training data only)
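
The nested scheme above can be sketched with scikit-learn by passing a `GridSearchCV` object to `cross_val_score`. The synthetic dataset, the logistic regression model, and the grid over `C` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner loop: 3-fold hyperparameter search, run on each outer training fold only
inner_search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="roc_auc",
)

# Outer loop: 5-fold estimate of the whole "search + refit" procedure
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer scores estimate the performance of the tuning procedure as a whole, so no single outer fold is ever used to choose hyperparameters for itself.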

Trap 6: Ignoring Class Imbalance in Evaluation

Python

import numpy as np

# With 95% class 0 and 5% class 1:
# A model that predicts class 0 for everything achieves 95% accuracy
# This is not useful

# Better metrics for imbalanced data:
from sklearn.metrics import (
    roc_auc_score, average_precision_score,  # threshold-independent
    f1_score, precision_score, recall_score,  # threshold-dependent
    balanced_accuracy_score,
    classification_report,
)

# y_test is assumed to be a binary label array with a 5% positive rate
y_pred_all_zeros = np.zeros(len(y_test))
print(f"Naive accuracy: {(y_pred_all_zeros == y_test).mean():.3f}")  # 0.950
print(f"AUC-ROC:        {roc_auc_score(y_test, y_pred_all_zeros):.3f}")  # 0.500
print(f"AUC-PR:         {average_precision_score(y_test, y_pred_all_zeros):.3f}")  # 0.050

# In practice, pass predicted probabilities (not hard labels) to both AUC metrics
# AUC-PR (area under the precision-recall curve) is usually more informative for
# very imbalanced classes: its no-skill baseline equals the positive rate (0.05 here),
# whereas AUC-ROC can stay high while precision on the rare class is poor
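
To illustrate how AUC-ROC can look strong while precision on the rare class is poor, here is a small hand-built example. The labels and scores are invented for illustration: 10 positives, 1000 negatives, and a model that ranks 100 of the negatives above every positive.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 10 positives followed by 1000 negatives (about 1% positive rate)
y_true = np.array([1] * 10 + [0] * 1000)

y_score = np.concatenate([
    np.full(10, 0.9),    # every positive gets a score of 0.9
    np.full(100, 0.95),  # 100 negatives are ranked ABOVE all positives
    np.full(900, 0.1),   # the remaining negatives are ranked below
])

auc_roc = roc_auc_score(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)

print(f"AUC-ROC: {auc_roc:.3f}")  # 0.900
print(f"AUC-PR:  {auc_pr:.3f}")   # 0.091
```

AUC-ROC is 0.9 because each positive outranks 900 of the 1000 negatives. AUC-PR is about 0.09 because catching the positives means accepting 100 false alarms for 10 true cases.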

The Right Answer Framework

When asked about splitting strategy in an interview:

1. Ask: "Is there a temporal dimension to the data?"
   Yes → time-based split, not random

2. Ask: "Are there repeated measurements per subject?"
   Yes → group-based split (GroupKFold, GroupShuffleSplit)

3. Ask: "What is the class balance?"
   Imbalanced → stratified split, use AUC-PR over accuracy

4. Ask: "What preprocessing will I apply?"
   Any → it must be fit on training data only, applied to test

5. Ask: "How will I do hyperparameter search?"
   Any → hold out a separate test set not used in any tuning decision
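
For step 3, a stratified split can be sketched with the `stratify` argument of `train_test_split`. The data here is synthetic: 1000 rows with a 5% positive rate, chosen to match the imbalance example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 50 positives and 950 negatives (5% positive rate)
y = np.array([1] * 50 + [0] * 950)
X = np.arange(1000).reshape(-1, 1)   # placeholder feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both splits keep the 5% positive rate
print(f"Train positive rate: {y_train.mean():.3f}")  # 0.050
print(f"Test positive rate:  {y_test.mean():.3f}")   # 0.050
```

When the data also has repeated measurements per subject, stratifying alone is not enough and the split has to respect groups as well.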

Interview Answer

"The common sampling traps I watch for: temporal leakage (random splits when data has time ordering — use chronological splits instead); patient-level grouping (same patient in train and test inflates performance — use GroupKFold); preprocessing leakage (fitting normalisation on all data before splitting — always fit on training data only, then transform test); and evaluation set contamination (tuning hyperparameters on the test set — always have a held-out test set used exactly once). For clinical data with class imbalance, I always use stratified splits and report AUC-ROC or AUC-PR rather than accuracy — a model predicting all zeros achieves 95% accuracy on a 5% positive rate dataset but is completely useless."