Interview: Overfitting Walk-Through Scenario

The Scenario

You trained a gradient boosting model to classify drugs into four therapeutic categories (anticoagulant, antidiabetic, antihypertensive, antibiotic) using molecular features. The model achieves 98% accuracy on the training set but only 67% on the validation set. The team is asking: what's going wrong and what should you do?

Step 1: Confirm Overfitting

Before proposing fixes, verify the diagnosis is correct.

Python

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Reproduce the reported numbers
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Training accuracy:   {train_acc:.2%}")  # 98%
print(f"Validation accuracy: {val_acc:.2%}")    # 67%
# Compute the gap from the measured values rather than hard-coding it
print(f"Gap: {(train_acc - val_acc) * 100:.0f} percentage points")  # 31 points: severe overfitting

# Look at per-class performance
print("\nTraining classification report:")
print(classification_report(y_train, model.predict(X_train)))

print("\nValidation classification report:")
print(classification_report(y_val, model.predict(X_val)))
# Does the model struggle equally on all classes, or just some?

Confirmed: a gap of 31 percentage points is severe overfitting. Training accuracy says little about how the model will generalize.
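The question of whether the model struggles equally on all classes can also be checked with a few lines of plain Python. The sketch below computes per-class recall from two label lists; the helper name `per_class_recall` and the toy labels are illustrative assumptions, not results from the scenario.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only (not results from the scenario)
y_true = ["anticoagulant"] * 4 + ["antibiotic"] * 2
y_pred = ["anticoagulant"] * 4 + ["anticoagulant", "antibiotic"]

recall = per_class_recall(y_true, y_pred)
for label, value in recall.items():
    print(f"{label:<15s} recall: {value:.0%}")
```

If validation recall collapses for one class while the others hold up, the problem is likely class-specific rather than uniform overfitting.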

Step 2: Identify Root Causes

Ask diagnostic questions before writing code.

Python

# How many training samples vs features?
print(f"Training samples:  {X_train.shape[0]}")   # e.g., 150
print(f"Features:          {X_train.shape[1]}")   # e.g., 80
# → 150 samples, 80 features — very high feature-to-sample ratio

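# Model hyperparameters (get_params() returns every setting; print the relevant ones)
params = model.get_params()
print({k: params[k] for k in ("n_estimators", "max_depth", "learning_rate")})
# {'n_estimators': 500, 'max_depth': None, 'learning_rate': 0.1}
# → max_depth=None means trees can grow arbitrarily deep → memorization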

# Class distribution
import pandas as pd
print(pd.Series(y_train).value_counts(normalize=True))
# anticoagulant: 0.45, antidiabetic: 0.30, antihypertensive: 0.15, antibiotic: 0.10
# → class imbalance — 10% antibiotic

Root causes identified:

max_depth=None → trees grow deep enough to memorize training data
High feature-to-sample ratio (80 features, 150 samples)
Class imbalance (antibiotic is 10% of 150 samples, about 15 examples, so it is easy to memorize and hard to validate)
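The data-related root causes can be put into numbers. The sketch below reuses the example figures from the comments above (150 samples, 80 features, the four class shares); the 5-fold split is an assumption added for illustration.

```python
# Figures taken from the example comments in the scenario
n_samples, n_features = 150, 80
class_shares = {
    "anticoagulant": 0.45,
    "antidiabetic": 0.30,
    "antihypertensive": 0.15,
    "antibiotic": 0.10,
}
n_folds = 5  # assumed fold count, for illustration only

samples_per_feature = n_samples / n_features
minority_class = min(class_shares, key=class_shares.get)
minority_count = round(n_samples * class_shares[minority_class])
minority_per_fold = minority_count // n_folds

print(f"Samples per feature: {samples_per_feature:.2f}")
print(f"{minority_class}: about {minority_count} samples, "
      f"about {minority_per_fold} per held-out fold")
```

Fewer than two samples per feature, and only a handful of minority-class examples in any held-out fold, means per-class metrics for the rarest class will be noisy.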

Step 3: Apply Targeted Fixes

Python

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score

# Fix A: Constrain model complexity
model_v2 = GradientBoostingClassifier(
    n_estimators=100,      # Reduced from 500
    max_depth=3,           # Was None — set a hard limit
    min_samples_leaf=5,    # Each leaf must have at least 5 samples
    learning_rate=0.05,    # Slower learning → less memorization
    subsample=0.8,         # Use 80% of data per tree → stochastic boosting
    max_features="sqrt",   # Random feature subset per split
    random_state=42,
)

# Fix B: Use cross-validation for reliable evaluation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model_v2, X_train, y_train, cv=cv, scoring="accuracy")
print(f"5-fold CV accuracy: {cv_scores.mean():.2%} ± {cv_scores.std():.2%}")

# Fix C: Handle class imbalance
from sklearn.utils.class_weight import compute_sample_weight

model_v3 = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=42)
# GradientBoostingClassifier has no class_weight parameter,
# so pass per-sample weights to fit() instead
sample_weights = compute_sample_weight("balanced", y_train)
model_v3.fit(X_train, y_train, sample_weight=sample_weights)

# Fix D: Feature selection — reduce dimensionality
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=20)   # Keep top 20 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_val_selected   = selector.transform(X_val)

model_v4 = GradientBoostingClassifier(max_depth=3, n_estimators=100)
model_v4.fit(X_train_selected, y_train)
print(f"With feature selection — Val: {accuracy_score(y_val, model_v4.predict(X_val_selected)):.2%}")
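The "balanced" weighting in Fix C weights each sample inversely to its class frequency, using n_samples / (n_classes * class_count). As a rough sketch of that formula in plain Python (the helper name `balanced_class_weights` and the class counts are assumptions chosen to approximate the 45/30/15/10 split over 150 samples):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical class counts that roughly match the example distribution
labels = (["anticoagulant"] * 68 + ["antidiabetic"] * 45
          + ["antihypertensive"] * 22 + ["antibiotic"] * 15)

weights = balanced_class_weights(labels)
sample_weights = [weights[label] for label in labels]

for label, weight in weights.items():
    print(f"{label:<17s} weight: {weight:.2f}")
```

With these weights every class contributes the same total weight to the loss, so errors on the rare antibiotic class count for more per sample than errors on the majority class.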

Step 4: Compare Results

Python

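# cross_val_score only fits clones, so model_v2 itself still needs fitting
model_v2.fit(X_train, y_train)

experiments = {
    "Original (overfit)":          accuracy_score(y_val, model.predict(X_val)),
    "Constrained depth":           accuracy_score(y_val, model_v2.predict(X_val)),
    "Feature selection":           accuracy_score(y_val, model_v4.predict(X_val_selected)),
    # Mean 5-fold CV accuracy on the training set, not a validation-set score
    "Constrained depth (5-fold CV)": cv_scores.mean(),
}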

for name, score in experiments.items():
    print(f"{name:<30s}: {score:.2%}")
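When comparing these scores, it helps to know how much of a difference could be noise. A minimal sketch, assuming a binomial model for accuracy: the validation-set size `n_val = 50` is hypothetical, since the scenario does not state it, and the helper name is illustrative.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


n_val = 50      # hypothetical validation-set size; not given in the scenario
val_acc = 0.67  # validation accuracy reported in the scenario

se = accuracy_standard_error(val_acc, n_val)
low, high = val_acc - 1.96 * se, val_acc + 1.96 * se
print(f"Standard error: {se:.3f}")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
```

Under that assumed size, a difference of a few points between two experiments sits well inside the interval, so it should not be read as a real improvement without more data or repeated cross-validation.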

Step 5: Longer-Term Fixes

If the model still underperforms after these fixes:

Collect more data:

Python

# 150 samples with 80 features is a red flag
# Rule of thumb: aim for roughly 10-50× as many samples as features
# With 80 features → aim for 800-4000 samples
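The arithmetic behind that target is simple enough to script. The sketch below applies the 10-50× rule of thumb to the scenario's figures; the function name and default ratios are illustrative, and the rule itself is a heuristic rather than a guarantee.

```python
def sample_target_range(n_features, low_ratio=10, high_ratio=50):
    """Return the (min, max) sample counts implied by a samples-per-feature rule of thumb."""
    return n_features * low_ratio, n_features * high_ratio


n_samples, n_features = 150, 80  # figures from the scenario
low, high = sample_target_range(n_features)
print(f"Target: {low}-{high} samples; shortfall of at least {low - n_samples}")
```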

Regularize more aggressively:

Python

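from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With few samples per feature, L2-regularized logistic regression can outperform boosting.
# Scale features first so the L2 penalty treats them equally. The default lbfgs solver
# already fits a multinomial model, so the deprecated multi_class argument is dropped.
lr = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, max_iter=1000))
cv_lr = cross_val_score(lr, X_train, y_train, cv=5, scoring="accuracy")
print(f"Logistic Regression CV: {cv_lr.mean():.2%}")
# Worth trying whenever samples are scarce relative to features (here 150 vs. 80)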

Use simpler representations:

Python

# 80 molecular features → maybe 10 chemically meaningful ones
# Domain knowledge reduces noise more effectively than regularization

Model Card: What to Report

Python

# model_v2 must already be fitted on X_train (cross_val_score only fits clones)
final_results = {
    "Model": "GradientBoostingClassifier",
    "Hyperparameters": {
        "max_depth": 3, "n_estimators": 100, "learning_rate": 0.05,
        "min_samples_leaf": 5, "subsample": 0.8, "max_features": "sqrt",
        "random_state": 42,
    },
    "5-fold CV accuracy": f"{cv_scores.mean():.2%} ± {cv_scores.std():.2%}",
    "Validation accuracy": f"{accuracy_score(y_val, model_v2.predict(X_val)):.2%}",
    "Test accuracy": "PENDING: not evaluated yet",
    "Note": "Test set held for final evaluation only",
}

Interview Summary: What Interviewers Want to Hear

Confirm the problem with numbers, don't assume
Investigate causes before proposing fixes (model complexity, data ratio, imbalance)
Apply targeted fixes — don't just "add regularization" without knowing why
Measure improvement — did the fix actually work?
Know the tradeoffs — tighter constraints may underfit; feature reduction may lose signal
Think about data — 150 samples for 80 features is a fundamental problem no hyperparameter can fully fix

One-line answer: "98% training, 67% validation is severe overfitting. I'd constrain the model (max_depth=3, subsample=0.8), use cross-validation for evaluation, reduce features to the most informative 20, and handle class imbalance with sample weights. If the gap persists, the root cause is insufficient training data for 80 features — I'd prioritize data collection or dimensionality reduction."
