Machine Learning Foundations · Lesson 23 of 70
How to Detect Overfitting
The Primary Signal: Train/Validation Gap
The most direct way to detect overfitting is comparing training and validation performance.
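One way to make this comparison less dependent on a single split is to average the gap over cross-validation folds. The sketch below assumes scikit-learn's `cross_validate` with `return_train_score=True`. The synthetic data and the helper name `cv_gap` are illustrative and not part of the lesson's own code.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def cv_gap(model, X, y, cv: int = 5) -> dict:
    """Average train and validation AUC over CV folds and return their gap."""
    scores = cross_validate(
        model, X, y, cv=cv, scoring="roc_auc", return_train_score=True
    )
    train_auc = float(scores["train_score"].mean())
    # "test_score" is scikit-learn's name for the held-out fold score.
    val_auc = float(scores["test_score"].mean())
    return {"train_auc": train_auc, "val_auc": val_auc, "gap": train_auc - val_auc}


# Synthetic binary task (illustrative): only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

deep = cv_gap(DecisionTreeClassifier(random_state=0), X, y)
shallow = cv_gap(DecisionTreeClassifier(max_depth=3, random_state=0), X, y)
print(f"unconstrained tree: gap={deep['gap']:.3f}")
print(f"max_depth=3 tree:   gap={shallow['gap']:.3f}")
```

The unconstrained tree should show the larger gap, since it can fit every training point exactly while the depth-limited tree cannot.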
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
import numpy as np

def evaluate_fit(model, X_train, y_train, X_val, y_val) -> dict:
    """Compute train/val gap and classify overfitting severity."""
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    gap = train_auc - val_auc
    if gap < 0.03:
        diagnosis = "No overfitting"
    elif gap < 0.07:
        diagnosis = "Mild overfitting (acceptable)"
    elif gap < 0.12:
        diagnosis = "Moderate overfitting — consider regularization"
    else:
        diagnosis = "Severe overfitting — model is memorizing"
    return {
        "train_auc": round(train_auc, 4),
        "val_auc": round(val_auc, 4),
        "gap": round(gap, 4),
        "diagnosis": diagnosis,
    }

Learning Curves
A learning curve plots training and validation performance against training set size. Overfitting shows up as a large gap between the two curves that does not close as more data is added.
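To check whether the gap is actually closing, it can help to look at the gap at every training size rather than only the last one. The following is a rough sketch under assumptions: the helper `gap_by_size` and the synthetic dataset are hypothetical, and "closing" is judged simply by comparing the first and last gap.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier


def gap_by_size(model, X, y, n_points: int = 5, cv: int = 5):
    """Return training sizes and the mean train-minus-validation AUC gap at each."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.2, 1.0, n_points),
        cv=cv, scoring="roc_auc",
    )
    gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
    return sizes, gaps


# Synthetic binary task (illustrative): only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

sizes, gaps = gap_by_size(DecisionTreeClassifier(random_state=0), X, y)
for n, g in zip(sizes, gaps):
    print(f"n_train={int(n):4d}  gap={g:.3f}")
print(f"gap change from smallest to largest size: {gaps[-1] - gaps[0]:+.3f}")
```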
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.random.randn(1000, 10)
y = (X[:, 0] + np.random.randn(1000) * 0.5 > 0).astype(int)

def plot_learning_curves(model, X, y):
    """Compute training and validation scores as training size increases."""
    train_sizes = np.linspace(0.05, 1.0, 20)
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=train_sizes, cv=5, scoring="roc_auc"
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    # Per-size standard deviations, useful for error bands when plotting
    train_std = train_scores.std(axis=1)
    val_std = val_scores.std(axis=1)
    # Overfit: train stays high, val stays low, gap persists
    # Good fit: both converge as training size grows
    final_gap = train_mean[-1] - val_mean[-1]
    print(f"Final training size AUC gap: {final_gap:.3f}")
    return sizes, train_mean, val_mean

# Overfit example: unconstrained tree
tree = DecisionTreeClassifier(max_depth=None)
plot_learning_curves(tree, X, y)

# Good fit: constrained tree
tree_good = DecisionTreeClassifier(max_depth=4)
plot_learning_curves(tree_good, X, y)

Epoch-by-Epoch Monitoring (Neural Networks)
For neural networks, track both training and validation loss per epoch. Overfitting appears when validation loss stops improving while training loss continues to decrease.
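The point at which validation loss stops improving can be located with a simple patience rule. This is a minimal sketch, assuming a list of per-epoch validation losses is already available. The function name `find_overfit_epoch` and its `patience` and `min_delta` parameters are hypothetical choices for illustration.

```python
from typing import Optional, Sequence


def find_overfit_epoch(
    val_losses: Sequence[float], patience: int = 5, min_delta: float = 0.0
) -> Optional[int]:
    """Return the epoch of the best validation loss once `patience` epochs
    have passed without an improvement of more than `min_delta`, else None."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # validation loss has stalled since this epoch
    return None


# Made-up loss values for illustration only, not results from a real run.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65, 0.70]
print(find_overfit_epoch(val_losses, patience=3))
```

A larger `patience` tolerates noisy validation curves at the cost of detecting the turning point later.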
import numpy as np
import torch
import torch.nn as nn

def track_losses(model, train_loader, val_loader, n_epochs: int = 50):
    """Track train and val loss per epoch to detect overfitting."""
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    train_losses, val_losses = [], []
    for epoch in range(n_epochs):
        # Training
        model.train()
        batch_losses = []
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
            batch_losses.append(loss.item())
        train_losses.append(np.mean(batch_losses))
        # Validation
        model.eval()
        batch_val_losses = []
        with torch.no_grad():
            for X_val, y_val in val_loader:
                val_loss = criterion(model(X_val), y_val)
                batch_val_losses.append(val_loss.item())
        val_losses.append(np.mean(batch_val_losses))
        if epoch % 10 == 0:
            print(f"Epoch {epoch:3d}: train={train_losses[-1]:.4f}, val={val_losses[-1]:.4f}")
    # Overfitting signature: val_loss starts increasing
    best_val_epoch = np.argmin(val_losses)
    if n_epochs - best_val_epoch > 10:
        print(f"\nOVERFITTING DETECTED: val loss best at epoch {best_val_epoch}, "
              f"but trained for {n_epochs} epochs")
    return train_losses, val_losses

Automated Overfitting Check
from sklearn.metrics import roc_auc_score

class OverfitChecker:
    """
    Automated overfitting detector for production ML training pipelines.
    """
    def __init__(self, gap_threshold: float = 0.08):
        self.gap_threshold = gap_threshold
        self.history: list[dict] = []

    def check(
        self,
        model,
        X_train, y_train,
        X_val, y_val,
        epoch: int | None = None,
    ) -> bool:
        """Returns True if overfitting is detected."""
        train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        gap = train_auc - val_auc
        self.history.append({
            "epoch": epoch,
            "train_auc": round(train_auc, 4),
            "val_auc": round(val_auc, 4),
            "gap": round(gap, 4),
        })
        if gap > self.gap_threshold:
            print(f"OVERFIT ALERT: gap={gap:.3f} > threshold={self.gap_threshold}")
            return True
        return False

    def summary(self) -> str:
        # max()/min() raise on an empty list, so handle the no-checks case first
        if not self.history:
            return "No checks recorded"
        gaps = [h["gap"] for h in self.history]
        return f"Max gap: {max(gaps):.3f}, Min val AUC: {min(h['val_auc'] for h in self.history):.3f}"

What the Confusion Matrix Reveals
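If the confusion matrix is near-perfect on training data but shows many more errors on validation data, the model is likely overfit. Comparing the two also shows which classes account for the extra errors.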
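Per-class error rates make the discrepancy easier to read than raw counts. A minimal sketch, assuming the confusion matrices are given as nested lists with true classes in rows (the scikit-learn convention). The helper `per_class_error_rates` is hypothetical, and the counts are the illustrative example values used in this section.

```python
def per_class_error_rates(cm):
    """For each true class (row), return the fraction that was misclassified."""
    return [1 - row[i] / sum(row) for i, row in enumerate(cm)]


# Illustrative counts, the same ones shown as example output in this section.
train_cm = [[490, 10], [5, 495]]
val_cm = [[78, 22], [30, 70]]

train_err = per_class_error_rates(train_cm)
val_err = per_class_error_rates(val_cm)
for cls, (tr, va) in enumerate(zip(train_err, val_err)):
    print(f"class {cls}: train error={tr:.2f}, val error={va:.2f}, increase={va - tr:.2f}")
```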
from sklearn.metrics import confusion_matrix
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
print("Training confusion matrix:")
print(confusion_matrix(y_train, y_train_pred))
# [[490, 10], [5, 495]] — almost perfect
print("Validation confusion matrix:")
print(confusion_matrix(y_val, y_val_pred))
# [[78, 22], [30, 70]] — many more errors
# Large discrepancy → overfitting

Overfitting vs Distribution Shift
If production performance drops significantly below validation performance (but validation was measured correctly), the problem might be data drift, not overfitting.
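This distinction can be sketched as a small helper that reports which gap is large. The function name `locate_gap` and the 0.05 tolerance are assumptions for illustration. A sensible tolerance depends on the metric and on its run-to-run noise.

```python
def locate_gap(train: float, val: float, prod: float, tol: float = 0.05) -> list:
    """Label which performance gaps exceed `tol` (higher score = better)."""
    findings = []
    if train - val > tol:
        findings.append("train -> val gap: possible overfitting")
    if val - prod > tol:
        findings.append("val -> production gap: possible distribution shift")
    return findings or ["no gap above tolerance"]


# Hypothetical scores for illustration only.
print(locate_gap(train=0.97, val=0.72, prod=0.70))
print(locate_gap(train=0.88, val=0.86, prod=0.74))
```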
Train → Val gap: overfitting (model is too complex)
Val → Production gap: distribution shift (data changed since training)

Quick Diagnostic Checklist
□ Train accuracy >> Val accuracy? → Overfitting
□ Both train and val accuracy are low? → Underfitting
□ Val loss increasing while train loss decreasing? → Overfitting (neural network)
□ Learning curve: gap doesn't close with more data? → High variance (overfitting)
□ Val score >> Test score? → Validation overfitting (too many experiments on val)
□ Val score similar to test score? → Validation set is representative

Interview Answer Template
Q: How do you detect overfitting?
The primary signal is the gap between training and validation performance — if training accuracy is 97% and validation is 72%, that's severe overfitting. For neural networks, I track training and validation loss per epoch: when validation loss starts increasing while training loss continues decreasing, that's the overfitting point (and where early stopping should kick in). Learning curves are also useful: plotting performance vs training set size shows whether the gap narrows as more data is added — if the gap persists, the model has too much capacity. In practice, I'd set up automated gap monitoring as part of the training loop, alert if the gap exceeds a threshold, and compare the confusion matrices on training vs validation to understand which cases the model is memorizing.