What is Overfitting?
Understand overfitting: why models memorize training noise, how to detect it from learning curves, common causes, and the first-line fixes, with code examples and interview-ready explanations.
Overfitting occurs when a model learns the training data too well, including noise and random fluctuations that aren't part of the true pattern. The result: excellent training performance, poor performance on new data.
Overfitting model:
  Training loss:   very low (model memorized training examples)
  Validation loss: high (model fails on new data)
  Gap:             large

Well-fit model:
  Training loss:   low
  Validation loss: low and close to training loss
  Gap:             small

A Concrete Example
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

np.random.seed(42)
X = np.random.randn(300, 10)
# Linear boundary on features 0 and 1; the other eight features are pure noise
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

# Unconstrained model: depth can grow until every training point is isolated
overfit_tree = DecisionTreeClassifier(max_depth=None)
overfit_tree.fit(X_train, y_train)
print(f"Overfit tree - Train: {accuracy_score(y_train, overfit_tree.predict(X_train)):.2%}")
print(f"Overfit tree - Val: {accuracy_score(y_val, overfit_tree.predict(X_val)):.2%}")
# Train accuracy is 100%; validation accuracy is clearly lower (large gap)

# Constrained model: limited depth prevents memorization
good_tree = DecisionTreeClassifier(max_depth=3)
good_tree.fit(X_train, y_train)
print(f"Good tree - Train: {accuracy_score(y_train, good_tree.predict(X_train)):.2%}")
print(f"Good tree - Val: {accuracy_score(y_val, good_tree.predict(X_val)):.2%}")
# Expect train and validation accuracy to sit closer together (small gap);
# exact values depend on the split

The Learning Curve
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 10),
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    # Plot mean accuracy against training-set size:
    # Overfit: high train, low val, with a large gap at full training size
    # Good fit: train and val converge with more data
    plt.plot(train_sizes, train_mean, marker="o", label="Training")
    plt.plot(train_sizes, val_mean, marker="o", label="Validation")
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")
    plt.legend()
    return train_mean, val_mean, train_sizes

Overfitting signature: a large gap between the training and validation curves at full training size. The gap may close with more data.
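To make that signature concrete, here is a minimal sketch that measures the gap at the largest training size for an unconstrained tree and a depth-limited one. The synthetic data (with 20% of labels flipped so there is noise to memorize) and the helper name `gap_at_full_size` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration): the label depends on two features,
# and 20% of labels are flipped so there is noise to memorize.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(600) < 0.2
y = np.where(flip, 1 - y, y)


def gap_at_full_size(model, X, y):
    """Return (train score, validation score, gap) at the largest training size."""
    _, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 5),
    )
    train_final = train_scores.mean(axis=1)[-1]
    val_final = val_scores.mean(axis=1)[-1]
    return train_final, val_final, train_final - val_final


deep = gap_at_full_size(DecisionTreeClassifier(random_state=0), X, y)
shallow = gap_at_full_size(DecisionTreeClassifier(max_depth=3, random_state=0), X, y)
print(f"Unconstrained tree: train={deep[0]:.2f} val={deep[1]:.2f} gap={deep[2]:.2f}")
print(f"max_depth=3 tree:   train={shallow[0]:.2f} val={shallow[1]:.2f} gap={shallow[2]:.2f}")
```

The unconstrained tree fits every training point, flipped labels included, so its gap should come out noticeably larger than the shallow tree's.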
What Causes Overfitting?
| Cause | Explanation |
|---|---|
| Model too complex | Too many parameters relative to training data |
| Too few training examples | Model memorizes few examples instead of learning patterns |
| Noisy features | Features with no signal still get weights assigned |
| Too many training epochs | Model continues fitting noise after learning the pattern |
| No regularization | Nothing penalizes model complexity |
| Data leakage | Test/val statistics contaminate training, giving artificially good validation scores |
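The data-leakage cause is the easiest to miss, because it makes validation look better rather than worse. A minimal sketch, assuming pure-noise features and random labels (so no honest model should beat roughly 50% accuracy): selecting features on the full dataset before cross-validation lets the validation folds influence training, while putting the selector inside a pipeline keeps them separate. The dataset sizes and `k=20` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: features are chosen using the labels of ALL rows, including the rows
# that later serve as validation folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: the selector is refit inside each training fold via a pipeline.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_score:.2f}")
print(f"Honest CV accuracy: {honest_score:.2f}")
```

The leaky score should land well above chance even though there is nothing to learn, which is why a suspiciously good validation number deserves a leakage check.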
Detecting Overfitting
from sklearn.metrics import roc_auc_score

def check_overfitting(model, X_train, y_train, X_val, y_val, threshold: float = 0.05) -> dict:
    """Return overfitting diagnosis based on train-val gap.

    Expects a fitted binary classifier that implements predict_proba.
    """
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    gap = train_auc - val_auc
    # Rule-of-thumb cutoffs; tune them for your metric and problem
    severity = (
        "none" if gap < 0.03 else
        "mild" if gap < 0.08 else
        "severe"
    )
    return {
        "train_auc": round(train_auc, 3),
        "val_auc": round(val_auc, 3),
        "gap": round(gap, 3),
        "severity": severity,
        "overfit": bool(gap > threshold),  # True when the gap exceeds `threshold`
    }

First-Line Fixes
1. Simplify the Model (Reduce Capacity)
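One way to decide how far to simplify is to sweep a capacity parameter and watch the train/validation gap. The sketch below does this for tree depth with scikit-learn's `validation_curve`; the synthetic data (20% flipped labels) and the list of depths are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y = np.where(rng.random(600) < 0.2, 1 - y, y)  # flip 20% of labels

depths = [1, 2, 3, 5, 8, 12, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy",
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:>2}  train={tr:.2f}  val={va:.2f}  gap={tr - va:.2f}")

# Pick the depth that validates best rather than the one that trains best.
best_depth = depths[int(np.argmax(val_mean))]
print("Depth with the best validation accuracy:", best_depth)
```

Training accuracy keeps rising with depth while validation accuracy does not, so the depth worth keeping is the one that validates best.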
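from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Decision tree: limit depth
DecisionTreeClassifier(max_depth=5)
# Neural network: fewer layers / neurons
# From 4 hidden layers of 512 neurons -> 2 hidden layers of 64 neurons
# Random forest: limit tree depth and require a minimum number of samples per leaf
RandomForestClassifier(max_depth=8, min_samples_leaf=5)

2. Add Regularization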
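As a rough illustration of what the penalties do, the sketch below fits an unpenalized linear model, a Ridge model, and a Lasso model on a small synthetic regression problem where only two of thirty features carry signal. The data, the sample sizes, and the `alpha` values are assumptions picked for the demonstration, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n_train, n_val, n_features = 60, 500, 30
X = rng.normal(size=(n_train + n_val, n_features))
# Only the first two features carry signal; the other 28 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n_train + n_val)
X_train, X_val = X[:n_train], X[n_train:]
y_train, y_val = y[:n_train], y[n_train:]

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.2).fit(X_train, y_train)

for name, model in [("No penalty", ols), ("Ridge (L2)", ridge), ("Lasso (L1)", lasso)]:
    print(
        f"{name:<11} train R^2={model.score(X_train, y_train):.2f}  "
        f"val R^2={model.score(X_val, y_val):.2f}  "
        f"weight norm={np.linalg.norm(model.coef_):.2f}  "
        f"nonzero weights={int(np.sum(model.coef_ != 0))}"
    )
```

The unpenalized model always has the best training score; Ridge shrinks the weight vector and Lasso zeroes out some weights entirely, which typically narrows the train/validation gap on data like this.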
from sklearn.linear_model import Ridge, Lasso, LogisticRegression

# L2 regularization: shrinks all weights
ridge = Ridge(alpha=1.0)  # alpha controls strength
# L1 regularization: drives some weights to zero
lasso = Lasso(alpha=0.01)
# Logistic regression: C is inverse of regularization strength
# Smaller C = more regularization
lr = LogisticRegression(C=0.1)

3. Get More Training Data
More data makes memorization less rewarding. Noise in individual examples tends to average out, so the pattern shared across examples dominates what the model learns.
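A small sketch of that effect, assuming synthetic data and a fully unconstrained decision tree: the model and its capacity stay fixed, and only the number of training examples changes. The helper `make_data` and the chosen sizes are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_data(n):
    """Draw n points whose label depends only on the first two of ten features."""
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


X_val, y_val = make_data(2000)  # large held-out set for a stable estimate

val_accuracy = {}
for n in [50, 200, 1000, 5000]:
    X_train, y_train = make_data(n)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    val_accuracy[n] = tree.score(X_val, y_val)
    print(f"n={n:>4}  train={tree.score(X_train, y_train):.2f}  val={val_accuracy[n]:.2f}")
```

Training accuracy stays at 100% throughout, but validation accuracy climbs as the training set grows, so the gap shrinks without touching the model.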
# If data is scarce, use data augmentation:
# - Image: flips, rotations, color jitter
# - Text: back-translation, synonym replacement
# - Tabular: SMOTE (synthetic minority over-sampling, aimed at imbalanced classes)

4. Early Stopping (Neural Networks)
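Early stopping boils down to a patience counter on validation loss. The sketch below shows that logic without any deep-learning framework, using plain NumPy logistic regression as a stand-in for a neural network. The data, learning rate, and function names are assumptions for illustration; this is not the PyTorch or Keras API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, n_features = 60, 500, 80
X = rng.normal(size=(n_train + n_val, n_features))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # only two features carry signal
X_train, X_val = X[:n_train], X[n_train:]
y_train, y_val = y[:n_train], y[n_train:]


def log_loss(w, X, y):
    """Mean binary cross-entropy, written in a numerically stable form."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def train_with_early_stopping(lr=0.2, patience=5, max_epochs=5000):
    w = np.zeros(n_features)
    best_w, best_val, best_epoch = w.copy(), np.inf, -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        p = 0.5 * (1.0 + np.tanh(0.5 * (X_train @ w)))     # stable sigmoid
        w = w - lr * X_train.T @ (p - y_train) / n_train   # one full-batch step
        val = log_loss(w, X_val, y_val)
        if val < best_val:
            # New best: remember these weights and reset the counter.
            best_w, best_val, best_epoch = w.copy(), val, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_w, best_val, best_epoch, epoch


best_w, best_val, best_epoch, stopped_epoch = train_with_early_stopping()
print(f"Best validation loss {best_val:.3f} at epoch {best_epoch}; stopped at epoch {stopped_epoch}")
```

Returning `best_w` rather than the final weights mirrors what `restore_best_weights=True` does in Keras: the model you keep is the one from the best validation epoch, not the last one.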
# Stop training when validation loss starts increasing
# PyTorch: manual patience counter
# Keras: EarlyStopping callback
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=5,  # Stop if no improvement for 5 epochs
    restore_best_weights=True,
)
# `model` is assumed to be a compiled Keras model. Set epochs high and let the
# callback decide when to stop (Keras trains for a single epoch by default).
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop],
)

Interview Answer Template
Q: What is overfitting and how do you fix it?
Overfitting occurs when a model learns training data too closely, including noise, so it performs well on training data but poorly on new data. The signature is a large gap between training and validation performance. Common causes include a model that's too complex for the amount of training data, insufficient training examples, or too many epochs without regularization. The fixes depend on the cause: simplify the model (less depth, fewer neurons), add regularization (L1/L2 for linear models, Dropout for neural networks), gather more training data, or use early stopping. In practice, I'd diagnose by looking at the train/val performance gap, then apply the appropriate fix. If training accuracy is 98% and validation is 70%, that's severe overfitting, and I'd start with regularization and reducing model complexity.