What is Overfitting?

Overfitting occurs when a model learns the training data too well — including noise and random fluctuations that aren't part of the true pattern. The result: excellent training performance, poor performance on new data.

Overfitting model:
  Training loss:   very low  (model memorized training examples)
  Validation loss: high       (model fails on new data)
  Gap: large

Well-fit model:
  Training loss:   low
  Validation loss: low and close to training loss
  Gap: small

A Concrete Example

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

np.random.seed(42)
X = np.random.randn(300, 10)
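# Linear boundary on features 0 and 1, with Gaussian noise added before thresholding.
# The other eight features carry no signal.
noise = 0.5 * np.random.randn(300)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)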

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

# Unconstrained model: depth can grow until every training point is isolated
overfit_tree = DecisionTreeClassifier(max_depth=None)
overfit_tree.fit(X_train, y_train)
print(f"Overfit tree — Train: {accuracy_score(y_train, overfit_tree.predict(X_train)):.2%}")
print(f"Overfit tree — Val:   {accuracy_score(y_val,   overfit_tree.predict(X_val)):.2%}")
# Expect 100% train accuracy and clearly lower validation accuracy: a large gap.
# Exact values depend on the data and the split.

# Constrained model: limited depth prevents memorization
good_tree = DecisionTreeClassifier(max_depth=3)
good_tree.fit(X_train, y_train)
print(f"Good tree   — Train: {accuracy_score(y_train, good_tree.predict(X_train)):.2%}")
print(f"Good tree   — Val:   {accuracy_score(y_val,   good_tree.predict(X_val)):.2%}")
# Expect train and validation accuracy to sit much closer together: a small gap, so the model generalizes.

The Learning Curve

Python

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 10)
    )

    train_mean = train_scores.mean(axis=1)
    val_mean   = val_scores.mean(axis=1)

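    # Plot mean accuracy against training-set size for both curves.
    # Overfit: train stays high, validation stays low (large gap at full training size).
    # Good fit: train and validation converge with more data.
    fig, ax = plt.subplots()
    ax.plot(train_sizes, train_mean, marker="o", label="Training")
    ax.plot(train_sizes, val_mean, marker="o", label="Validation")
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Accuracy")
    ax.legend()

    return train_mean, val_mean, train_sizes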

Overfitting signature: a large gap between the training and validation curves at full training size. If the validation curve is still rising at that point, more data is likely to narrow the gap.
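
To make the signature concrete, the sketch below measures the gap at the largest training size for an unconstrained tree and a depth-3 tree. The synthetic dataset, the noise scale, and the `final_gap` helper are assumptions made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration): two informative features,
# eight irrelevant ones, and noise added before thresholding.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

def final_gap(model):
    """Train-minus-validation accuracy at the largest training size."""
    _, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 5),
    )
    return train_scores.mean(axis=1)[-1] - val_scores.mean(axis=1)[-1]

deep_gap = final_gap(DecisionTreeClassifier(max_depth=None, random_state=0))
shallow_gap = final_gap(DecisionTreeClassifier(max_depth=3, random_state=0))
print(f"Unconstrained tree gap: {deep_gap:.3f}")
print(f"Depth-3 tree gap:       {shallow_gap:.3f}")
```

On data like this, the unconstrained tree should show the larger gap. The exact numbers depend on the seed and the noise level.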

What Causes Overfitting?

| Cause | Explanation |
|---|---|
| Model too complex | Too many parameters relative to training data |
| Too few training examples | Model memorizes few examples instead of learning patterns |
| Noisy features | Features with no signal still get weights assigned |
| Too many training epochs | Model continues fitting noise after learning the pattern |
| No regularization | Nothing penalizes model complexity |
| Data leakage | Test/val statistics contaminated training, giving artificially good validation |
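
The first two causes can be seen together in a small curve-fitting sketch: polynomials of increasing degree fitted to only 20 noisy points. The target function, noise level, and sample sizes are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (an assumed toy target)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)    # deliberately small training set
x_val, y_val = make_data(200)

def mse(poly, x, y):
    return float(np.mean((poly(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Least-squares fit; higher degree means more free parameters.
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(poly, x_train, y_train), mse(poly, x_val, y_val))
    print(f"degree {degree:>2}: train MSE {results[degree][0]:.3f}, "
          f"val MSE {results[degree][1]:.3f}")
```

Training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. Validation error is what reveals whether the extra parameters are fitting pattern or noise.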

Detecting Overfitting

Python

from sklearn.metrics import roc_auc_score

def check_overfitting(model, X_train, y_train, X_val, y_val,
                      mild_gap: float = 0.03, severe_gap: float = 0.08) -> dict:
    """Return an overfitting diagnosis based on the train-val AUC gap.

    Assumes a fitted binary classifier that exposes predict_proba.
    The gap cutoffs are rules of thumb, not universal constants.
    """
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc   = roc_auc_score(y_val,   model.predict_proba(X_val)[:, 1])
    gap = train_auc - val_auc

    if gap < mild_gap:
        severity = "none"
    elif gap < severe_gap:
        severity = "mild"
    else:
        severity = "severe"

    return {
        "train_auc": round(train_auc, 3),
        "val_auc":   round(val_auc, 3),
        "gap":       round(gap, 3),
        "severity":  severity,
    }
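
A single train/validation split can give a noisy estimate of the gap. As a hedged alternative, the sketch below averages the gap over five folds using scikit-learn's `cross_validate` with `return_train_score=True`. The synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=5, scoring="roc_auc", return_train_score=True,
)
train_auc = scores["train_score"].mean()
val_auc = scores["test_score"].mean()   # "test" here means the held-out fold
gap = train_auc - val_auc
print(f"train AUC {train_auc:.3f}, val AUC {val_auc:.3f}, gap {gap:.3f}")
```

An unconstrained tree separates its training folds perfectly, so its training AUC is 1.0 and the whole gap comes from the held-out folds.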

First-Line Fixes

1. Simplify the Model (Reduce Capacity)

Python

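from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Decision tree: limit depth
tree = DecisionTreeClassifier(max_depth=5)

# Neural network: fewer layers / neurons
# From 4 hidden layers of 512 neurons → 2 hidden layers of 64 neurons

# Random forest: limit tree depth and require a minimum leaf size
forest = RandomForestClassifier(max_depth=8, min_samples_leaf=5)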
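
The depth values above are starting points, not recommendations. One way to choose a capacity limit is to sweep it and watch validation accuracy, sketched here with scikit-learn's `validation_curve`. The dataset and the list of candidate depths are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

depths = [1, 2, 3, 5, 8, 12]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:>2}: train {tr:.3f}, val {va:.3f}")

# Pick the depth with the highest mean validation accuracy.
best_depth = depths[int(np.argmax(val_mean))]
print("Best depth by validation accuracy:", best_depth)
```

Training accuracy keeps climbing with depth, while validation accuracy typically peaks and then flattens or drops. That peak marks a reasonable capacity limit.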

2. Add Regularization

Python

from sklearn.linear_model import Ridge, Lasso, LogisticRegression

# L2 regularization: shrinks all weights
ridge = Ridge(alpha=1.0)   # alpha controls strength

# L1 regularization: drives some weights to zero
lasso = Lasso(alpha=0.01)

# Logistic regression: C is inverse of regularization strength
# Smaller C = more regularization
lr = LogisticRegression(C=0.1)
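
To show what "shrinks all weights" and "drives some weights to zero" look like in practice, the sketch below fits unregularized, Ridge, and Lasso models to the same data. The data (2 informative features out of 20) and the alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic regression data (assumed): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)      # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty

for name, fitted in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    norm = np.linalg.norm(fitted.coef_)
    n_zero = int(np.sum(fitted.coef_ == 0.0))
    print(f"{name:<5} coefficient norm {norm:.3f}, exact zeros {n_zero}")
```

Ridge yields a smaller coefficient norm than the unregularized fit, and Lasso sets many of the irrelevant coefficients to exactly zero. That is why L1 is often used as a rough feature selector.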

3. Get More Training Data

More data makes memorization harder. As the training set grows, a model of fixed capacity can no longer fit the noise in every example individually, so it is pushed toward the pattern the examples share.

Python

# If data is scarce, use data augmentation:
# - Image: flips, rotations, color jitter
# - Text: back-translation, synonym replacement
# - Tabular: SMOTE (synthetic minority over-sampling), mainly useful for imbalanced classes
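
As a minimal sketch of the image case, the function below doubles a batch by appending horizontally flipped copies, using only NumPy. The `augment_with_flips` helper and the random "images" are hypothetical, and the sketch assumes that a horizontal flip does not change the label, which is not true for every task (for example, text or digit recognition).

```python
import numpy as np

def augment_with_flips(images, labels):
    """Double a batch by appending horizontally flipped copies.

    images: array of shape (n, height, width) or (n, height, width, channels)
    labels: array of shape (n,)
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])

# Hypothetical batch: 8 grayscale 4x4 images with binary labels.
rng = np.random.default_rng(0)
images = rng.random((8, 4, 4))
labels = rng.integers(0, 2, size=8)

aug_images, aug_labels = augment_with_flips(images, labels)
print(aug_images.shape, aug_labels.shape)
```

Augmentation belongs on the training split only. Augmenting validation data would make the validation score describe a different distribution from the one the model will face.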

4. Early Stopping (Neural Networks)

Python

# Stop training when validation loss starts increasing
# PyTorch: manual patience counter
# Keras: EarlyStopping callback
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=5,          # Stop if no improvement for 5 epochs
    restore_best_weights=True,
)
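# `model` is assumed to be a compiled Keras model defined elsewhere.
# Set epochs high enough for early stopping to trigger; Keras defaults to epochs=1.
model.fit(
    X_train, y_train,
    epochs=100,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],
)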
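
The manual patience counter mentioned for PyTorch can be sketched without any framework, since it only needs the validation loss from each epoch. The `PatienceCounter` class and the loss values below are hypothetical. In a real loop you would also save the model weights whenever a new best loss is recorded.

```python
class PatienceCounter:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0       # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.63, 0.65]

stopper = PatienceCounter(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best epoch was {stopper.best_epoch} "
      f"with loss {stopper.best_loss}")
```

With these losses the best value occurs at epoch 3, and three epochs without improvement stop the loop at epoch 6.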

Interview Answer Template

Q: What is overfitting and how do you fix it?

Overfitting occurs when a model learns training data too closely — including noise — so it performs well on training data but poorly on new data. The signature is a large gap between training and validation performance. Common causes include a model that's too complex for the amount of training data, insufficient training examples, or too many epochs without regularization. The fixes depend on the cause: simplify the model (less depth, fewer neurons), add regularization (L1/L2 for linear models, Dropout for neural networks), gather more training data, or use early stopping. In practice, I'd diagnose by looking at the train/val performance gap, then apply the appropriate fix. If training accuracy is 98% and validation is 70%, that's severe overfitting — I'd start with regularization and reducing model complexity.
