Machine Learning Foundations · Lesson 24 of 70

How to Fix Overfitting: Dropout, Regularization, Data

The Fix Depends on the Cause

Before applying a fix, identify why the model is overfitting:

Model is too complex for data size → reduce capacity or regularize
Not enough training data → augment or collect more
Too many epochs → early stopping
Noisy or irrelevant features → feature selection

Fix 1: L2 Regularization (Ridge)

Adds a penalty to the loss proportional to the sum of the squared weights. This shrinks all weights toward zero (without setting them exactly to zero) and keeps any single feature from dominating.

Python

from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.random.randn(200, 50)   # 200 samples, 50 features (many irrelevant)
y = X[:, 0] + X[:, 1] + np.random.randn(200) * 0.5 > 0   # Only first 2 matter

# Almost no regularization: free to overfit the 48 irrelevant features
no_reg = LogisticRegression(C=1000, max_iter=1000)   # C is the inverse penalty strength; large C = weak regularization
score_no_reg = cross_val_score(no_reg, X, y, cv=5, scoring="accuracy").mean()

# L2 regularization: penalize weight magnitude
l2_reg = LogisticRegression(C=0.01, max_iter=1000)   # C small = strong regularization
score_l2 = cross_val_score(l2_reg, X, y, cv=5, scoring="accuracy").mean()

print(f"Weak regularization (C=1000): {score_no_reg:.2%}")
print(f"L2 regularization: {score_l2:.2%}")
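
To see the shrinkage itself rather than only its effect on accuracy, the sketch below compares the L2 norm of the fitted weight vector at several values of C. It regenerates the same kind of synthetic data with a fixed seed; the seed, the C grid, and the larger `max_iter` are assumptions made for reproducibility, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)            # fixed seed: an assumption for reproducibility
X = rng.standard_normal((200, 50))        # 200 samples, 50 features
y = (X[:, 0] + X[:, 1] + rng.standard_normal(200) * 0.5) > 0   # only the first 2 features matter

weight_norms = {}
for C in [1000, 1, 0.01]:                 # smaller C = stronger L2 penalty
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    weight_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<7} ||w||_2 = {weight_norms[C]:.3f}")
```

The norm should fall as C decreases, which is the penalty doing its job: the model is pushed toward smaller weights on every feature.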

Fix 2: L1 Regularization (Lasso)

Adds a penalty proportional to the sum of the absolute values of the weights. This drives some weights to exactly zero, which effectively performs feature selection.

Python

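import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

# L1 in regression: Lasso expects a continuous target, so build one from the same two features
y_reg = X[:, 0] + X[:, 1] + np.random.randn(X.shape[0]) * 0.5
lasso = Lasso(alpha=0.1)
lasso.fit(X, y_reg)
n_nonzero = np.sum(lasso.coef_ != 0)
print(f"Features with non-zero weights: {n_nonzero} / {X.shape[1]}")
# Often only 2-5 features matter, Lasso zeros out the rest

# L1 in logistic regression (requires a solver that supports L1, such as liblinear or saga)
lr_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
lr_l1.fit(X, y)   # y is the binary label from the L2 example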
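
The difference between the two penalties shows up most clearly in how many weights survive. The following sketch fits L1 and L2 logistic regression at the same C on seeded synthetic data and counts the non-zero coefficients. The seed and the choice of `liblinear` for both models are assumptions made so the comparison is like-for-like.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)            # fixed seed: an assumption for reproducibility
X = rng.standard_normal((200, 50))
y = (X[:, 0] + X[:, 1] + rng.standard_normal(200) * 0.5) > 0   # only features 0 and 1 matter

l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear", max_iter=1000).fit(X, y)

n_nonzero_l1 = int(np.sum(l1.coef_ != 0))
n_nonzero_l2 = int(np.sum(l2.coef_ != 0))
kept = np.flatnonzero(l1.coef_[0])        # indices of features L1 kept

print(f"L1 non-zero weights: {n_nonzero_l1} / {X.shape[1]}")
print(f"L2 non-zero weights: {n_nonzero_l2} / {X.shape[1]}")
print(f"Features kept by L1: {kept.tolist()}")
```

L2 shrinks every weight but leaves all of them non-zero, while L1 discards most of the irrelevant features outright.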

Fix 3: Dropout (Neural Networks)

During training, randomly zero out a fraction of neurons at each forward pass. The network must learn redundant representations — it can't rely on any single neuron.

Python

import torch
import torch.nn as nn

class DrugClassifier(nn.Module):
    def __init__(self, input_dim: int, n_classes: int, dropout_rate: float = 0.3):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(dropout_rate),       # Each activation is zeroed with probability 0.3 during training
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),       # Applied again in second layer
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.network(x)

# IMPORTANT: Dropout is only active in training mode
model = DrugClassifier(input_dim=100, n_classes=3)   # example dimensions, adjust to your data
model.train()   # Dropout is ON
model.eval()    # Dropout is OFF: deterministic inference

Typical dropout rates: 0.1–0.5. Start with 0.2–0.3 for hidden layers.
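
The train/eval distinction can be made concrete with a small NumPy sketch of inverted dropout, the scheme `nn.Dropout` uses: during training each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), so the expected activation is unchanged and inference needs no rescaling. The function name `dropout_forward` and its signature are hypothetical, chosen only for illustration.

```python
import numpy as np

def dropout_forward(x: np.ndarray, p: float, training: bool,
                    rng: np.random.Generator) -> np.ndarray:
    """Inverted dropout: zero each element with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                              # eval mode: identity
    keep_mask = rng.random(x.shape) >= p      # True with probability 1 - p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)                # fixed seed: an assumption for reproducibility
activations = np.ones((1000, 256))

train_out = dropout_forward(activations, p=0.3, training=True, rng=rng)
eval_out = dropout_forward(activations, p=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {np.mean(train_out == 0):.3f}")
print(f"mean activation in training: {train_out.mean():.3f}")
print(f"eval output unchanged: {np.array_equal(eval_out, activations)}")
```

About 30% of activations are zeroed in training mode, yet the mean stays close to 1 because of the rescaling. This is why forgetting `model.eval()` gives noisy, non-deterministic predictions.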

Fix 4: Early Stopping

Stop training when the validation metric stops improving.

Python

class EarlyStopper:
    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.no_improve = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.no_improve = 0
            return False
        else:
            self.no_improve += 1
            return self.no_improve >= self.patience


stopper = EarlyStopper(patience=10)

# train_epoch, evaluate, model, and the data loaders are placeholders for your own training code
for epoch in range(500):
    train_loss = train_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if stopper.should_stop(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
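
The loop above stops `patience` epochs after the best one, so the weights in memory at that point are no longer the best weights. A common remedy is to remember (and checkpoint) the best epoch. The sketch below shows the bookkeeping on a made-up sequence of validation losses; the class name `EarlyStopperWithBest`, its `step` method, and the loss values are all hypothetical.

```python
class EarlyStopperWithBest:
    """Early stopping that also remembers which epoch had the best validation loss."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.no_improve = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch       # in real training, save a checkpoint here
            self.no_improve = 0
            return False
        self.no_improve += 1
        return self.no_improve >= self.patience


# Made-up validation losses: improve until epoch 5, then rise as the model overfits
val_losses = [1.00, 0.80, 0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.50, 0.55, 0.60]

stopper = EarlyStopperWithBest(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch} "
      f"(val_loss={stopper.best_loss:.2f})")
# Stopped at epoch 8, best epoch was 5 (val_loss=0.44)
```

In a real training loop, the line marked as the checkpoint location is where the model state would be saved, and the saved state would be reloaded after the loop exits.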

Fix 5: More Training Data

The most reliable fix — but not always possible.

Python

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Demonstrate: the train/validation gap shrinks as the dataset grows
for n_samples in [100, 500, 2000, 10000]:
    X, y = make_classification(n_samples=n_samples, n_features=50, n_informative=5)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2)

    model = GradientBoostingClassifier()
    model.fit(X_tr, y_tr)

    # Gap between training and validation accuracy: larger gap = more overfitting
    gap = model.score(X_tr, y_tr) - model.score(X_val, y_val)
    print(f"n={n_samples:5d}: gap={gap:.3f}")

# Illustrative output (no seed is set, so exact values vary from run to run):
# n=  100: gap=0.287  (severe overfitting)
# n=  500: gap=0.098  (moderate)
# n= 2000: gap=0.031  (mild)
# n=10000: gap=0.008  (negligible)
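
The same experiment can be run on a single dataset with scikit-learn's `learning_curve`, which trains on growing subsets and cross-validates each one. This is a sketch under assumptions: the logistic regression model, the seed, and the subset fractions are chosen here for speed and reproducibility and are not part of the loop above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=50, n_informative=5, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=[0.05, 0.2, 0.5, 1.0],   # fractions of the available training folds
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds, then take train minus validation accuracy
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, gap in zip(train_sizes, gaps):
    print(f"n_train={int(n):5d}: gap={gap:.3f}")
```

If the gap is still shrinking at the largest training size, collecting more data is likely to help. If it has flattened out, other fixes are a better use of effort.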

Fix 6: Data Augmentation

When collecting more data isn't possible, augment the existing data synthetically.

Python

# Text augmentation: paraphrase, synonym replacement, back-translation
import random

def augment_clinical_text(note: str) -> str:
    """Simple synonym replacement for clinical text augmentation."""
    replacements = {
        "warfarin": "Coumadin",
        "aspirin": "acetylsalicylic acid",
        "metformin": "Glucophage",
    }
    for original, synonym in replacements.items():
        if random.random() < 0.3:   # 30% chance to replace
            note = note.replace(original, synonym)
    return note

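# Tabular augmentation: SMOTE (Synthetic Minority Over-sampling Technique)
# Assumes X_train, y_train are an existing training split with binary 0/1 labels.
# Resample the training split only; leave validation and test data untouched.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Before: {y_train.sum()} positives | After: {y_resampled.sum()} positives")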
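
In practice, augmented examples are added alongside the originals, and only in the training split. The sketch below shows one way to do that for the text case, with whole-word, case-insensitive matching and a seeded random generator. The helper `augment_note`, the toy notes, and the labels are hypothetical.

```python
import random
import re

REPLACEMENTS = {
    "warfarin": "Coumadin",
    "aspirin": "acetylsalicylic acid",
    "metformin": "Glucophage",
}

def augment_note(note: str, rng: random.Random, p: float = 0.3) -> str:
    """Swap each known drug name for its synonym with probability p (whole words only)."""
    for original, synonym in REPLACEMENTS.items():
        if rng.random() < p:
            pattern = rf"\b{re.escape(original)}\b"
            note = re.sub(pattern, synonym, note, flags=re.IGNORECASE)
    return note

# Hypothetical toy training set: (note, label) pairs
train_notes = [
    ("Patient started on warfarin 5 mg daily.", 1),
    ("Continue metformin; add aspirin 81 mg.", 0),
]

rng = random.Random(42)                       # seeded so runs are repeatable
augmented = [(augment_note(text, rng), label) for text, label in train_notes]
train_augmented = train_notes + augmented     # originals plus one augmented copy each

for text, label in train_augmented:
    print(label, text)
```

Labels are carried over unchanged, which is only valid when the replacement preserves meaning. Validation and test notes should stay unaugmented so the reported metric reflects real data.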

Fix 7: Reduce Model Complexity

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Decision tree: limit depth
tree = DecisionTreeClassifier(
    max_depth=5,            # Was None (unlimited)
    min_samples_leaf=10,    # Require at least 10 samples per leaf
    min_samples_split=20,   # Require at least 20 samples to split
)

# Random forest: limit individual trees
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    max_features="sqrt",    # Each split considers only sqrt(n_features) randomly chosen features
    min_samples_leaf=5,
)

# Neural network: fewer parameters
# From [512, 512, 256, 128] → [64, 32]
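
The effect of these limits can be checked by fitting an unconstrained and a constrained tree on the same split and comparing training accuracy, validation accuracy, and leaf count. This is a sketch on synthetic data; the dataset, the seeds, and the `results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

configs = {
    "unlimited": {"max_depth": None},
    "constrained": {"max_depth": 5, "min_samples_leaf": 10, "min_samples_split": 20},
}

results = {}
for name, params in configs.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    results[name] = (train_acc, val_acc, tree.get_n_leaves())
    print(f"{name:12s} train={train_acc:.3f} val={val_acc:.3f} "
          f"gap={train_acc - val_acc:.3f} leaves={tree.get_n_leaves()}")
```

The unlimited tree memorizes the training split (training accuracy of 1.0) using many more leaves. The constrained tree gives up some training accuracy in exchange for a smaller train/validation gap.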

Decision Guide

| Symptom | Most Likely Fix |
|---|---|
| Large train/val gap, limited data | Regularization (L1/L2/Dropout) |
| Val loss increases after many epochs | Early stopping |
| Very few training examples | Data augmentation + collect more |
| Many irrelevant features | L1 regularization or feature selection |
| Model is too complex by design | Reduce capacity (fewer layers/neurons/depth) |
| Class imbalance + overfitting minority | SMOTE + class weighting |

Interview Answer Template

Q: How would you fix an overfitting model?

The fix depends on the root cause. If the model is too complex, I'd reduce capacity — fewer layers, limited tree depth, or lower max features. I'd add L2 regularization for linear models or Dropout (rate 0.2–0.5) for neural networks to penalize complexity. If training is running too long, I'd add early stopping with patience of 5–10 epochs. If the dataset is small, data augmentation (synonym replacement for text, SMOTE for tabular) can help. More training data is the most reliable fix when feasible. L1 regularization is useful when many features are irrelevant — it zeros out the weights of noisy features automatically. In practice, I'd monitor the train/val gap throughout training, not just at the end, so I can intervene early if overfitting emerges.
