What is Regularization?

The Problem Regularization Solves

Without constraints, a sufficiently flexible model can minimize training loss by memorizing the training data. Every noisy point and every quirk gets encoded into its parameters.
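
The snippets in this article use X_train, y_train, and X_train_scaled without defining them. A minimal sketch of one way to create them, assuming a synthetic binary classification problem; the dataset, its size, and the 80/20 split are illustrative assumptions and not part of the original examples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic dataset: 20 features, only 5 informative,
# with 10% label noise so an unconstrained model has noise to memorize.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    flip_y=0.1,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Any real dataset with a binary target would work the same way; only the variable names matter for the snippets that follow.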

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumes X_train and y_train hold an existing training split.

# No regularization: an unlimited-depth tree can memorize the training data
unconstrained = DecisionTreeClassifier(max_depth=None)
unconstrained.fit(X_train, y_train)

train_score = unconstrained.score(X_train, y_train)
cv_scores   = cross_val_score(unconstrained, X_train, y_train, cv=5)

print(f"Train accuracy:   {train_score:.3f}")    # typically 1.000: perfect memorization
print(f"CV accuracy:      {cv_scores.mean():.3f}")   # usually lower: the tree does not generalize as well
print(f"Overfit gap:      {train_score - cv_scores.mean():.3f}")

What Regularization Does

Regularization adds a penalty term to the loss function that discourages large weights. The model must now minimize both the training loss and the weight penalty simultaneously.

Standard loss:
  Minimize: L(y, ŷ)
  → Model will fit training data perfectly if it can

Regularized loss:
  Minimize: L(y, ŷ) + λ × Penalty(weights)
  → Model must trade off: reduce training loss vs keep weights small
  → Large weights that only explain noise become too expensive to keep

λ (lambda/alpha) = regularization strength
  λ = 0:      no regularization (standard loss)
  λ = large:  heavy regularization (weights forced toward zero)
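
To make the trade-off concrete, here is a minimal sketch for one specific choice of loss and penalty: squared error with an L2 penalty (ridge regression), which has a closed-form minimizer. The data, the helper names ridge_weights and regularized_loss, and the λ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 10 features, only the first 3 matter.
X = rng.normal(size=(50, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_w + rng.normal(scale=1.0, size=50)


def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def regularized_loss(X, y, w, lam):
    """Return (training loss, penalty, total objective) for weights w."""
    training_loss = np.sum((y - X @ w) ** 2)
    penalty = lam * np.sum(w ** 2)
    return training_loss, penalty, training_loss + penalty


lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
train_losses = []
for lam in lambdas:
    w = ridge_weights(X, y, lam)
    training_loss, penalty, total = regularized_loss(X, y, w, lam)
    norms.append(np.linalg.norm(w))
    train_losses.append(training_loss)
    print(f"lambda={lam:7.1f}  train_loss={training_loss:8.2f}  "
          f"penalty={penalty:8.2f}  |w|={norms[-1]:.3f}")
```

As λ grows, the weight norm shrinks and the training loss rises: the optimizer gives up some fit to pay a smaller penalty.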

The Weight Penalty in Practice

Python

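import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Assumes X_train_scaled (standardized features) and y_train (binary labels) already exist.
# LogisticRegression uses C = 1/λ (inverse of regularization strength)
# Small C = strong regularization, Large C = weak regularization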

results = []
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)

    train_auc = roc_auc_score(y_train, model.predict_proba(X_train_scaled)[:, 1])
    cv_auc    = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="roc_auc").mean()
    weight_norm = np.linalg.norm(model.coef_)   # total magnitude of weights

    results.append((C, train_auc, cv_auc, weight_norm))
    print(f"C={C:6}: train_AUC={train_auc:.3f}, CV_AUC={cv_auc:.3f}, |w|={weight_norm:.2f}")

# As C increases (less regularization), the typical pattern is:
# → Train AUC increases (fits training data better)
# → Weight norm increases (larger weights)
# → CV AUC peaks, then flattens or decreases (overfitting begins)

Regularization and Bias-Variance

More regularization (larger λ):
  → Weights are smaller, so the fitted function is simpler
  → Model is less flexible → higher bias
  → Model is less sensitive to training data → lower variance
  → Result: underfitting if λ is too large

Less regularization (smaller λ):
  → Weights can grow large, so the fitted function can be complex
  → Model is more flexible → lower bias
  → Model memorizes training data → higher variance
  → Result: overfitting if λ is too small

Optimal λ:
  → Minimizes expected test error = bias² + variance + irreducible noise
  → Estimated in practice by cross-validation
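
The bias and variance terms above can be estimated empirically when the true function is known. The following is a rough sketch under stated assumptions: a hypothetical sine target on a fixed grid, a degree-9 polynomial Ridge model, and repeated noise draws standing in for fresh training sets. Here scikit-learn's alpha plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical setup: noisy samples of a known function on a fixed grid.
x = np.linspace(0.0, 1.0, 30).reshape(-1, 1)
f_true = np.sin(2 * np.pi * x).ravel()
noise_std = 0.3
n_repeats = 200


def bias_variance(alpha):
    """Estimate bias^2 and variance of a degree-9 polynomial Ridge fit."""
    preds = np.empty((n_repeats, len(x)))
    for i in range(n_repeats):
        # Each new noise draw acts as a new training set from the same process.
        y = f_true + rng.normal(scale=noise_std, size=len(x))
        model = make_pipeline(
            PolynomialFeatures(degree=9, include_bias=False),
            Ridge(alpha=alpha),
        )
        preds[i] = model.fit(x, y).predict(x)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


results = {alpha: bias_variance(alpha) for alpha in [1e-4, 1e-2, 1.0, 1e3]}
for alpha, (bias_sq, variance) in results.items():
    print(f"alpha={alpha:8.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"sum={bias_sq + variance:.4f}")
```

In this setup the smallest alpha should show low bias and comparatively high variance, and the largest alpha the reverse. With real data the true function is unknown, which is why cross-validation is used to estimate the combined error instead.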

Three Types of Regularization

Python

from sklearn.linear_model import LogisticRegression
import torch.nn as nn

# L2 Regularization (Ridge): add sum of squared weights
# Loss = L(y, ŷ) + λ × Σ(wᵢ²)
# → Shrinks all weights toward zero, but none become exactly zero
lr_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# L1 Regularization (Lasso): add sum of absolute weights
# Loss = L(y, ŷ) + λ × Σ|wᵢ|
# → Forces some weights to exactly zero → automatic feature selection
lr_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear", max_iter=1000)

# Elastic Net: L1 + L2 combination
lr_elastic = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                                solver="saga", max_iter=1000)

# Dropout (Neural Networks): randomly zero out activations during training
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Dropout(0.3),     # 30% dropout = regularization
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
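            nn.Linear(32, 1),
        )

    def forward(self, x):
        # Dropout is active only in training mode (model.train()), not after model.eval()
        return self.layers(x)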
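
To show what a dropout layer does to its input, here is a minimal NumPy sketch of inverted dropout, the scheme nn.Dropout uses: units are zeroed with probability p during training and the survivors are rescaled by 1 / (1 - p). The function name dropout and its signature are hypothetical, written only for illustration:

```python
import numpy as np


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p during training.

    Surviving units are scaled by 1 / (1 - p) so the expected activation
    is unchanged; at inference time the input passes through untouched.
    """
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
activations = np.ones((1000, 64))

train_out = dropout(activations, p=0.3, training=True, rng=rng)
eval_out = dropout(activations, p=0.3, training=False, rng=rng)

dropped_fraction = np.mean(train_out == 0.0)
print(f"Fraction zeroed in training: {dropped_fraction:.3f}")
print(f"Mean activation in training: {train_out.mean():.3f}")
print(f"Eval output unchanged:       {np.array_equal(eval_out, activations)}")
```

In PyTorch the training flag is controlled by calling model.train() or model.eval(), so forgetting model.eval() before evaluation leaves dropout switched on and makes predictions noisy.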

Regularization Strength Selection

Python

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

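# Search for the regularization strength with cross-validation.
# Scaling lives inside the pipeline, so pass the unscaled X_train:
# the scaler is refit on the training portion of each fold.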
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

param_grid = {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")

# Print all results
for params, mean_score, std_score in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"  C={params['model__C']:6}: {mean_score:.3f} ± {std_score:.3f}")

When Regularization Helps Most

High feature-to-sample ratio (many features, few samples):
  → Without regularization: model uses irrelevant features, overfits
  → With regularization: irrelevant weights shrink toward zero

Correlated features:
  → Without L2: model assigns arbitrary weights to correlated features
  → With L2: weights are spread across correlated features

Sparse signal (few truly predictive features out of many):
  → With L1: automatically zeroes out non-predictive features

Neural networks with large parameter counts:
  → Without dropout: neurons co-adapt and memorize training data
  → With dropout: neurons must learn robust, independent features
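
The first and third cases can be illustrated together with a small experiment. This is a sketch assuming a hypothetical regression problem with more features than samples and only five informative features; the alpha values are arbitrary illustrative choices, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical sparse-signal problem: 60 samples, 100 features, 5 informative.
n_samples, n_features, n_informative = 60, 100, 5
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:n_informative] = [4.0, -3.0, 3.0, -2.5, 2.0]
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.3).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

lasso_nonzero = np.flatnonzero(lasso.coef_)
ridge_nonzero = np.flatnonzero(ridge.coef_)

print(f"Lasso keeps {len(lasso_nonzero)} of {n_features} features: {lasso_nonzero}")
print(f"Ridge keeps {len(ridge_nonzero)} of {n_features} features")
```

Lasso should keep the informative features and set most of the rest to exactly zero, while Ridge shrinks every coefficient but leaves all of them nonzero.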

Interview Answer Template

Q: What is regularization and why do we use it?

Regularization adds a penalty term to the loss function that discourages large model weights. Without it, a flexible model can minimize training loss by memorizing noise in the training data, which is overfitting. The regularized objective forces the model to reduce training loss and keep weights small at the same time, which pushes it toward simpler solutions that generalize better. The penalty strength λ (or C = 1/λ in scikit-learn's LogisticRegression) controls the tradeoff: larger λ means more regularization, simpler models, higher bias, and lower variance. Smaller λ means less regularization, more complex models, lower bias, and higher variance. The right λ is found by cross-validation. The two main forms are L2 (squared weights: shrinks all weights, keeps all features) and L1 (absolute weights: drives some weights to exactly zero, performing feature selection). For neural networks, dropout serves a similar purpose by randomly disabling neurons during training.
