What is Regularization?
Regularization fundamentals: why models overfit, what regularization adds to the loss function, how it constrains model complexity, and the intuition behind the bias-variance tradeoff it controls.
The Problem Regularization Solves
Without constraints, a sufficiently flexible model can minimize training loss by memorizing the training data. Every noisy point and every quirk of the sample gets encoded into the model's parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
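# Assumes X_train and y_train are already defined (any labeled classification dataset)
# No regularization: an unconstrained tree can memorize the training data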
unconstrained = DecisionTreeClassifier(max_depth=None)
unconstrained.fit(X_train, y_train)
train_score = unconstrained.score(X_train, y_train)
cv_scores = cross_val_score(unconstrained, X_train, y_train, cv=5)
print(f"Train accuracy: {train_score:.3f}")  # typically 1.000: perfect memorization
print(f"CV accuracy: {cv_scores.mean():.3f}")  # usually much lower: doesn't generalize
print(f"Overfit gap: {train_score - cv_scores.mean():.3f}")
What Regularization Does
Regularization adds a penalty term to the loss function that discourages large weights. The model must now minimize both the training loss and the weight penalty simultaneously.
Standard loss:
Minimize: L(y, ŷ)
→ Model will fit training data perfectly if it can
Regularized loss:
Minimize: L(y, ŷ) + λ × Penalty(weights)
→ Model must trade off: reduce training loss vs keep weights small
→ Large weights that only explain noise become too expensive to keep
λ (lambda/alpha) = regularization strength
λ = 0: no regularization (standard loss)
λ = large: heavy regularization (weights forced toward zero)
The Weight Penalty in Practice
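To make the regularized objective concrete, here is a minimal sketch that writes the loss out by hand and minimizes it with plain gradient descent. It assumes a linear model, mean squared error as L, and the sum of squared weights as the penalty; the toy data and the names `predict`, `regularized_loss`, and `fit` are made up for illustration and are not part of any library.

```python
def predict(weights, row):
    """Linear model prediction for one example."""
    return sum(w * x for w, x in zip(weights, row))


def regularized_loss(weights, X, y, lam):
    """Mean squared error plus lam * sum of squared weights (an L2 penalty)."""
    mse = sum((predict(weights, row) - t) ** 2 for row, t in zip(X, y)) / len(y)
    penalty = sum(w * w for w in weights)
    return mse + lam * penalty


def fit(X, y, lam, lr=0.05, steps=2000):
    """Minimize regularized_loss with plain gradient descent, starting from zero."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for _ in range(steps):
        errors = [predict(weights, row) - t for row, t in zip(X, y)]
        # Gradient of the MSE term plus gradient of the penalty term
        grads = [
            2.0 * sum(e * row[j] for e, row in zip(errors, X)) / n
            + 2.0 * lam * weights[j]
            for j in range(d)
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical toy data, roughly y = 2*x1 - 1*x2 plus a little noise
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
y = [2.1, -0.9, 1.2, 4.8]

lambdas = [0.0, 0.1, 1.0, 10.0]
norms = []
for lam in lambdas:
    w = fit(X, y, lam)
    norm = sum(v * v for v in w) ** 0.5
    norms.append(norm)
    print(f"lambda={lam:5.1f}  weights={[round(v, 3) for v in w]}  |w|={norm:.3f}")
```

As λ grows, the printed weight norm shrinks: the penalty term makes large weights more expensive than the training error they remove.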
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
import numpy as np
# Assumes X_train_scaled (standardized features) and y_train (binary labels) are already defined
# LogisticRegression uses C = 1/λ (inverse of regularization strength)
# Small C = strong regularization, Large C = weak regularization
results = []
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train_scaled)[:, 1])
    cv_auc = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="roc_auc").mean()
    weight_norm = np.linalg.norm(model.coef_)  # L2 norm of the weight vector
    results.append((C, train_auc, cv_auc, weight_norm))
    print(f"C={C:6}: train_AUC={train_auc:.3f}, CV_AUC={cv_auc:.3f}, |w|={weight_norm:.2f}")
# As C increases (less regularization), the typical pattern is:
# → Train AUC increases (fits training data better)
# → Weight norm increases (larger weights)
# → CV AUC peaks then decreases (overfitting begins)
Regularization and Bias-Variance
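A small simulation can make the effect of λ on bias and variance concrete. The sketch below is an illustration under simplifying assumptions, not a general recipe: a single weight, fixed inputs, Gaussian label noise, and a squared-weight penalty, so the minimizer has a closed form. It measures the bias and variance of the fitted weight itself rather than of predictions, and the constants (`TRUE_W`, `NOISE_SD`, and so on) are made up.

```python
import random
import statistics

TRUE_W = 2.0      # hypothetical "true" slope
NOISE_SD = 1.0    # standard deviation of the label noise
N_POINTS = 20
N_TRIALS = 2000

# Fixed inputs, evenly spaced in [-1, 1]; only the label noise is resampled
xs = [-1.0 + 2.0 * i / (N_POINTS - 1) for i in range(N_POINTS)]
sum_x2 = sum(x * x for x in xs)

rng = random.Random(0)
# One value of sum(x * y) per simulated training set
sum_xy_per_trial = []
for _ in range(N_TRIALS):
    ys = [TRUE_W * x + rng.gauss(0.0, NOISE_SD) for x in xs]
    sum_xy_per_trial.append(sum(x * y for x, y in zip(xs, ys)))

lambdas = [0.0, 1.0, 10.0, 100.0]
bias_sq, variance = [], []
for lam in lambdas:
    # Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight
    estimates = [sxy / (sum_x2 + lam) for sxy in sum_xy_per_trial]
    bias_sq.append((statistics.fmean(estimates) - TRUE_W) ** 2)
    variance.append(statistics.pvariance(estimates))
    print(f"lambda={lam:6.1f}  bias^2={bias_sq[-1]:.4f}  variance={variance[-1]:.4f}")
```

Across the simulated training sets, larger λ pulls the estimate further from the true slope (bias² rises) while making it vary less from one training set to the next (variance falls).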
More regularization (larger λ):
→ Weights are smaller, so the model is simpler
→ Model is less flexible → higher bias
→ Model is less sensitive to training data → lower variance
→ Result: underfitting if λ is too large
Less regularization (smaller λ):
→ Weights can be large, so the model is more complex
→ Model is more flexible → lower bias
→ Model memorizes training data → higher variance
→ Result: overfitting if λ is too small
Optimal λ:
→ Minimizes expected test error = bias² + variance + irreducible noise
→ Found by cross-validation
Three Types of Regularization
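Why an L1 penalty yields exact zeros while an L2 penalty only shrinks can be seen on a single weight. The following is a minimal sketch, assuming a quadratic loss 0.5 × (w − z)² around an unregularized value z, for which both penalized problems have closed-form minimizers. The function names and the example values are made up for illustration.

```python
def l2_solution(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)


def l1_solution(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # weights with |z| <= lam are set exactly to zero


# Hypothetical unregularized weights: two strong ones and three weak ones
unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]
lam = 0.5

l2_weights = [l2_solution(z, lam) for z in unregularized]
l1_weights = [l1_solution(z, lam) for z in unregularized]

print("L2:", [round(w, 3) for w in l2_weights])
print("L1:", [round(w, 3) for w in l1_weights])
print("Exact zeros with L1:", sum(w == 0.0 for w in l1_weights))
```

L2 scales every weight by the same factor, so small weights get smaller but stay nonzero. L1 subtracts a fixed amount, so any weight whose magnitude is below the threshold is removed entirely.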
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.neural_network import MLPClassifier
import torch.nn as nn
# L2 Regularization (Ridge): add sum of squared weights
# Loss = L(y, ŷ) + λ × Σ(wᵢ²)
# → Shrinks all weights toward zero, but none become exactly zero
lr_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# L1 Regularization (Lasso): add sum of absolute weights
# Loss = L(y, ŷ) + λ × Σ|wᵢ|
# → Forces some weights to exactly zero → automatic feature selection
lr_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear", max_iter=1000)
# Elastic Net: L1 + L2 combination
lr_elastic = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                                solver="saga", max_iter=1000)
# Dropout (Neural Networks): randomly zero out activations during training
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Dropout(0.3),  # 30% dropout = regularization
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, 1),
        )
    def forward(self, x):
        # Dropout is active only in train() mode; eval() disables it
        return self.layers(x)
Regularization Strength Selection
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Always search for regularization strength with cross-validation
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")
# Print all results
for params, mean_score, std_score in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f" C={params['model__C']:6}: {mean_score:.3f} ± {std_score:.3f}")
When Regularization Helps Most
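The correlated-features case covered in this section can be checked by hand on two features. The sketch below is a toy illustration, assuming a squared-error loss with a squared-weight (L2) penalty and solving the 2×2 normal equations directly. The data and the helper `solve_ridge_2d` are hypothetical, and the noise in `y` was deliberately chosen to line up with the tiny difference between the two features.

```python
def solve_ridge_2d(x1, x2, y, lam):
    """Closed-form minimizer of sum((y - w1*x1 - w2*x2)^2) + lam*(w1^2 + w2^2)."""
    a11 = sum(a * a for a in x1) + lam
    a22 = sum(b * b for b in x2) + lam
    a12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    # Cramer's rule on the 2x2 normal equations
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical data: x2 is almost a copy of x1, and y is about 2 * x1 plus small noise
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 1.99, 3.01, 3.99]
y = [2.1, 3.9, 6.1, 7.9]

ols = solve_ridge_2d(x1, x2, y, lam=0.0)    # no regularization
ridge = solve_ridge_2d(x1, x2, y, lam=1.0)  # L2 penalty

print(f"No penalty: w1={ols[0]:.3f}, w2={ols[1]:.3f}")
print(f"L2 penalty: w1={ridge[0]:.3f}, w2={ridge[1]:.3f}")
```

On this data the unpenalized fit lands near (-8, 10): large weights of opposite sign that cancel each other to chase the noise. The penalized fit puts a weight close to 1 on each feature, splitting the effect between the two near-duplicates.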
High feature-to-sample ratio (many features, few samples):
→ Without regularization: model uses irrelevant features, overfits
→ With regularization: irrelevant weights shrink toward zero
Correlated features:
→ Without L2: model assigns arbitrary weights to correlated features
→ With L2: weights are spread across correlated features
Sparse signal (few truly predictive features out of many):
→ With L1: automatically zeroes out non-predictive features
Neural networks with large parameter counts:
→ Without dropout: neurons co-adapt and memorize training data
→ With dropout: neurons must learn robust, independent features
Interview Answer Template
Q: What is regularization and why do we use it?
Regularization adds a penalty term to the loss function that discourages large model weights. Without it, a flexible model can minimize training loss by memorizing noise in the training data, which is overfitting. The regularized objective forces the model to reduce training loss and keep weights small at the same time, which pushes it toward simpler solutions that generalize better. The penalty strength λ (or C = 1/λ in sklearn) controls the tradeoff: larger λ means more regularization, simpler models, higher bias, and lower variance; smaller λ means less regularization, more complex models, lower bias, and higher variance. The right λ is found by cross-validation. The two main forms are L2 (squared weights: shrinks all weights, keeps all features) and L1 (absolute weights: drives some weights to exactly zero, performing feature selection). For neural networks, dropout serves a similar purpose by randomly disabling neurons during training.