L1 and L2 Regularisation
How L1 and L2 weight penalties prevent overfitting, their probabilistic interpretation as priors, and when to use each.
The Regularisation Idea
Without regularisation: the model optimises only the data loss.
With regularisation: optimise data loss + penalty on weight magnitude.
L_total = L_data + λ · R(W)
Too large λ: model focuses on the penalty → underfits
Too small λ: penalty ignored → overfits
Just right λ: constrains weight space → better generalisation
L2 Regularisation (Weight Decay)
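To make the effect of λ concrete for the L2 case, here is a minimal sketch that sweeps λ on a small synthetic regression task, using closed-form ridge regression with L_data taken as the sum of squared errors. The dataset, the λ grid and the helper names `ridge_fit` and `mse` are illustrative assumptions, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic task: 20 features, only the first 3 carry signal.
n_train, n_val, d = 30, 200, 20
true_w = np.zeros(d)
true_w[:3] = [1.5, -2.0, 0.8]
X_train = rng.normal(size=(n_train, d))
X_val = rng.normal(size=(n_val, d))
y_train = X_train @ true_w + 0.5 * rng.normal(size=n_train)
y_val = X_val @ true_w + 0.5 * rng.normal(size=n_val)


def ridge_fit(X, y, lam):
    """Minimise ||Xw - y||^2 + lam * ||w||^2 via the normal equations."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


sweep = []  # rows of (lam, weight norm, train MSE, validation MSE)
for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge_fit(X_train, y_train, lam)
    sweep.append((lam, float(np.linalg.norm(w)),
                  mse(X_train, y_train, w), mse(X_val, y_val, w)))

for lam, w_norm, train_mse, val_mse in sweep:
    print(f"lam={lam:7.1f}  ||w||={w_norm:.3f}  "
          f"train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

Two trends hold for ridge regardless of the data: the weight norm shrinks and the training error rises as λ grows. The validation column is what decides which λ is "just right", and where it bottoms out depends on the data.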
import torch
import torch.nn as nn
import numpy as np
# L2 penalty: R(W) = Σ w_i²
# Gradient: ∂R/∂w = 2w
# Effect on a plain SGD update: W ← W - α(grad_L + λ·2W) = W(1 - 2αλ) - α·grad_L
# Each step shrinks ("decays") the weights toward zero, hence the name "weight decay"
# PyTorch implementation: weight_decay parameter in optimizer
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
# AdamW multiplies the weights by (1 - lr · weight_decay) each step,
# so weight_decay plays the role of 2λ in the update formula above
optimizer_l2 = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,  # small decay coefficient
)
# What AdamW does vs Adam:
# Adam: adds weight_decay · W to the gradient, so the decay is rescaled by the adaptive learning rate (coupled)
# AdamW: applies the decay directly to the weights, separately from the adaptive step (decoupled; usually preferred)
# Manual L2 penalty (for understanding):
def compute_l2_penalty(model: nn.Module, lambda_l2: float = 1e-4) -> torch.Tensor:
    # Sum of squared entries over all parameters (biases included)
    squared_norm = sum(param.pow(2).sum() for param in model.parameters())
    return lambda_l2 * squared_norm
X = torch.randn(32, 20)
y = torch.randint(0, 2, (32,)).float()
criterion = nn.BCEWithLogitsLoss()
# Use an optimizer without weight decay here; combining the manual penalty
# with optimizer_l2 would regularise the weights twice
optimizer_manual = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer_manual.zero_grad()
loss = criterion(model(X).squeeze(-1), y)
l2_reg = compute_l2_penalty(model, 1e-4)
total_loss = loss + l2_reg
total_loss.backward()
optimizer_manual.step()
print(f"Data loss: {loss.item():.4f}, L2 penalty: {l2_reg.item():.6f}")
L1 Regularisation (Lasso)
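A plain gradient step on the L1 penalty Σ|w_i| alone moves every non-zero weight toward zero by the same amount αλ, whereas the L2 step W(1 - 2αλ) shrinks each weight in proportion to its magnitude. The NumPy sketch below shows that difference; the weights, learning rate, λ and function names are illustrative assumptions. The third function is soft-thresholding, a proximal step that this article does not otherwise use, included only to show how a small weight can land on exactly zero instead of overshooting it.

```python
import numpy as np


def l2_step(w, lr, lam):
    """One gradient step on lam * sum(w**2) alone: proportional shrinkage."""
    return w * (1.0 - 2.0 * lr * lam)


def l1_subgradient_step(w, lr, lam):
    """One subgradient step on lam * sum(|w|) alone: constant shift toward zero."""
    return w - lr * lam * np.sign(w)


def l1_soft_threshold(w, lr, lam):
    """Proximal step for the L1 penalty: same shift, but clipped at exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)


w = np.array([2.0, -0.5, 0.03])
lr, lam = 0.1, 0.5  # illustrative values; lr * lam = 0.05

w_l2 = l2_step(w, lr, lam)
w_l1_sub = l1_subgradient_step(w, lr, lam)
w_l1_prox = l1_soft_threshold(w, lr, lam)

print("L2 step:          ", w_l2)
print("L1 subgradient:   ", w_l1_sub)
print("L1 soft-threshold:", w_l1_prox)
```

The smallest weight (0.03) is smaller than the shift lr · lam, so the subgradient step carries it past zero to a small negative value, while soft-thresholding sets it to exactly 0. This is one reason gradient-trained networks tend to show near-zero rather than exactly-zero weights under an L1 penalty.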
import torch
import torch.nn as nn
# L1 penalty: R(W) = Σ |w_i|
# (Sub)gradient: ∂R/∂w = sign(w)
# Effect: constant pull toward zero, regardless of magnitude
# Key difference: L1 promotes SPARSITY. Solvers built for L1 (e.g. coordinate descent
# in Lasso) set many weights to exactly 0; with plain gradient steps, as used here,
# weights hover near 0 instead, so sparsity is measured with a small threshold
def compute_l1_penalty(model: nn.Module, lambda_l1: float = 1e-4) -> torch.Tensor:
    # Sum of absolute values over all parameters (biases included)
    abs_sum = sum(param.abs().sum() for param in model.parameters())
    return lambda_l1 * abs_sum
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
# Training with L1 (random data, for illustration only)
for step in range(100):
    X = torch.randn(32, 20)
    y = torch.randint(0, 2, (32,)).float()
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(-1), y)
    l1_reg = compute_l1_penalty(model, lambda_l1=1e-3)
    (loss + l1_reg).backward()
    optimizer.step()
# Check sparsity after training
for name, param in model.named_parameters():
    sparsity = (param.abs() < 1e-4).float().mean().item()
    print(f"{name}: {sparsity:.1%} near-zero weights")
# L1 in practice for neural networks:
# PyTorch has no built-in L1 weight decay in optimizers → add manually to loss
# Note: the L1 gradient is discontinuous at 0 (the penalty is not differentiable there) → less common than L2 for deep learning
L1 vs L2: Geometric Interpretation
L2 sphere: sum of squared weights ≤ C (smooth, round ball)
L1 diamond: sum of |weights| ≤ C (has corners at axes)
The optimal solution occurs where the data loss contour first touches the constraint.
L2: typically touches the sphere at a smooth non-axis point → most weights small but non-zero
L1: likely to touch the diamond at a CORNER (on an axis) → many weights exactly 0
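The touching-point picture can be sketched numerically under one simplifying assumption: if the data loss has circular contours, i.e. it equals ||w - w*||² around an unconstrained optimum w*, then the first contact with the constraint is the Euclidean projection of w* onto the constraint set. Real losses have elliptical contours, so this is qualitative only. The function names and the value of w* are illustrative, and the L1 projection uses the standard sort-based method.

```python
import numpy as np


def project_l2_ball(v, radius=1.0):
    """Euclidean projection onto {w : ||w||_2 <= radius}: rescale if outside."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= radius else v * (radius / norm)


def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto {w : ||w||_1 <= radius} (sort-based method)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes, largest first
    cumulative = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    # Index of the last coordinate that stays non-zero after shrinkage
    rho = np.nonzero(u * k > cumulative - radius)[0][-1]
    theta = (cumulative[rho] - radius) / (rho + 1)  # shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)


w_star = np.array([2.0, 0.5])  # illustrative unconstrained optimum
w_l2 = project_l2_ball(w_star)
w_l1 = project_l1_ball(w_star)

print("L2-constrained solution:", w_l2)
print("L1-constrained solution:", w_l1)
```

For this w*, the L2 solution keeps both coordinates non-zero (it is w* rescaled onto the unit circle), while the L1 solution lands on the corner (1, 0) of the diamond, with the second weight exactly zero.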
This is why:
L2 (weight decay): all weights small, none exactly zero
L1 (Lasso): sparse solution — most weights zero, few large weights
Elastic Net: L1 + L2 combined (both sparse + small)
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Demonstrate sparsity effect on synthetic regression
np.random.seed(42)
n, d = 100, 50  # 100 samples, 50 features (more samples than features)
X = np.random.randn(n, d)
true_w = np.zeros(d)
true_w[[0, 5, 12]] = [1.5, -2.0, 0.8]  # only 3 true features
y = X @ true_w + 0.1 * np.random.randn(n)
for estimator, name in [
    (Ridge(alpha=1.0), "L2 Ridge"),
    (Lasso(alpha=0.1), "L1 Lasso"),
    (ElasticNet(alpha=0.1, l1_ratio=0.5), "Elastic Net"),
]:
    estimator.fit(X, y)
    w = estimator.coef_
    n_zero = (np.abs(w) < 0.01).sum()
    recovered = [i for i in [0, 5, 12] if abs(w[i]) > 0.1]
    print(f"{name:15s}: {n_zero}/{d} weights near-zero, true features recovered: {recovered}")
Probabilistic Interpretation
L2 regularisation = MAP estimation with a Gaussian prior on the weights:
P(W) ∝ exp(-λ||W||²) ↔ each w_i ~ N(0, σ²) with variance σ² = 1/(2λ)
Maximising log P(data|W) + log P(W) is equivalent to minimising
negative log-likelihood (e.g. cross-entropy loss) + λ||W||²
L1 regularisation = MAP estimation with a Laplace prior on the weights:
P(W) ∝ exp(-λ||W||₁) ↔ each w_i ~ Laplace(0, b) with scale b = 1/λ
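Both correspondences can be checked numerically: the negative log-density of each prior should differ from the matching penalty only by a constant that does not depend on w. A minimal sketch with `scipy.stats`, where λ = 0.5 and the grid of w values are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

lam = 0.5
w = np.linspace(-3.0, 3.0, 7)

# Gaussian prior with variance 1/(2*lam); scipy's `scale` is the standard deviation.
gauss_nll = -stats.norm.logpdf(w, loc=0.0, scale=np.sqrt(1.0 / (2.0 * lam)))
# Laplace prior with scale b = 1/lam.
laplace_nll = -stats.laplace.logpdf(w, loc=0.0, scale=1.0 / lam)

# Subtract the penalties; what remains should be constant in w.
gauss_offset = gauss_nll - lam * w**2
laplace_offset = laplace_nll - lam * np.abs(w)

print("Gaussian: -log p(w) - lam*w^2  =", np.round(gauss_offset, 6))
print("Laplace:  -log p(w) - lam*|w|  =", np.round(laplace_offset, 6))
```

Because the leftover terms are constant in w, they drop out of the arg-max, which is why MAP estimation under these priors reduces to the penalised losses above.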
Laplace has heavier tails than Gaussian → allows some weights to be large
while pushing most to exactly 0 (via the sharp peak at 0)
import scipy.stats as stats
# Gaussian prior (L2): symmetric bell, smooth
# Laplace prior (L1): symmetric tent, sharp peak at 0
# Both use scale=1 here (standard deviation for the Gaussian, scale b for the Laplace)
print("At w=0:")
print(f" Gaussian probability density: {stats.norm.pdf(0, 0, 1):.4f}")
print(f" Laplace probability density: {stats.laplace.pdf(0, 0, 1):.4f}")
print("The Laplace has a sharper peak at 0 → stronger push toward sparsity")
print("\nAt w=2 (large weight):")
print(f" Gaussian: {stats.norm.pdf(2, 0, 1):.6f}")
print(f" Laplace: {stats.laplace.pdf(2, 0, 1):.6f}")
print("The Laplace density is higher at w=2: its heavier tails tolerate a few large weights")
Choosing λ
from typing import Optional
import torch
import torch.nn as nn
def find_best_weight_decay(
    model_fn,
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    lambda_grid: Optional[list[float]] = None,
    n_epochs: int = 50,
) -> float:
    """Grid search for the weight decay with the lowest validation loss."""
    if lambda_grid is None:
        lambda_grid = [0.0, 1e-5, 1e-4, 1e-3, 1e-2]
    criterion = nn.BCEWithLogitsLoss()
    best_val_loss = float("inf")
    best_lambda = 0.0
    for lam in lambda_grid:
        model = model_fn()  # fresh model for every candidate
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=lam)
        for _ in range(n_epochs):  # full-batch training
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(X_train).squeeze(-1), y_train)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val).squeeze(-1), y_val).item()
        print(f"λ={lam:.0e}: val_loss={val_loss:.4f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_lambda = lam
    print(f"\nBest λ: {best_lambda:.0e}")
    return best_lambda
# Typical ranges to search: [0, 1e-5, 1e-4, 1e-3, 1e-2]
# For clinical models: 1e-4 is a safe default starting point
Interview Answer
"L2 regularisation adds λ·||W||² to the loss, pulling every weight toward zero in proportion to its magnitude; this is equivalent to a Gaussian prior on the weights. L1 regularisation adds λ·||W||₁, pulling weights toward zero with a constant force (gradient = sign(w)); it is equivalent to a Laplace prior and promotes sparsity, with many weights driven to zero. In deep learning, weight decay via AdamW's weight_decay parameter is the standard choice (prefer AdamW to Adam, because AdamW decouples the decay from the adaptive learning rate); L1 must be added to the loss manually and is rarely used alone in neural networks, since its gradient is discontinuous at 0 and can make training less stable. A typical weight_decay for AdamW on clinical models is 1e-4, tuned on a validation set. Probabilistic interpretation: regularisation is MAP estimation with a prior; L2 corresponds to a Gaussian prior (prefers many small weights), L1 to a Laplace prior (prefers a few large weights, the rest zero)."