
L1 and L2 Regularisation

How L1 and L2 weight penalties prevent overfitting, their probabilistic interpretation as priors, and when to use each.

Asma Hafeez Khan · May 22, 2026 · 6 min read
Tags: Deep Learning, Regularisation, L1, L2, Weight Decay, Interview

The Regularisation Idea

Without regularisation: the model optimises only the data loss.
With regularisation: optimise data loss + penalty on weight magnitude.

L_total = L_data + λ · R(W)

Too large λ: model focuses on the penalty → underfits
Too small λ: penalty ignored → overfits
Just right λ: constrains weight space → better generalisation
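
To make the trade-off concrete, here is a minimal sketch on a hypothetical one-parameter problem (not from the article itself): the data loss is (w - 3)², the penalty is w², and the minimiser of the total has the closed form w* = 3 / (1 + λ). The target value 3 and the function name `regularised_minimiser` are illustrative assumptions.

```python
def regularised_minimiser(target: float, lam: float) -> float:
    """Minimiser of (w - target)**2 + lam * w**2.

    Setting the derivative 2*(w - target) + 2*lam*w to zero gives
    w = target / (1 + lam).
    """
    return target / (1.0 + lam)


for lam in [0.0, 0.1, 1.0, 9.0, 99.0]:
    w_star = regularised_minimiser(3.0, lam)
    data_loss = (w_star - 3.0) ** 2  # how far the penalty pulled w off the data optimum
    print(f"lambda={lam:6.1f}  w*={w_star:.3f}  data loss={data_loss:.3f}")
```

At λ = 0 the data loss alone decides w. As λ grows, w* shrinks toward zero and the data loss rises, which is the underfitting end of the trade-off.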

L2 Regularisation (Weight Decay)

Python
import torch
import torch.nn as nn
import numpy as np

# L2 penalty: R(W) = Σ w_i²
# Gradient: ∂R/∂w = 2w
# Effect on update: W ← W - α(grad_L + λ·2W) = W(1 - 2αλ) - α·grad_L
# This "decays" weights toward zero each step, hence the name "weight decay"

# PyTorch implementation: weight_decay parameter in optimizer
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

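# Weight decay via the optimizer: weight_decay plays the role of λ
# (it multiplies W directly, so it corresponds to 2λ in the update formula above)
optimizer_l2 = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,  # small weight decay (PyTorch's AdamW default is 1e-2)
)

# What AdamW does vs Adam:
# Adam:  adds weight_decay * W to the gradient, so the decay is rescaled by the
#        adaptive per-parameter step size (coupled, no longer a uniform decay)
# AdamW: shrinks the weights directly, outside the adaptive update (decoupled)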

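# Manual L2 penalty (for understanding):
def compute_l2_penalty(model: nn.Module, lambda_l2: float = 1e-4) -> torch.Tensor:
    # Sum of squared entries over all parameters (weights and biases)
    l2_norm = sum(param.pow(2).sum() for param in model.parameters())
    return lambda_l2 * l2_norm

X = torch.randn(32, 20)
y = torch.randint(0, 2, (32,)).float()
criterion = nn.BCEWithLogitsLoss()

# Use an optimizer without weight_decay here; otherwise the weights would be
# regularised twice (once by the loss term, once by the optimizer)
optimizer_manual = torch.optim.SGD(model.parameters(), lr=1e-3)

optimizer_manual.zero_grad()
loss = criterion(model(X).squeeze(-1), y)  # squeeze only the output dimension
l2_reg = compute_l2_penalty(model, 1e-4)
total_loss = loss + l2_reg
total_loss.backward()
optimizer_manual.step()

print(f"Data loss: {loss.item():.4f}, L2 penalty: {l2_reg.item():.6f}")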
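
The coupled versus decoupled distinction can be checked by hand. The following is a simplified NumPy sketch of a single first Adam step, not PyTorch's actual implementation; the helper name `adam_update`, the weights, and the zero data gradient (chosen to isolate the decay term) are all assumptions made for illustration.

```python
import numpy as np


def adam_update(w, grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step (t = 1) starting from zero moment estimates."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad**2
    m_hat = m / (1 - beta1)  # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)


w = np.array([10.0, 0.1])      # one large and one small weight
data_grad = np.zeros_like(w)   # zero data gradient isolates the decay term
lr, wd = 1e-3, 0.1

# Coupled (Adam + L2): the decay term wd * w passes through the adaptive scaling.
w_coupled = adam_update(w, data_grad + wd * w, lr=lr)

# Decoupled (AdamW-style): shrink the weights directly, then take the Adam step.
w_decoupled = adam_update(w * (1 - lr * wd), data_grad, lr=lr)

print("coupled shrinkage:  ", w - w_coupled)
print("decoupled shrinkage:", w - w_decoupled)
```

In the coupled case both weights move by roughly `lr`, because the adaptive normalisation cancels the magnitude of the decay term. In the decoupled case each weight shrinks by `lr * wd * w`, in proportion to its size, which is the behaviour the weight decay formula describes.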

L1 Regularisation (Lasso)

Python
import torch
import torch.nn as nn

# L1 penalty: R(W) = Σ |w_i|
# Gradient: ∂R/∂w = sign(w)
# Effect: constant pull toward zero, regardless of magnitude
# Key difference: L1 promotes SPARSITY, so many weights are driven to (or very near) 0

def compute_l1_penalty(model: nn.Module, lambda_l1: float = 1e-4) -> torch.Tensor:
    l1_norm = torch.tensor(0.0)
    for param in model.parameters():
        l1_norm += param.abs().sum()
    return lambda_l1 * l1_norm

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# Training with L1
for step in range(100):
    X = torch.randn(32, 20)
    y = torch.randint(0, 2, (32,)).float()
    
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(), y)
    l1_reg = compute_l1_penalty(model, lambda_l1=1e-3)
    (loss + l1_reg).backward()
    optimizer.step()

# Check sparsity after training
# (with plain gradient steps the weights tend to hover near zero rather than
# land exactly on it, so count entries below a small threshold)
for name, param in model.named_parameters():
    sparsity = (param.abs() < 1e-4).float().mean().item()
    print(f"{name}: {sparsity:.1%} near-zero weights")

# L1 in practice for neural networks:
# PyTorch optimizers have no built-in L1 penalty, so add it to the loss manually
# Note: the L1 gradient is discontinuous at 0, one reason L1 is less common than L2 in deep learning
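
Because the gradient sign(w) never shrinks as a weight approaches zero, a plain gradient step can overshoot zero instead of stopping there. A common alternative, swapped in here for illustration and not used in the training loop above, is the proximal (soft-thresholding) step. The sketch below compares the two on a hand-picked weight vector; `step` stands in for learning rate times λ and is an assumed value.

```python
import numpy as np


def subgradient_step(w, step):
    """Plain L1 subgradient step: move each weight by `step` toward (and possibly past) zero."""
    return w - step * np.sign(w)


def soft_threshold(w, step):
    """Proximal step for the L1 penalty: shrink by `step`, but stop at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - step, 0.0)


w = np.array([0.5, -0.05, 0.02, -1.0])
step = 0.1  # stands in for learning_rate * lambda

print("subgradient:", subgradient_step(w, step))
print("proximal:   ", soft_threshold(w, step))
```

The two small weights flip sign under the subgradient step but become exactly zero under soft-thresholding, while the two large weights shrink by the same amount in both cases.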

L1 vs L2: Geometric Interpretation

L2 ball: sum of squared weights ≤ C (smooth and round)
L1 diamond: sum of |weights| ≤ C (has corners on the axes)

When the unconstrained minimum lies outside the constraint region, the optimal solution is where the lowest data loss contour first touches the region.

L2: typically touches the ball at a smooth non-axis point → most weights small but non-zero
L1: likely to touch the diamond at a CORNER (on an axis) → many weights exactly 0

This is why:
  L2 (weight decay): all weights shrink, but typically none is exactly zero
  L1 (Lasso): sparse solution, with most weights zero and a few non-zero weights
  Elastic Net: L1 + L2 combined (sparse, with the remaining weights kept small)
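
The corner effect can be seen by brute force in two dimensions. This sketch assumes a hypothetical quadratic data loss with its unconstrained minimum at (2.0, 0.5) and a constraint level C = 1; it walks around each boundary and picks the point with the lowest loss.

```python
import numpy as np


# Hypothetical 2-D data loss with its unconstrained minimum at (2.0, 0.5).
def data_loss(w1, w2):
    return (w1 - 2.0) ** 2 + (w2 - 0.5) ** 2


# Directions around the origin; each is mapped onto a constraint boundary (C = 1).
theta = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
cos_t, sin_t = np.cos(theta), np.sin(theta)

# L2 boundary: the unit circle w1^2 + w2^2 = 1.
l2_points = np.stack([cos_t, sin_t], axis=1)
# L1 boundary: the diamond |w1| + |w2| = 1 (rescale each direction onto it).
l1_points = l2_points / (np.abs(cos_t) + np.abs(sin_t))[:, None]

best_l2 = l2_points[np.argmin(data_loss(l2_points[:, 0], l2_points[:, 1]))]
best_l1 = l1_points[np.argmin(data_loss(l1_points[:, 0], l1_points[:, 1]))]

print("L2-constrained minimiser:", np.round(best_l2, 3))
print("L1-constrained minimiser:", np.round(best_l1, 3))
```

The L1 solution lands on the corner (1, 0), so the second weight is exactly zero, while the L2 solution keeps both coordinates non-zero (roughly 0.97 and 0.24).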

Python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Demonstrate sparsity effect on synthetic regression
np.random.seed(42)
n, d = 100, 50   # 100 samples, 50 features (more samples than features)
X = np.random.randn(n, d)
true_w = np.zeros(d)
true_w[[0, 5, 12]] = [1.5, -2.0, 0.8]   # only 3 true features
y = X @ true_w + 0.1 * np.random.randn(n)

estimators = [
    (Ridge(alpha=1.0), "L2 Ridge"),
    (Lasso(alpha=0.1), "L1 Lasso"),
    (ElasticNet(alpha=0.1, l1_ratio=0.5), "Elastic Net"),
]
for estimator, name in estimators:
    estimator.fit(X, y)
    w = estimator.coef_
    n_zero = (np.abs(w) < 0.01).sum()   # count coefficients that are (near) zero
    print(f"{name:15s}: {n_zero}/{d} weights near-zero, "
          f"true features recovered: {[i for i in [0,5,12] if abs(w[i]) > 0.1]}")

Probabilistic Interpretation

L2 regularisation = MAP estimation with Gaussian prior on weights:
  P(W) ∝ exp(-λ||W||²)  ↔  each w_i ~ N(0, σ²) with variance σ² = 1/(2λ)

  Maximising log P(data|W) + log P(W) is equivalent to minimising
  the negative log-likelihood (e.g. cross-entropy loss) + λ||W||²

L1 regularisation = MAP estimation with Laplace prior on weights:
  P(W) ∝ exp(-λ||W||₁)  ↔  each w_i ~ Laplace(0, b) with scale b = 1/λ

  Laplace has heavier tails than Gaussian → allows some weights to be large,
  while its sharp peak at 0 makes the MAP estimate set many weights to exactly 0
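
The correspondence between prior and penalty can be checked numerically. This sketch writes out the standard Gaussian and Laplace negative log-densities by hand and confirms that, once the normalising constants are dropped, they reduce to λw² and λ|w|. The value λ = 0.5 and the grid of weights are arbitrary choices for illustration.

```python
import numpy as np

lam = 0.5
w = np.linspace(-3.0, 3.0, 7)  # [-3, -2, -1, 0, 1, 2, 3]

# Gaussian prior with variance sigma^2 = 1 / (2 * lam).
sigma2 = 1.0 / (2.0 * lam)
neg_log_gauss = 0.5 * np.log(2.0 * np.pi * sigma2) + w**2 / (2.0 * sigma2)

# Laplace prior with scale b = 1 / lam.
b = 1.0 / lam
neg_log_laplace = np.log(2.0 * b) + np.abs(w) / b

# Subtracting the value at w = 0 (index 3) removes the normalising constants,
# which do not depend on w and so do not affect the MAP estimate.
l2_penalty = neg_log_gauss - neg_log_gauss[3]
l1_penalty = neg_log_laplace - neg_log_laplace[3]

print(np.allclose(l2_penalty, lam * w**2))       # Gaussian prior matches the L2 penalty
print(np.allclose(l1_penalty, lam * np.abs(w)))  # Laplace prior matches the L1 penalty
```

A larger λ therefore means a narrower prior in both cases: a smaller variance for the Gaussian and a smaller scale for the Laplace.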

Python
import numpy as np
import scipy.stats as stats

# Gaussian prior (L2): symmetric bell, smooth
# Laplace prior (L1): symmetric tent, sharp peak at 0

# Odd number of points so that the midpoint x[500] is w = 0
# Both densities use scale=1 (variance 1 for the Gaussian, 2 for the Laplace)
x = np.linspace(-3, 3, 1001)
gaussian = stats.norm.pdf(x, 0, 1)
laplace  = stats.laplace.pdf(x, 0, 1)

print("At w=0:")
print(f"  Gaussian probability density: {gaussian[500]:.4f}")
print(f"  Laplace probability density:  {laplace[500]:.4f}")
print("The Laplace has a sharper peak at 0 → stronger push toward sparsity")

print("\nAt w=2 (large weight):")
print(f"  Gaussian: {stats.norm.pdf(2, 0, 1):.6f}")
print(f"  Laplace:  {stats.laplace.pdf(2, 0, 1):.6f}")
print("The Laplace assigns more density to large weights (heavier tail)")

Choosing λ

Python
import torch
import torch.nn as nn
import numpy as np

def find_best_weight_decay(
    model_fn,
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    lambda_grid: list[float] | None = None,
    n_epochs: int = 50,
) -> float:
    """Grid search for optimal weight decay on validation set."""
    lambda_grid = lambda_grid or [0.0, 1e-5, 1e-4, 1e-3, 1e-2]
    criterion = nn.BCEWithLogitsLoss()
    
    best_val_loss = float("inf")
    best_lambda = 0.0
    
    for lam in lambda_grid:
        model = model_fn()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=lam)
        
        for _ in range(n_epochs):
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(X_train).squeeze(), y_train)
            loss.backward()
            optimizer.step()
        
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val).squeeze(), y_val).item()
        
        print(f"λ={lam:.0e}: val_loss={val_loss:.4f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_lambda = lam
    
    print(f"\nBest λ: {best_lambda:.0e}")
    return best_lambda

# Typical ranges to search: [0, 1e-5, 1e-4, 1e-3, 1e-2]
# For clinical models: 1e-4 is a reasonable starting point; confirm it on the validation set
# (for reference, PyTorch's AdamW default is weight_decay=1e-2)

Interview Answer

"L2 regularisation adds λ·||W||² to the loss, pulling all weights toward zero in proportion to their magnitude; it is equivalent to a Gaussian prior on the weights. L1 regularisation adds λ·||W||₁, pulling weights toward zero with a constant force (gradient = sign(w)); it is equivalent to a Laplace prior and promotes sparsity (many weights are driven to exactly 0). In deep learning, weight decay via AdamW's weight_decay parameter is the standard. Prefer AdamW over Adam with weight_decay: Adam folds the decay into the adaptively scaled gradient, whereas AdamW applies it to the weights directly (decoupled). L1 must be added manually to the loss and is rarely used alone in neural networks, partly because its gradient is discontinuous at 0. A reasonable starting weight_decay for AdamW is 1e-4 for clinical models; tune it on a validation set. Probabilistic interpretation: regularisation = MAP estimation with a prior; L2 = Gaussian prior (prefers many small weights), L1 = Laplace prior (prefers a few non-zero weights, the rest zero)."
