Deep Learning for AI Interviews · Lesson 49 of 56

L1 and L2 Regularization in Neural Networks

The Regularisation Idea

Without regularisation: the model optimises only the data loss.
With regularisation: optimise data loss + penalty on weight magnitude.

L_total = L_data + λ · R(W)

Too large λ: model focuses on the penalty → underfits
Too small λ: penalty ignored → overfits
Just right λ: constrains weight space → better generalisation
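
The trade-off above can be seen on a small ridge regression problem, where the penalised loss has a closed-form minimiser. This is only a sketch: the data sizes, noise level, λ values and the helper name `ridge_fit` are illustrative assumptions, and the data loss here is a sum of squared errors rather than a neural-network loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 20                          # illustrative sizes (assumption)
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [1.5, -2.0, 0.8]          # only three informative features
y = X @ true_w + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form minimiser of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
weight_norms, data_losses = [], []
for lam in lambdas:
    w = ridge_fit(X, y, lam)
    data_losses.append(float(np.sum((X @ w - y) ** 2)))
    weight_norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:6.1f}  data loss={data_losses[-1]:8.3f}  ||w||={weight_norms[-1]:.3f}")
```

As λ grows, the weight norm shrinks and the training data loss rises. Whether generalisation improves has to be checked on held-out data.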

L2 Regularisation (Weight Decay)

Python

import torch
import torch.nn as nn
import numpy as np

# L2 penalty: R(W) = Σ w_i²
# Gradient: ∂R/∂w = 2w
# Effect on update: W ← W - α(grad_L + λ·2W) = W(1 - 2αλ) - α·grad_L
# This "decays" weights toward zero each step → name: "weight decay"

# PyTorch implementation: weight_decay parameter in optimizer
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

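# Decoupled weight decay: AdamW shrinks each weight by lr * weight_decay * w per step.
# This matches an L2 penalty exactly only for plain SGD, not for adaptive optimizers.
optimizer_l2 = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,  # small decay coefficient (PyTorch's AdamW default is 1e-2)
)

# Adam vs AdamW in PyTorch:
# Adam(weight_decay=λ):  adds λ·w to the gradient, so the penalty is rescaled by the
#                        adaptive per-parameter step size (coupled)
# AdamW(weight_decay=λ): multiplies the weights by (1 - lr·λ) directly, independent of
#                        the adaptive scaling (decoupled; usually preferred with Adam)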

# Manual L2 penalty (for understanding):
def compute_l2_penalty(model: nn.Module, lambda_l2: float = 1e-4) -> torch.Tensor:
    # Sum of squared entries over all parameters (biases included, for simplicity)
    l2_norm = sum(param.pow(2).sum() for param in model.parameters())
    return lambda_l2 * l2_norm

X = torch.randn(32, 20)
y = torch.randint(0, 2, (32,)).float()
criterion = nn.BCEWithLogitsLoss()

# Use an optimizer WITHOUT weight_decay here: combining the explicit penalty
# with optimizer_l2 above would regularise the weights twice.
optimizer_manual = torch.optim.SGD(model.parameters(), lr=1e-2)

optimizer_manual.zero_grad()
loss = criterion(model(X).squeeze(-1), y)
l2_reg = compute_l2_penalty(model, 1e-4)
total_loss = loss + l2_reg
total_loss.backward()
optimizer_manual.step()

print(f"Data loss: {loss.item():.4f}, L2 penalty: {l2_reg.item():.6f}")
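
The two claims in the comments above (weight decay equals an L2 penalty for plain SGD, and Adam and AdamW treat `weight_decay` differently) can be checked with single optimizer steps. This is a minimal sketch. The learning rates, decay values and the helper `one_step` are assumptions chosen for illustration. PyTorch's SGD adds `weight_decay * w` to the gradient, so the matching explicit penalty is `0.5 * wd * ||w||²`.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(32, 20)
y = torch.randint(0, 2, (32,)).float()
criterion = nn.BCEWithLogitsLoss()
lr, wd = 0.1, 1e-2                      # illustrative values (assumption)

# Part 1: plain SGD. Optimizer weight decay vs explicit penalty in the loss.
model_a = nn.Linear(20, 1)
model_b = copy.deepcopy(model_a)

opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
opt_a.zero_grad()
criterion(model_a(X).squeeze(-1), y).backward()
opt_a.step()

opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = sum(p.pow(2).sum() for p in model_b.parameters())
(criterion(model_b(X).squeeze(-1), y) + 0.5 * wd * penalty).backward()
opt_b.step()

max_diff = max(
    (pa - pb).abs().max().item()
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(f"SGD: max parameter difference = {max_diff:.2e}")

# Part 2: Adam vs AdamW on one weight, with the data gradient forced to zero
# so that only the weight-decay term acts.
def one_step(opt_cls, w0=2.0):
    w = nn.Parameter(torch.tensor([w0]))
    opt = opt_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.item()

w_adam = one_step(torch.optim.Adam)     # decay term is normalised by the adaptive step
w_adamw = one_step(torch.optim.AdamW)   # weight is scaled by (1 - lr * weight_decay)
print(f"Adam:  2.0 -> {w_adam:.4f}")
print(f"AdamW: 2.0 -> {w_adamw:.4f}")
```

In the second part Adam moves the weight by roughly the full learning rate whatever the decay coefficient is, because the decay term is the only gradient and gets normalised. AdamW shrinks the weight by the factor 1 - lr·λ.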

L1 Regularisation (Lasso)

Python

import torch
import torch.nn as nn

# L1 penalty: R(W) = Σ |w_i|
# Gradient: ∂R/∂w = sign(w)
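# Effect: constant-magnitude pull toward zero, regardless of weight size
# Key difference: L1 promotes SPARSITY. Exact zeros need a proximal or coordinate-descent
# solver; with plain gradient steps the weights end up hovering near zero instead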

def compute_l1_penalty(model: nn.Module, lambda_l1: float = 1e-4) -> torch.Tensor:
    l1_norm = torch.tensor(0.0)
    for param in model.parameters():
        l1_norm += param.abs().sum()
    return lambda_l1 * l1_norm

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# Training with L1
for step in range(100):
    X = torch.randn(32, 20)
    y = torch.randint(0, 2, (32,)).float()
    
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(), y)
    l1_reg = compute_l1_penalty(model, lambda_l1=1e-3)
    (loss + l1_reg).backward()
    optimizer.step()

# Check sparsity after training
for name, param in model.named_parameters():
    sparsity = (param.abs() < 1e-4).float().mean().item()
    print(f"{name}: {sparsity:.1%} near-zero weights")

# L1 in practice for neural networks:
# PyTorch has no built-in L1 weight decay in optimizers → add manually to loss
# Note: |w| is not differentiable at 0 (its gradient jumps from -1 to +1), so weights
# tend to oscillate around zero → L1 is less common than L2 for deep learning
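
One way to get exact zeros in a network is to take the gradient step on the data loss only and then apply a soft-thresholding (proximal) step to the weights. The sketch below assumes a single linear layer, plain SGD, synthetic data where only the first three features matter, and a hypothetical helper `soft_threshold_`. It is not a PyTorch built-in, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def soft_threshold_(param: torch.Tensor, threshold: float) -> None:
    """In-place proximal step for the L1 penalty: shrink magnitudes, clip at 0."""
    with torch.no_grad():
        param.copy_(param.sign() * (param.abs() - threshold).clamp(min=0.0))

# Synthetic binary labels that depend on the first three features only
X = torch.randn(512, 20)
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).float()

model = nn.Linear(20, 1)
lr, lambda_l1 = 0.1, 0.05               # illustrative values (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    criterion(model(X).squeeze(-1), y).backward()   # data loss only, no L1 term
    optimizer.step()
    soft_threshold_(model.weight, lr * lambda_l1)   # L1 handled by the proximal step

exact_zeros = (model.weight == 0).sum().item()
print(f"{exact_zeros}/{model.weight.numel()} weights are exactly zero")
```

The bias is left unpenalised here, which is a common but not universal choice.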

L1 vs L2: Geometric Interpretation

L2 sphere: sum of squared weights ≤ C (smooth, round ball)
L1 diamond: sum of |weights| ≤ C (has corners at axes)

The optimal solution occurs where the data loss contour first touches the constraint.

L2: typically touches the sphere at a smooth non-axis point → most weights small but non-zero
L1: likely to touch the diamond at a CORNER (on an axis) → many weights exactly 0

This is why:
  L2 (weight decay): all weights small, none exactly zero
  L1 (Lasso): sparse solution — most weights zero, few large weights
  Elastic Net: L1 + L2 combined (both sparse + small)

Python

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Demonstrate sparsity effect on synthetic regression
np.random.seed(42)
n, d = 100, 50   # 100 samples, 50 features, only 3 of which are informative
X = np.random.randn(n, d)
true_w = np.zeros(d)
true_w[[0, 5, 12]] = [1.5, -2.0, 0.8]   # only 3 true features
y = X @ true_w + 0.1 * np.random.randn(n)

# Fit each estimator and count how many coefficients are (near) zero
for estimator, name in [(Ridge(alpha=1.0), "L2 Ridge"), (Lasso(alpha=0.1), "L1 Lasso"),
                        (ElasticNet(alpha=0.1, l1_ratio=0.5), "Elastic Net")]:
    estimator.fit(X, y)
    w = estimator.coef_
    n_zero = (np.abs(w) < 0.01).sum()
    print(f"{name:15s}: {n_zero}/{d} weights near-zero, "
          f"true features recovered: {[i for i in [0,5,12] if abs(w[i]) > 0.1]}")
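
The corner argument from the geometric picture also shows up in one dimension, where both penalised problems have closed-form solutions. The sketch below assumes a quadratic data loss 0.5·(w - z)² around an unregularised optimum z. The function names and the values of z and λ are illustrative.

```python
import numpy as np

def l2_shrink(z, lam):
    # argmin_w 0.5 * (w - z)**2 + lam * w**2  ->  proportional shrinkage
    return z / (1.0 + 2.0 * lam)

def l1_shrink(z, lam):
    # argmin_w 0.5 * (w - z)**2 + lam * |w|   ->  soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.05, 0.4, 1.5])   # unregularised optima (assumption)
lam = 0.5

w_l2 = l2_shrink(z, lam)
w_l1 = l1_shrink(z, lam)
print("L2:", w_l2)
print("L1:", w_l1)
print("exact zeros  L2:", int((w_l2 == 0).sum()), " L1:", int((w_l1 == 0).sum()))
```

L2 rescales every value and never reaches zero unless z is zero. L1 subtracts a fixed amount and sets everything with |z| ≤ λ to exactly zero.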

Probabilistic Interpretation

L2 regularisation = MAP estimation with Gaussian prior on weights:
  P(W) ∝ exp(-λ||W||²)  ↔  W ~ N(0, σ²) with variance σ² = 1/(2λ)
  
  Maximising log P(data|W) + log P(W) is equivalent to minimising
  negative log-likelihood (e.g. cross-entropy loss) + λ||W||²

L1 regularisation = MAP estimation with Laplace prior on weights:
  P(W) ∝ exp(-λ||W||₁)  ↔  W ~ Laplace(0, b) with scale b = 1/λ
  
  Laplace has heavier tails than Gaussian → allows some weights to be large
  while pushing most to exactly 0 (via the pointy peak at 0)
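
The correspondence between priors and penalties stated above follows from taking the negative log of the posterior. A short derivation, assuming independent, zero-mean priors on each weight and writing NLL for the negative log-likelihood:

```latex
\begin{aligned}
W_{\mathrm{MAP}}
  &= \arg\max_W \; \log P(\mathrm{data} \mid W) + \log P(W) \\
  &= \arg\min_W \; \mathrm{NLL}(W) - \log P(W) \\[4pt]
\text{Gaussian: } P(w_i) \propto e^{-w_i^2 / (2\sigma^2)}
  &\;\Rightarrow\; -\log P(W) = \tfrac{1}{2\sigma^2} \lVert W \rVert_2^2 + \text{const}
  \;\Rightarrow\; \lambda = \tfrac{1}{2\sigma^2} \\[4pt]
\text{Laplace: } P(w_i) \propto e^{-\lvert w_i \rvert / b}
  &\;\Rightarrow\; -\log P(W) = \tfrac{1}{b} \lVert W \rVert_1 + \text{const}
  \;\Rightarrow\; \lambda = \tfrac{1}{b}
\end{aligned}
```

A tighter prior (smaller σ² or smaller b) therefore corresponds to a larger λ.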

Python

import numpy as np
import scipy.stats as stats

# Gaussian prior (L2): symmetric bell, smooth at 0
# Laplace prior (L1): symmetric tent, sharp peak at 0
# Both use scale 1 here (Gaussian std = 1, Laplace b = 1), so their variances differ (1 vs 2)

x = np.linspace(-3, 3, 1001)   # odd number of points, so x[500] is w = 0
gaussian = stats.norm.pdf(x, 0, 1)
laplace  = stats.laplace.pdf(x, 0, 1)

print("At w=0:")
print(f"  Gaussian probability density: {gaussian[500]:.4f}")
print(f"  Laplace probability density:  {laplace[500]:.4f}")
print("The Laplace has a sharper peak at 0 → stronger push toward sparsity")

print("\nAt w=2 (large weight):")
print(f"  Gaussian: {stats.norm.pdf(2, 0, 1):.6f}")
print(f"  Laplace:  {stats.laplace.pdf(2, 0, 1):.6f}")
print("The Laplace has the heavier tail → a few large weights are penalised less")

Choosing λ

Python

import torch
import torch.nn as nn
import numpy as np

def find_best_weight_decay(
    model_fn,
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    lambda_grid: list[float] = None,
    n_epochs: int = 50,
) -> float:
    """Grid search for optimal weight decay on validation set."""
    lambda_grid = lambda_grid or [0.0, 1e-5, 1e-4, 1e-3, 1e-2]
    criterion = nn.BCEWithLogitsLoss()
    
    best_val_loss = float("inf")
    best_lambda = 0.0
    
    for lam in lambda_grid:
        model = model_fn()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=lam)
        
        for _ in range(n_epochs):
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(X_train).squeeze(), y_train)
            loss.backward()
            optimizer.step()
        
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val).squeeze(), y_val).item()
        
        print(f"λ={lam:.0e}: val_loss={val_loss:.4f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_lambda = lam
    
    print(f"\nBest λ: {best_lambda:.0e}")
    return best_lambda

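# Typical values to search (log-spaced): [0, 1e-5, 1e-4, 1e-3, 1e-2]
# For clinical models: 1e-4 is a reasonable starting point, but confirm it on the validation set
# (for reference, PyTorch's AdamW default is weight_decay=1e-2)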

Interview Answer

"L2 regularisation adds λ·||W||² to the loss, pulling every weight toward zero in proportion to its magnitude; it is equivalent to MAP estimation with a Gaussian prior on the weights. L1 regularisation adds λ·||W||₁, pulling weights toward zero with a constant force (gradient = sign(w)); it corresponds to a Laplace prior and promotes sparsity, with many weights driven to zero. In deep learning, the standard choice is weight decay via AdamW's weight_decay parameter: AdamW applies the decay directly to the weights (decoupled), whereas Adam's weight_decay folds the penalty into the gradient, where the adaptive step size distorts it. For plain SGD, weight decay and an L2 penalty coincide up to a rescaling of λ. L1 has to be added to the loss manually and is rarely used alone in neural networks: the penalty is not differentiable at 0, so gradient-based training leaves weights hovering near zero rather than exactly at zero. A typical AdamW weight_decay for clinical models is around 1e-4, tuned on a validation set. Probabilistic interpretation: regularisation = MAP estimation with a prior; L2 = Gaussian prior (prefers many small weights), L1 = Laplace prior (prefers a few large weights, rest zero)."
