Adam Optimizer

Adam in Plain Terms

Adam = Adaptive Moment Estimation

Two key ideas combined:
  1. Momentum: smooth out noisy gradients by averaging over recent history
  2. Adaptive learning rates: each weight gets its own effective lr,
     scaled down for weights with large or frequent gradients, up for weights with small or rare ones

This makes Adam:
  ✓ Fast to converge (momentum helps)
  ✓ Often works well with the default lr of 1e-3 (adaptive scaling), though tuning can still help
  ✓ Good for sparse gradients (NLP, embeddings)
  ✗ May generalise slightly worse than SGD+momentum on vision tasks

The Algorithm

Given: gradient g_t at step t
Hyperparameters: α (lr), β₁=0.9, β₂=0.999, ε=1e-8

1. First moment (momentum / mean of gradients):
   m_t = β₁ · m_{t-1} + (1 - β₁) · g_t

2. Second moment (uncentred variance of gradients):
   v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²

3. Bias correction (compensates for zero initialisation):
   m̂_t = m_t / (1 - β₁ᵗ)
   v̂_t = v_t / (1 - β₂ᵗ)

4. Weight update:
   θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)

Intuition of the denominator:
  If gradient has been large and consistent → v̂_t is large → small effective lr
  If gradient has been small/rare → v̂_t is small → large effective lr
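
The four steps and the intuition above can be traced by hand on a single weight. The following is a minimal sketch, not a reference implementation: `adam_trace` is a hypothetical helper, and the constant gradients of 100.0 and 0.01 are values assumed purely for illustration.

```python
import math

def adam_trace(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam over a list of gradients; return (m_t, m_hat_t, update_t) per step."""
    m = v = 0.0  # zero initialisation, as in the algorithm
    trace = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # step 1: first moment
        v = beta2 * v + (1 - beta2) * g * g      # step 2: second moment
        m_hat = m / (1 - beta1 ** t)             # step 3: bias correction
        v_hat = v / (1 - beta2 ** t)
        trace.append((m, m_hat, lr * m_hat / (math.sqrt(v_hat) + eps)))  # step 4
    return trace

# Same direction, very different gradient scales (illustrative values)
large = adam_trace([100.0] * 5)
small = adam_trace([0.01] * 5)

for t, ((m, m_hat, step_l), (_, _, step_s)) in enumerate(zip(large, small), start=1):
    print(f"t={t}: m_t={m:8.3f}  m_hat_t={m_hat:8.3f}  "
          f"step(large g)={step_l:.6f}  step(small g)={step_s:.6f}")
```

Two things should be visible in the printout. The raw m_t starts at only a tenth of the true gradient because of the zero initialisation, while m̂_t recovers the full value from the first step. And the update size comes out at roughly lr for both weights, even though their gradients differ by a factor of 10,000: the denominator normalises away the gradient scale.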

Adam from Scratch

Python

import numpy as np

class Adam:
    def __init__(
        self,
        lr: float = 1e-3,
        beta1: float = 0.9,
        beta2: float = 0.999,
        eps: float = 1e-8,
        weight_decay: float = 0.0,
    ):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0
        self.m = {}  # first moments
        self.v = {}  # second moments
    
    def step(self, params: dict[str, np.ndarray], grads: dict[str, np.ndarray]) -> None:
        self.t += 1
        
        for name, param in params.items():
            g = grads[name]
            
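            # Optional L2 regularisation, coupled into the gradient (as in torch.optim.Adam).
            # Note: this is NOT decoupled weight decay; AdamW applies the decay to the weights directly.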
            if self.weight_decay > 0:
                g = g + self.weight_decay * param
            
            # Initialise moments
            if name not in self.m:
                self.m[name] = np.zeros_like(param)
                self.v[name] = np.zeros_like(param)
            
            # Update moments
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g ** 2
            
            # Bias correction
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            
            # Update weights
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

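# Sanity check: minimise a quadratic whose minimum is at W = 0 and watch the loss fall
w = {"W": np.array([3.0, -2.0, 1.5])}  # start far from minimum
# Each step moves a weight by roughly lr, so 200 steps at lr=0.01 will not fully converge
optimizer = Adam(lr=0.01)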

for step in range(200):
    # Loss = ||W||² → gradient = 2W
    g = {"W": 2.0 * w["W"]}
    optimizer.step(w, g)
    if step % 50 == 0:
        print(f"Step {step:3d}: W = {w['W'].round(4)}, loss = {(w['W']**2).sum():.6f}")
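
One way to check a from-scratch implementation is to run it next to `torch.optim.Adam` on the same problem. The sketch below assumes PyTorch is installed and re-implements the update for a single scalar so that it is self-contained; `adam_scalar` is a hypothetical helper, not part of any library.

```python
import math
import torch

def adam_scalar(theta, grad_fn, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam following the four steps of the algorithm section."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Reference run with torch.optim.Adam: loss = w^2, so gradient = 2w
w = torch.tensor([3.0], dtype=torch.float64, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.01, betas=(0.9, 0.999), eps=1e-8)
for _ in range(50):
    opt.zero_grad()
    (w ** 2).sum().backward()
    opt.step()

ours = adam_scalar(3.0, lambda x: 2.0 * x, steps=50)
print(f"from scratch: {ours:.10f}   torch: {w.item():.10f}")
```

The two numbers should agree to within floating-point rounding. float64 is used on the PyTorch side so that precision differences do not mask a real discrepancy.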

PyTorch Adam and AdamW

Python

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Standard Adam
optimizer_adam = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),   # β₁, β₂ (almost never changed)
    eps=1e-8,
    weight_decay=0,        # L2 penalty added to the gradient (coupled with the adaptive scaling)
)

# AdamW: decoupled weight decay (generally preferred when using weight decay)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,    # applied directly to weights, not gradients
)

# Difference: Adam adds weight_decay * w to the gradient, so the decay is rescaled by the adaptive lr
#             AdamW shrinks the weights directly (decoupled); prefer it whenever weight_decay > 0

# Training step — same for both
X = torch.randn(64, 20)
y = torch.randint(0, 2, (64,)).float()
criterion = nn.BCEWithLogitsLoss()

optimizer_adamw.zero_grad()
loss = criterion(model(X).squeeze(), y)
loss.backward()
optimizer_adamw.step()
print(f"Loss: {loss.item():.4f}")
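
The difference between coupled and decoupled weight decay can be seen on one scalar weight. This is a simplified sketch of a single first step (t = 1) in plain Python, not PyTorch's actual code: `adam_first_step` is a hypothetical helper, and the loss gradient is assumed to be zero so that only the decay term acts.

```python
import math

def adam_first_step(theta, grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8,
                    weight_decay=0.0, decoupled=False):
    """One Adam step from zero-initialised moments (t = 1) for a scalar weight."""
    if decoupled:
        theta = theta * (1 - lr * weight_decay)  # AdamW-style: shrink the weight directly
    else:
        grad = grad + weight_decay * theta       # L2-style: fold the penalty into the gradient
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)                      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

results = {}
for wd in (0.1, 0.001):
    coupled = adam_first_step(1.0, 0.0, weight_decay=wd)
    decoupled = adam_first_step(1.0, 0.0, weight_decay=wd, decoupled=True)
    results[wd] = (coupled, decoupled)
    print(f"weight_decay={wd}: coupled -> {coupled:.6f}   decoupled -> {decoupled:.6f}")
```

With a zero loss gradient, the coupled variant moves the weight by about lr whether weight_decay is 0.1 or 0.001, because the penalty is divided by its own magnitude. The decoupled variant shrinks the weight in proportion to lr · weight_decay, which is what the hyperparameter is normally expected to control.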

Per-Parameter Learning Rates

Python

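# Different lr for different parameter groups (common for fine-tuning)
import torch
import torchvision.models as models

# weights=None builds an untrained network (the `pretrained` argument is deprecated).
# Pass weights=models.ResNet18_Weights.DEFAULT to load pretrained weights for real fine-tuning.
backbone = models.resnet18(weights=None)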

optimizer = torch.optim.AdamW([
    {"params": backbone.layer1.parameters(), "lr": 1e-5},   # early layers: tiny lr
    {"params": backbone.layer2.parameters(), "lr": 1e-5},
    {"params": backbone.layer3.parameters(), "lr": 1e-4},   # deeper layers: moderate lr
    {"params": backbone.layer4.parameters(), "lr": 1e-4},
    {"params": backbone.fc.parameters(),     "lr": 1e-3},   # head: normal lr
], lr=1e-3, weight_decay=1e-4)

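# This is "discriminative learning rates", common in transfer learning:
# earlier layers hold more generic features and need less updating.
# Note: the stem (backbone.conv1, backbone.bn1) is in no group, so this optimizer never updates it.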
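
When building parameter groups by hand it is easy to leave a layer out. The sketch below, assuming PyTorch is installed, uses a small hypothetical three-part model (not the ResNet above) to show how each group resolves its lr and how to list parameters that no group covers.

```python
import torch
import torch.nn as nn

# Hypothetical three-part model standing in for a pretrained network
model = nn.ModuleDict({
    "stem": nn.Linear(8, 16),
    "body": nn.Linear(16, 16),
    "head": nn.Linear(16, 2),
})

optimizer = torch.optim.AdamW([
    {"params": model["body"].parameters(), "lr": 1e-4},
    {"params": model["head"].parameters()},          # no "lr" key: falls back to the default
], lr=1e-3, weight_decay=1e-4)

# Each group carries its own resolved hyperparameters
group_lrs = [group["lr"] for group in optimizer.param_groups]
print("lr per group:", group_lrs)

# Parameters that appear in no group are never updated by this optimizer
covered = {id(p) for group in optimizer.param_groups for p in group["params"]}
missing = [name for name, p in model.named_parameters() if id(p) not in covered]
print("not optimised:", missing)
```

Here the stem is deliberately omitted, so its weight and bias show up in the "not optimised" list. Whether that is intended (a frozen layer) or an oversight is a design decision worth making explicitly.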

Adam vs SGD

Scenario                        | Winner  | Why
--------------------------------|---------|------------------------------------------
NLP / Transformers              | Adam    | Sparse gradients; embedding layers benefit from adaptive lr
Computer vision (from scratch)  | Toss-up | Adam converges faster, SGD may generalise better
Transfer learning / fine-tuning | Adam    | Few steps, adaptive lr helps
Small datasets                  | SGD+mom | Adam's adaptivity can overfit; SGD with momentum is more conservative
Hyperparameter sensitivity      | Adam    | Usually less sensitive to lr than SGD
Limited tuning budget           | Adam    | Less tuning needed

Commonly reported pattern (a tendency, not a guarantee):
  Adam usually reaches a good solution faster than SGD.
  SGD+momentum often reaches a slightly better solution given enough time and tuning.
  For most practitioners: start with AdamW; try SGD+momentum if you need that last 0.5% or so.

Monitoring Adam Internals

Python

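def inspect_adam_state(optimizer: torch.optim.Optimizer, model: nn.Module) -> None:
    """Print a per-parameter summary of Adam's moment estimates."""
    # Map each parameter to its group so the right lr, betas and eps are used
    group_of = {id(p): g for g in optimizer.param_groups for p in g["params"]}
    for name, param in model.named_parameters():
        state = optimizer.state.get(param)
        if not state:
            continue  # not managed by this optimizer, or no step taken yet
        group = group_of[id(param)]

        step = int(state["step"])                          # a tensor in recent PyTorch versions
        exp_avg = state["exp_avg"].abs().mean().item()     # mean |m_t|
        exp_avg_sq = state["exp_avg_sq"].mean().item()     # mean v_t
        # Rough layer-level summary: lr / (sqrt(bias-corrected mean v_t) + eps)
        v_hat = exp_avg_sq / (1 - group["betas"][1] ** step)
        effective_lr = group["lr"] / (v_hat ** 0.5 + group["eps"])

        print(f"{name:30s}: step={step}, |m|={exp_avg:.6f}, v={exp_avg_sq:.8f}, eff_lr≈{effective_lr:.6f}")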

# Call after at least one optimizer.step()
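
To see what the optimizer actually stores, the state can be read directly after one step. This is a minimal sketch assuming PyTorch is installed; the tiny `nn.Linear` model and the random batch are placeholders chosen only to make it runnable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# optimizer.state stays empty until the first step
state_before = len(optimizer.state)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

state = optimizer.state[model.weight]
print("entries before first step:", state_before)
print("keys:", sorted(state.keys()))
print("step:", int(state["step"]))
print("exp_avg shape:", tuple(state["exp_avg"].shape))

# After one step from zero initialisation, m_1 = (1 - beta1) * g_1
matches = torch.allclose(state["exp_avg"], 0.1 * model.weight.grad)
print("exp_avg equals 0.1 * grad:", matches)
```

The `exp_avg` and `exp_avg_sq` tensors are the raw m_t and v_t, stored without bias correction. The correction is applied inside the update itself, so any monitoring code that wants m̂_t or v̂_t has to divide by the correction terms.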

Interview Answer

"Adam maintains two running statistics per weight: the first moment m_t (exponential moving average of gradients — like momentum) and the second moment v_t (EMA of squared gradients — like a per-weight variance estimate). The update is θ ← θ - α · m̂_t / (√v̂_t + ε). The denominator scales down the learning rate for weights that receive large or consistent gradients, and scales up for rarely-updated weights — this is adaptive learning rates. The bias-correction terms (dividing by 1 - β^t) compensate for zero initialisation in early steps. AdamW is the correct variant: it applies weight decay directly to weights rather than coupling it into the gradient, which matters when using adaptive learning rates. In practice: AdamW is the default for transformers and NLP; SGD with momentum sometimes edges out AdamW on vision tasks given sufficient training time."
