Adam Optimizer
How Adam combines momentum and adaptive learning rates — the math behind m_t, v_t, bias correction, and when to use Adam vs SGD.
Adam in Plain Terms
Adam = Adaptive Moment Estimation
Two key ideas combined:
1. Momentum: smooth out noisy gradients by averaging over recent history
2. Adaptive learning rates: each weight gets its own effective lr,
   scaled down for weights with large recent gradients, up for weights with small or infrequent ones
This makes Adam:
✓ Fast to converge (momentum helps)
✓ Works reasonably well at the default lr of 1e-3 with little tuning (adaptive scaling)
✓ Good for sparse gradients (NLP, embeddings)
✗ May generalise slightly worse than SGD+momentum on vision tasks
The Algorithm
Given: gradient g_t at step t, with m_0 = v_0 = 0
Hyperparameters: α (lr), β₁=0.9, β₂=0.999, ε=1e-8
1. First moment (momentum / mean of gradients):
   m_t = β₁ · m_{t-1} + (1 - β₁) · g_t
2. Second moment (uncentred variance of gradients; g_t² is element-wise):
   v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²
3. Bias correction (compensates for zero initialisation):
   m̂_t = m_t / (1 - β₁ᵗ)
   v̂_t = v_t / (1 - β₂ᵗ)
4. Weight update:
   θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
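As a rough numerical check of steps 1 to 4, the sketch below runs one update by hand for a single scalar weight. The function name `adam_step`, the starting weight of 1.0, and the gradient of 0.5 are illustrative assumptions, not part of the original text.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns (theta, m, v, m_hat, v_hat)."""
    m = beta1 * m + (1 - beta1) * g        # step 1: first moment
    v = beta2 * v + (1 - beta2) * g * g    # step 2: second moment
    m_hat = m / (1 - beta1 ** t)           # step 3: bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # step 4: update
    return theta, m, v, m_hat, v_hat

# Hypothetical starting point: weight 1.0, zero moments, gradient 0.5
theta, m, v = 1.0, 0.0, 0.0
g = 0.5
theta, m, v, m_hat, v_hat = adam_step(theta, g, m, v, t=1)

print(f"raw m_1 = {m:.4f}, corrected m_hat_1 = {m_hat:.4f}")
print(f"raw v_1 = {v:.6f}, corrected v_hat_1 = {v_hat:.6f}")
print(f"size of first update = {1.0 - theta:.6f}")
```

At t = 1 the raw first moment is 0.05, a tenth of the gradient, while the corrected m̂_1 recovers 0.5. Likewise v̂_1 equals g², so the first update has magnitude close to α whatever the scale of the gradient.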
Intuition of the denominator:
   If gradients have been large in magnitude → v̂_t is large → small effective lr
   If gradients have been small or rare → v̂_t is small → large effective lr
Adam from Scratch
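Before the full implementation, the denominator intuition can be checked with a few lines of plain Python. This is a minimal sketch: the helper `effective_lr` and the gradient histories are made up for illustration, and it tracks only the second moment of a single scalar weight.

```python
import math

def effective_lr(grads, lr=1e-3, beta2=0.999, eps=1e-8):
    """Return lr / (sqrt(v_hat) + eps) after seeing the gradient history `grads`."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** len(grads))  # bias-corrected second moment
    return lr / (math.sqrt(v_hat) + eps)

# Hypothetical gradient histories for four different weights
large = effective_lr([10.0] * 100)   # consistently large gradients
small = effective_lr([0.01] * 100)   # consistently small gradients
dense = effective_lr([1.0] * 100)    # gradient of 1.0 at every step
rare = effective_lr([1.0 if i % 10 == 0 else 0.0 for i in range(100)])  # 1.0 every 10th step

print(f"large gradients: effective lr = {large:.6f}")
print(f"small gradients: effective lr = {small:.6f}")
print(f"dense updates:   effective lr = {dense:.6f}")
print(f"rare updates:    effective lr = {rare:.6f}")
```

With a constant gradient g, v̂ equals g², so the effective lr is about α / |g|: the weight with gradients of 10 gets about 1e-4 and the weight with gradients of 0.01 gets about 0.1. The rarely updated weight ends up with a larger effective lr than the densely updated one even though its non-zero gradients have the same size.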
import numpy as np

class Adam:
    def __init__(
        self,
        lr: float = 1e-3,
        beta1: float = 0.9,
        beta2: float = 0.999,
        eps: float = 1e-8,
        weight_decay: float = 0.0,
    ):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0
        self.m = {}  # first moments
        self.v = {}  # second moments

    def step(self, params: dict[str, np.ndarray], grads: dict[str, np.ndarray]) -> None:
        """Update every array in params in place using the matching gradient."""
        self.t += 1
        for name, param in params.items():
            g = grads[name]
            # Optional L2 regularisation folded into the gradient
            # (coupled weight decay, as in plain Adam; not the decoupled AdamW form)
            if self.weight_decay > 0:
                g = g + self.weight_decay * param
            # Initialise moments
            if name not in self.m:
                self.m[name] = np.zeros_like(param)
                self.v[name] = np.zeros_like(param)
            # Update moments
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g ** 2
            # Bias correction
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            # Update weights
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Sanity check: minimise a simple quadratic whose minimum is at W = 0
w = {"W": np.array([3.0, -2.0, 1.5])}  # start far from minimum
optimizer = Adam(lr=0.01)
# Each coordinate moves by at most roughly lr per step here, so the one
# starting at 3.0 is still short of 0 after 200 steps
for step in range(200):
    # Loss = ||W||² → gradient = 2W
    g = {"W": 2.0 * w["W"]}
    optimizer.step(w, g)
    if step % 50 == 0:
        print(f"Step {step:3d}: W = {w['W'].round(4)}, loss = {(w['W']**2).sum():.6f}")
PyTorch Adam and AdamW
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
# Standard Adam
optimizer_adam = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),  # β₁, β₂ (rarely changed)
    eps=1e-8,
    weight_decay=0,  # L2 penalty added to the gradient (coupled)
)
# AdamW: decoupled weight decay (usually preferred when weight decay is used)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,  # applied directly to weights, not gradients
)
# Difference: Adam's weight_decay passes through the adaptive scaling, so the
# effective penalty varies from weight to weight. AdamW's decay is decoupled
# from that scaling; prefer AdamW whenever weight_decay > 0.
# Training step (same for both)
X = torch.randn(64, 20)
y = torch.randint(0, 2, (64,)).float()
criterion = nn.BCEWithLogitsLoss()
optimizer_adamw.zero_grad()
loss = criterion(model(X).squeeze(1), y)  # (64, 1) → (64,)
loss.backward()
optimizer_adamw.step()
print(f"Loss: {loss.item():.4f}")
Per-Parameter Learning Rates
# Different lr for different parameter groups (common for fine-tuning)
import torchvision.models as models

# weights=None avoids a download; load pretrained weights when actually fine-tuning
backbone = models.resnet18(weights=None)
optimizer = torch.optim.AdamW([
    {"params": backbone.layer1.parameters(), "lr": 1e-5},  # early layers: tiny lr
    {"params": backbone.layer2.parameters(), "lr": 1e-5},
    {"params": backbone.layer3.parameters(), "lr": 1e-4},  # deeper layers: moderate lr
    {"params": backbone.layer4.parameters(), "lr": 1e-4},
    {"params": backbone.fc.parameters(), "lr": 1e-3},      # head: normal lr
], lr=1e-3, weight_decay=1e-4)
# Note: parameters not listed in any group (here the stem, conv1 and bn1) are never updated
# This is "discriminative learning rates", common in transfer learning:
# earlier layers have more generic features, need less updating
Adam vs SGD
Scenario                       | Winner  | Why
-------------------------------|---------|------------------------------------------
NLP / Transformers             | Adam    | Sparse gradients, embedding layers benefit from adaptive lr
Computer vision (from scratch) | Toss-up | Adam converges faster, SGD may generalise better
Transfer learning fine-tuning  | Adam    | Few steps, adaptive lr helps
Small datasets                 | SGD+mom | Adam's adaptivity can overfit; SGD with momentum is more conservative
Hyperparameter sensitivity     | Adam    | Much less sensitive to lr than SGD
Production training budget     | Adam    | Less tuning needed
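One reason behind the "Hyperparameter sensitivity" row can be sketched numerically: Adam's step m̂ / √v̂ is almost unchanged when the loss, and so every gradient, is multiplied by a constant, whereas an SGD step scales with it. The functions `run_adam` and `run_sgd`, the scalar loss scale · w², and the scale factors are illustrative assumptions, and plain SGD without momentum is used for brevity.

```python
import math

def run_adam(scale, steps=50, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise scale * w**2 from w = 1.0 with Adam; return the final w."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * scale * w
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def run_sgd(scale, steps=50, lr=0.01):
    """Minimise scale * w**2 from w = 1.0 with plain SGD; return the final w."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * scale * w
    return w

# Same lr, two loss scales that differ by a factor of 100
for scale in (1.0, 100.0):
    print(f"scale={scale:6.1f}  adam w={run_adam(scale):.4f}  sgd w={run_sgd(scale):.4f}")
```

Adam ends at nearly the same w for both scales because the factor cancels between m̂ and √v̂. For SGD the same lr that works at scale 1 gives a per-step multiplier of 1 - lr · 2 · scale = -1 at scale 100, so w flips sign every step and makes no progress; the lr would have to be retuned.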
Empirical observation (many papers):
Adam usually reaches a good solution faster than SGD.
SGD+momentum often finds a slightly better solution given enough time.
For most practitioners: use AdamW; switch to SGD if you need that last 0.5%.
Monitoring Adam Internals
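def inspect_adam_state(optimizer: torch.optim.Optimizer, model: nn.Module) -> None:
    """Print Adam's moment estimates for each parameter."""
    # Map each parameter to its group so per-group lr and eps are used
    group_of = {p: group for group in optimizer.param_groups for p in group["params"]}
    for name, param in model.named_parameters():
        state = optimizer.state.get(param)
        if not state or param not in group_of:
            continue  # not tracked by the optimizer, or no step taken yet
        group = group_of[param]
        step = int(state["step"])  # stored as a tensor in recent PyTorch versions
        exp_avg = state["exp_avg"].abs().mean().item()  # mean |m_t|
        exp_avg_sq = state["exp_avg_sq"].mean().item()  # mean v_t
        # Rough estimate only: ignores bias correction and averages v_t over the tensor
        effective_lr = group["lr"] / (exp_avg_sq ** 0.5 + group["eps"])
        print(f"{name:30s}: step={step}, |m|={exp_avg:.6f}, v={exp_avg_sq:.8f}, eff_lr≈{effective_lr:.6f}")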
# Call after at least one optimizer.step()
Interview Answer
"Adam maintains two running statistics per weight: the first moment m_t (an exponential moving average of gradients, like momentum) and the second moment v_t (an EMA of squared gradients, like a per-weight uncentred variance estimate). The update is θ ← θ - α · m̂_t / (√v̂_t + ε). The denominator scales down the step for weights that receive large gradients and scales it up for weights with small or infrequent gradients; this is the adaptive learning rate. The bias-correction terms (dividing by 1 - β^t) compensate for zero initialisation in early steps. AdamW is the preferred variant when using weight decay: it applies the decay directly to the weights rather than folding it into the gradient, where the adaptive scaling would distort it. In practice, AdamW is the default for transformers and NLP; SGD with momentum sometimes edges out AdamW on vision tasks given sufficient training time."