Deep Learning for AI Interviews · Lesson 34 of 56
Chain Rule: The Math Behind Backprop
The Chain Rule
For composed functions: if L = f(g(x)), then:
dL/dx = dL/dg × dg/dx
This is the chain rule — the foundation of backpropagation.
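As a quick numeric sanity check of the rule, the sketch below compares the chain-rule product against a central finite difference. The two functions, g(x) = x² + 1 and f(g) = sin(g), are illustrative choices made for this sketch and are not part of the lesson.

```python
import math

# Illustrative functions (assumptions for this sketch): g(x) = x^2 + 1, f(g) = sin(g)
def g(x: float) -> float:
    return x * x + 1.0

def f(u: float) -> float:
    return math.sin(u)

def chain_rule_grad(x: float) -> float:
    """dL/dx = dL/dg * dg/dx for L = f(g(x))."""
    dL_dg = math.cos(g(x))   # derivative of sin, evaluated at g(x)
    dg_dx = 2.0 * x          # derivative of x^2 + 1
    return dL_dg * dg_dx

def finite_diff_grad(x: float, eps: float = 1e-6) -> float:
    """Central-difference approximation of dL/dx."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 0.7
analytic = chain_rule_grad(x0)
numeric = finite_diff_grad(x0)
print(f"chain rule:  {analytic:.6f}")
print(f"finite diff: {numeric:.6f}")
```

The two printed values should agree to several decimal places; a mismatch usually points to a wrong local derivative.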
Multi-layer example:
L = loss(a3)
a3 = relu(z3) where z3 = a2 @ W3 + b3
a2 = relu(z2) where z2 = a1 @ W2 + b2
a1 = relu(z1) where z1 = x @ W1 + b1
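dL/dW1 = dL/da3 × da3/dz3 × dz3/da2 × da2/dz2 × dz2/da1 × da1/dz1 × dz1/dW1
= dL/da3 × relu'(z3) × W3.T × relu'(z2) × W2.T × relu'(z1) × x
Each factor is the partial derivative of a single operation. The × here is schematic: the relu' factors apply elementwise and the W.T and x factors are matrix products, so shapes have to be matched in actual code.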
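The factor-by-factor product for the three-layer example can be sketched in NumPy without autograd and checked against finite differences. The layer sizes, the random data, and the loss L = 0.5 · sum((a3 - y)²) are assumptions made for this sketch; the lesson only specifies L = loss(a3).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hidden, d_out = 4, 3, 5, 2    # arbitrary sizes for the sketch

x = rng.normal(size=(batch, d_in))
y = rng.normal(size=(batch, d_out))          # hypothetical targets
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_hidden))
W3 = rng.normal(size=(d_hidden, d_out))
b1, b2, b3 = np.zeros(d_hidden), np.zeros(d_hidden), np.zeros(d_out)


def forward(W1_):
    """Forward pass; returns the loss and the values needed for backprop."""
    z1 = x @ W1_ + b1
    a1 = np.maximum(z1, 0.0)
    z2 = a1 @ W2 + b2
    a2 = np.maximum(z2, 0.0)
    z3 = a2 @ W3 + b3
    a3 = np.maximum(z3, 0.0)
    loss = 0.5 * np.sum((a3 - y) ** 2)       # assumed loss for this sketch
    return loss, (z1, z2, z3, a3)


def backward_W1(cache):
    """Apply the chain rule one factor at a time, from the loss back to W1."""
    z1, z2, z3, a3 = cache
    dL_da3 = a3 - y                          # dL/da3 for the assumed loss
    dL_dz3 = dL_da3 * (z3 > 0)               # times relu'(z3), elementwise
    dL_da2 = dL_dz3 @ W3.T                   # times W3.T
    dL_dz2 = dL_da2 * (z2 > 0)               # times relu'(z2)
    dL_da1 = dL_dz2 @ W2.T                   # times W2.T
    dL_dz1 = dL_da1 * (z1 > 0)               # times relu'(z1)
    return x.T @ dL_dz1                      # times x; shape (d_in, d_hidden)


def numeric_grad_W1(eps=1e-5):
    """Central finite differences over every entry of W1."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (forward(W_plus)[0] - forward(W_minus)[0]) / (2 * eps)
    return grad


_, cache = forward(W1)
dW1_manual = backward_W1(cache)
dW1_numeric = numeric_grad_W1()
print("max abs difference:", np.abs(dW1_manual - dW1_numeric).max())
```

Each line of backward_W1 consumes one factor of the product, so every intermediate array has the shape of one layer's activations rather than the shape of a full Jacobian.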
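Backprop: evaluate this product starting from the loss, i.e. left to right as written (output → input), so that every intermediate result is the gradient of L with respect to one layer's values.
Single Neuron: Chain Rule Step by Step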
import torch
# Single neuron: z = w·x + b, a = σ(z), L = (a - y)²
x = torch.tensor(2.0, requires_grad=False)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)
y = torch.tensor(1.0) # target
# Forward
z = w * x + b
a = torch.sigmoid(z)
L = (a - y) ** 2
print(f"z = {z.item():.4f}")
print(f"a = {a.item():.4f}")
print(f"L = {L.item():.4f}")
# Backward (autograd)
L.backward()
print(f"\nAutograd: dL/dw = {w.grad.item():.6f}")
print(f"Autograd: dL/db = {b.grad.item():.6f}")
# Manual chain rule verification:
# dL/da = 2(a - y)
# da/dz = a(1 - a) [sigmoid derivative]
# dz/dw = x
# dz/db = 1
# dL/dw = dL/da × da/dz × dz/dw
a_val = a.item()
dL_da = 2 * (a_val - y.item())
da_dz = a_val * (1 - a_val)
dz_dw = x.item()
dz_db = 1.0
dL_dw_manual = dL_da * da_dz * dz_dw
dL_db_manual = dL_da * da_dz * dz_db
print(f"\nManual: dL/dw = {dL_dw_manual:.6f}")
print(f"Manual: dL/db = {dL_db_manual:.6f}")
The Jacobian for Layers
For a linear layer: z = x @ W.T + b
z is (batch, d_out), x is (batch, d_in)
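Partial derivatives (W has shape (d_out, d_in)):
dz/dW: z[n, i] depends only on row i of W, with dz[n, i]/dW[i, j] = x[n, j]
dz/db = 1 (so the bias gradient is just the delta, summed over the batch)
dz/dx = W (each output neuron depends on all inputs)
If δ = dL/dz (gradient flowing back from the next layer, shape (batch, d_out)), then:
dL/dW = δ.T @ x (outer products of δ[n] and x[n], summed over the batch; shape (d_out, d_in))
dL/db = δ.sum(dim=0) (sum over batch)
dL/dx = δ @ W (pass gradient to previous layer)
Note that x.T @ δ is the weight gradient for the other convention, z = x @ W + b with W of shape (d_in, d_out).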
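The matrix forms can be derived by writing the layer z = x @ W.T + b in index notation. In this short derivation, n is the batch index and i, j are the output and input indices; the index names are introduced here and are not from the lesson.

```latex
\begin{aligned}
z_{ni} &= \sum_{j} x_{nj} W_{ij} + b_i, \qquad \delta_{ni} = \frac{\partial L}{\partial z_{ni}} \\
\frac{\partial L}{\partial W_{ij}} &= \sum_{n} \delta_{ni}\, x_{nj} = (\delta^{\top} x)_{ij} \\
\frac{\partial L}{\partial b_i} &= \sum_{n} \delta_{ni} \\
\frac{\partial L}{\partial x_{nj}} &= \sum_{i} \delta_{ni}\, W_{ij} = (\delta W)_{nj}
\end{aligned}
```

With W stored as (d_out, d_in), the weight gradient therefore has the same shape as W, which is a useful check when implementing the backward pass by hand.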
This is why backprop through a linear layer is another linear operation:
forward: z = x @ W.T + b
backward: δ_prev = δ_curr @ W (same W, transposed in effect)
import torch
# Verify Jacobian manually for a linear layer
batch, d_in, d_out = 4, 3, 5
x = torch.randn(batch, d_in, requires_grad=True)
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
z = x @ W.T + b
L = z.sum()
L.backward()
# Manual derivatives
delta_z = torch.ones(batch, d_out) # dL/dz = 1 (since L = sum)
dW_manual = delta_z.T @ x # (d_out, d_in)
db_manual = delta_z.sum(dim=0) # (d_out,)
dx_manual = delta_z @ W # (batch, d_in)
print(f"dW error: {(W.grad - dW_manual).abs().max().item():.2e}")
print(f"db error: {(b.grad - db_manual).abs().max().item():.2e}")
print(f"dx error: {(x.grad - dx_manual).abs().max().item():.2e}")
# All should be essentially zero (floating point)
Activation Function Gradients
import torch
# Each activation function has its own gradient formula.
# Note: sigmoid_grad and tanh_grad take the activation a (the output),
# while relu_grad takes the pre-activation z.
# Sigmoid: a = 1/(1+e^{-z}), da/dz = a(1-a)
def sigmoid_grad(a: torch.Tensor) -> torch.Tensor:
    return a * (1 - a)
# ReLU: a = max(0, z), da/dz = 1 if z > 0 else 0
def relu_grad(z: torch.Tensor) -> torch.Tensor:
    return (z > 0).float()
# Tanh: a = tanh(z), da/dz = 1 - a²
def tanh_grad(a: torch.Tensor) -> torch.Tensor:
    return 1 - a ** 2
z_test = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
sigmoid_a = torch.sigmoid(z_test)
print("Sigmoid gradients:", sigmoid_grad(sigmoid_a).numpy().round(4))
# Max gradient at z=0 (0.25), saturates near 0 at extremes → vanishing gradient
print("ReLU gradients:", relu_grad(z_test).numpy())
# Binary: 0 for negative z (dead neurons), 1 for positive (no saturation)
tanh_a = torch.tanh(z_test)
print("Tanh gradients:", tanh_grad(tanh_a).numpy().round(4))
# Peaks at 1.0 at z=0 (vs 0.25 for sigmoid) and the output is centred at 0; still saturates
Gradient Accumulation via Chain Rule
import torch
import torch.nn as nn
def trace_gradients(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> None:
    """Show how gradient norm changes through layers after backward pass."""
    criterion = nn.BCEWithLogitsLoss()
    # The optimizer is only used to clear stale gradients; no step is taken.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    optimizer.zero_grad()
    # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its shape
    loss = criterion(model(X).squeeze(-1), y)
    loss.backward()
    print("\nGradient norms (output → input):")
    params_reversed = list(reversed(list(model.named_parameters())))
    for name, param in params_reversed:
        if param.grad is not None:
            norm = param.grad.norm().item()
            print(f" {name:30s}: {norm:.2e}")
# Sigmoid activations: vanishing gradients through many layers
sigmoid_net = nn.Sequential(
nn.Linear(10, 32), nn.Sigmoid(),
nn.Linear(32, 32), nn.Sigmoid(),
nn.Linear(32, 32), nn.Sigmoid(),
nn.Linear(32, 1),
)
# ReLU activations: better gradient flow
relu_net = nn.Sequential(
nn.Linear(10, 32), nn.ReLU(),
nn.Linear(32, 32), nn.ReLU(),
nn.Linear(32, 32), nn.ReLU(),
nn.Linear(32, 1),
)
X = torch.randn(16, 10)
y = torch.randint(0, 2, (16,)).float()
print("=== Sigmoid network ===")
trace_gradients(sigmoid_net, X, y)
print("\n=== ReLU network ===")
trace_gradients(relu_net, X, y)
# ReLU should show more consistent gradients across layers
Interview Answer
"The chain rule states dL/dx = (dL/dg) × (dg/dx) for composed functions. Backpropagation applies this recursively: the gradient at layer l is computed from the gradient at layer l+1 multiplied by the local partial derivative at layer l. For a linear layer z = x @ W.T + b with δ = dL/dz, the backward pass gives dL/dW = δ.T @ x, dL/db = δ summed over the batch, and dL/dx = δ @ W, which passes the gradient to the previous layer. For activation functions: sigmoid has gradient a(1-a), which is at most 0.25; ReLU has gradient 1 for z > 0 and 0 for z ≤ 0. The sigmoid gradient approaches 0 as z → ±infinity, and multiplying many such terms causes vanishing gradients in deep sigmoid networks. ReLU mitigates this because its gradient on active units is exactly 1, not a decaying fraction, although units with z ≤ 0 pass no gradient at all. PyTorch autograd implements the chain rule automatically by recording operations on a dynamic computation graph and traversing it in reverse during backward()."
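To make the vanishing-gradient part of this answer concrete, the sketch below multiplies only the activation derivatives along a chain of layers. It deliberately ignores the weight matrices, which also scale the gradient in a real network, so it is a simplified illustration and not a model of actual training; the helper name local_grad_product is hypothetical.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid at pre-activation z: a(1 - a)."""
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1.0 - a)


def relu_grad(z: float) -> float:
    """Derivative of ReLU at pre-activation z: 1 for z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def local_grad_product(pre_activations, grad_fn) -> float:
    """Multiply the activation derivatives along a chain of layers."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


for depth in (1, 5, 10, 20):
    # z = 0 is the best case for sigmoid (derivative peaks at 0.25)
    sig_best = local_grad_product([0.0] * depth, sigmoid_grad)
    # z = 3 is a mildly saturated unit
    sig_sat = local_grad_product([3.0] * depth, sigmoid_grad)
    # every ReLU unit on the path is active
    relu_active = local_grad_product([1.0] * depth, relu_grad)
    print(f"depth {depth:2d}: sigmoid(z=0) {sig_best:.2e}  "
          f"sigmoid(z=3) {sig_sat:.2e}  relu(z>0) {relu_active:.1f}")
```

Even in the sigmoid's best case the product shrinks as 0.25 to the power of the depth (about 9.5e-7 at depth 10), while a path of active ReLU units keeps a factor of exactly 1. A single inactive ReLU unit on the path makes the product 0, which is the dead-neuron case.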