Deep Learning for AI Interviews · Lesson 34 of 56

Chain Rule: The Math Behind Backprop

The Chain Rule

For composed functions: if L = f(g(x)), then:

  dL/dx = dL/dg × dg/dx

This is the chain rule — the foundation of backpropagation.
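
As a quick sanity check, the chain rule result can be compared against a finite-difference estimate. The sketch below assumes an illustrative composition, g(x) = x² + 1 and f(u) = sin(u); these functions and all helper names are made up for the example and are not part of the lesson.

```python
import math

# Illustrative composition (an assumption for this sketch): L = f(g(x))
def g(x: float) -> float:
    return x ** 2 + 1.0

def f(u: float) -> float:
    return math.sin(u)

def chain_rule_grad(x: float) -> float:
    dL_dg = math.cos(g(x))   # derivative of the outer function, evaluated at g(x)
    dg_dx = 2.0 * x          # derivative of the inner function
    return dL_dg * dg_dx

def finite_diff_grad(x: float, eps: float = 1e-6) -> float:
    # Central-difference approximation of dL/dx
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 0.7
analytic = chain_rule_grad(x0)
numeric = finite_diff_grad(x0)
print(f"chain rule:  {analytic:.6f}")
print(f"finite diff: {numeric:.6f}")
```

The two printed values should agree to several decimal places; a large gap would point to a mistake in one of the local derivatives.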

Multi-layer example:
  L = loss(a3)
  a3 = relu(z3)     where z3 = a2 @ W3 + b3
  a2 = relu(z2)     where z2 = a1 @ W2 + b2
  a1 = relu(z1)     where z1 = x  @ W1 + b1

dL/dW1 = dL/da3 × da3/dz3 × dz3/da2 × da2/dz2 × dz2/da1 × da1/dz1 × dz1/dW1
       = δ3 × relu'(z3) × W3.T × relu'(z2) × W2.T × relu'(z1) × x

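Each factor is a partial derivative from a single operation. The × is schematic: activation derivatives multiply elementwise, and weight factors are matrix products.
Backprop: compute this product starting from the output-side factor dL/da3 and moving toward the input-side factor dz1/dW1 (left to right as written above, output → input), reusing each partial product for the earlier layers.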
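
The product above can be written out as an explicit backward pass and compared with autograd. This is a minimal sketch under assumptions: the layer sizes, the random seed, and the placeholder loss L = a3.sum() are chosen only for illustration, and the weights use the same x @ W layout as the multi-layer example.

```python
import torch

torch.manual_seed(0)

# Small illustrative sizes (assumptions for this sketch)
batch, d_in, d_h, d_out = 4, 3, 5, 2

x = torch.randn(batch, d_in)
W1 = torch.randn(d_in, d_h, requires_grad=True)
b1 = torch.zeros(d_h, requires_grad=True)
W2 = torch.randn(d_h, d_h, requires_grad=True)
b2 = torch.zeros(d_h, requires_grad=True)
W3 = torch.randn(d_h, d_out, requires_grad=True)
b3 = torch.zeros(d_out, requires_grad=True)

# Forward pass, same structure as the multi-layer example
z1 = x @ W1 + b1
a1 = torch.relu(z1)
z2 = a1 @ W2 + b2
a2 = torch.relu(z2)
z3 = a2 @ W3 + b3
a3 = torch.relu(z3)
L = a3.sum()          # placeholder loss, so dL/da3 is all ones
L.backward()

# Manual backward pass: start at the output and move toward the input
with torch.no_grad():
    dL_da3 = torch.ones_like(a3)
    dL_dz3 = dL_da3 * (z3 > 0).float()   # times relu'(z3), elementwise
    dL_da2 = dL_dz3 @ W3.T               # times dz3/da2
    dL_dz2 = dL_da2 * (z2 > 0).float()   # times relu'(z2), elementwise
    dL_da1 = dL_dz2 @ W2.T               # times dz2/da1
    dL_dz1 = dL_da1 * (z1 > 0).float()   # times relu'(z1), elementwise
    dL_dW1 = x.T @ dL_dz1                # times dz1/dW1, summed over the batch

max_err = (W1.grad - dL_dW1).abs().max().item()
print(f"dL/dW1 max abs error vs autograd: {max_err:.2e}")
```

Each intermediate such as dL_dz2 is computed once and reused, which is what makes backprop cheaper than evaluating every parameter's chain separately.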

Single Neuron: Chain Rule Step by Step

Python

import torch

# Single neuron: z = w·x + b, a = σ(z), L = (a - y)²
x = torch.tensor(2.0, requires_grad=False)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)
y = torch.tensor(1.0)   # target

# Forward
z = w * x + b
a = torch.sigmoid(z)
L = (a - y) ** 2

print(f"z = {z.item():.4f}")
print(f"a = {a.item():.4f}")
print(f"L = {L.item():.4f}")

# Backward (autograd)
L.backward()
print(f"\nAutograd: dL/dw = {w.grad.item():.6f}")
print(f"Autograd: dL/db = {b.grad.item():.6f}")

# Manual chain rule verification:
# dL/da = 2(a - y)
# da/dz = a(1 - a)   [sigmoid derivative]
# dz/dw = x
# dz/db = 1
# dL/dw = dL/da × da/dz × dz/dw

a_val = a.item()
dL_da = 2 * (a_val - y.item())
da_dz = a_val * (1 - a_val)
dz_dw = x.item()
dz_db = 1.0

dL_dw_manual = dL_da * da_dz * dz_dw
dL_db_manual = dL_da * da_dz * dz_db

print(f"\nManual:   dL/dw = {dL_dw_manual:.6f}")
print(f"Manual:   dL/db = {dL_db_manual:.6f}")
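
With these inputs the numbers can be worked by hand: z = 0.5 × 2 − 1 = 0, a = σ(0) = 0.5, L = 0.25, so dL/da = −1, da/dz = 0.25, and therefore dL/dw = −0.5 and dL/db = −0.25. A third, independent check is a finite-difference estimate that uses no derivative formulas at all. The sketch below uses only the standard library; `neuron_loss` and `eps` are illustrative names, not part of the lesson.

```python
import math

def neuron_loss(w: float, b: float, x: float = 2.0, y: float = 1.0) -> float:
    """Forward pass of the single neuron: sigmoid(w*x + b) with squared error."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    return (a - y) ** 2

w0, b0, eps = 0.5, -1.0, 1e-6

# Central differences: nudge one parameter at a time and watch the loss
dL_dw_fd = (neuron_loss(w0 + eps, b0) - neuron_loss(w0 - eps, b0)) / (2 * eps)
dL_db_fd = (neuron_loss(w0, b0 + eps) - neuron_loss(w0, b0 - eps)) / (2 * eps)

print(f"Finite diff: dL/dw = {dL_dw_fd:.6f}")  # hand calculation gives -0.5
print(f"Finite diff: dL/db = {dL_db_fd:.6f}")  # hand calculation gives -0.25
```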

The Jacobian for Layers

For a linear layer: z = x @ W.T + b
  z is (batch, d_out), x is (batch, d_in)

Partial derivatives (schematic, with the index bookkeeping suppressed):
  dz/dW = x     (row i of W only affects output i, with the input x as its coefficient)
  dz/db = 1     (the bias is added directly, so its gradient is just the delta)
  dz/dx = W     (each output neuron depends on all inputs)

If δ = dL/dz (gradient flowing back from next layer), with shape (batch, d_out), then:
  dL/dW = δ.T @ x         (outer product summed over batch; shape (d_out, d_in), same as W)
  dL/db = δ.sum(dim=0)    (sum over batch)
  dL/dx = δ @ W           (pass gradient to previous layer)

This is why backprop through a linear layer is another linear operation:
  forward:  z = x @ W.T + b
  backward: δ_prev = δ_curr @ W   (same W without the transpose, so the backward map is the transpose of the forward one)

Python

import torch

# Verify Jacobian manually for a linear layer
batch, d_in, d_out = 4, 3, 5

x = torch.randn(batch, d_in, requires_grad=True)
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)

z = x @ W.T + b
L = z.sum()
L.backward()

# Manual derivatives
delta_z = torch.ones(batch, d_out)   # dL/dz = 1 (since L = sum)

dW_manual = delta_z.T @ x           # (d_out, d_in)
db_manual = delta_z.sum(dim=0)       # (d_out,)
dx_manual = delta_z @ W             # (batch, d_in)

print(f"dW error: {(W.grad - dW_manual).abs().max().item():.2e}")
print(f"db error: {(b.grad - db_manual).abs().max().item():.2e}")
print(f"dx error: {(x.grad - dx_manual).abs().max().item():.2e}")
# All should be essentially zero (floating point)
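
The claim dz/dx = W can also be checked directly for a single sample, where the Jacobian is an ordinary (d_out, d_in) matrix. The sketch below uses `torch.autograd.functional.jacobian`; the helper `linear`, the sizes, and the seed are assumptions made for the illustration.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d_in, d_out = 3, 5
W = torch.randn(d_out, d_in)
b = torch.randn(d_out)
x_single = torch.randn(d_in)    # one sample, no batch dimension

def linear(x: torch.Tensor) -> torch.Tensor:
    return x @ W.T + b          # same formula as the layer above

J = jacobian(linear, x_single)  # J[i, j] = dz_i / dx_j
print(J.shape)                  # torch.Size([5, 3])
print(torch.allclose(J, W))     # the Jacobian of a linear layer is W itself
```

Backprop never builds this matrix explicitly; it only needs the product δ @ W, which is much cheaper for large layers.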

Activation Function Gradients

Python

import torch

# Each activation function has its own gradient formula

# Sigmoid: a = 1/(1+e^{-z}), da/dz = a(1-a)
def sigmoid_grad(a: torch.Tensor) -> torch.Tensor:
    return a * (1 - a)

# ReLU: a = max(0, z), da/dz = 1 if z > 0 else 0
def relu_grad(z: torch.Tensor) -> torch.Tensor:
    return (z > 0).float()

# Tanh: a = tanh(z), da/dz = 1 - a²
def tanh_grad(a: torch.Tensor) -> torch.Tensor:
    return 1 - a ** 2

z_test = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

sigmoid_a = torch.sigmoid(z_test)
print("Sigmoid gradients:", sigmoid_grad(sigmoid_a).numpy().round(4))
# Max gradient at z=0 (0.25), saturates near 0 at extremes → vanishing gradient

print("ReLU gradients:", relu_grad(z_test).numpy())
# Binary: 0 for negative z (dead neurons), 1 for positive (no saturation)

tanh_a = torch.tanh(z_test)
print("Tanh gradients:", tanh_grad(tanh_a).numpy().round(4))
# Similar to sigmoid but centred at 0; still saturates
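
The closed-form derivatives can be cross-checked against autograd. This is a small sketch, with the helper `autograd_elementwise_grad` and the test points being illustrative choices rather than anything from the lesson; z = 0 is left out so the ReLU kink does not complicate the comparison.

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

def autograd_elementwise_grad(fn, z: torch.Tensor) -> torch.Tensor:
    # Summing the outputs means each z[i] receives exactly d fn(z[i]) / d z[i]
    (grad,) = torch.autograd.grad(fn(z).sum(), z)
    return grad

# Closed-form derivatives from the formulas above
with torch.no_grad():
    s = torch.sigmoid(z)
    t = torch.tanh(z)
    formulas = {
        "sigmoid": s * (1 - s),
        "relu": (z > 0).float(),
        "tanh": 1 - t ** 2,
    }

activations = {"sigmoid": torch.sigmoid, "relu": torch.relu, "tanh": torch.tanh}

checks = {}
for name, fn in activations.items():
    auto = autograd_elementwise_grad(fn, z)
    checks[name] = torch.allclose(auto, formulas[name], atol=1e-6)
    print(f"{name:8s} formula matches autograd: {checks[name]}")
```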

Gradient Accumulation via Chain Rule

Python

import torch
import torch.nn as nn

def trace_gradients(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> None:
    """Show how gradient norm changes through layers after backward pass."""
    criterion = nn.BCEWithLogitsLoss()

    # Clear stale gradients; no optimizer is needed just to inspect them
    model.zero_grad()

    # squeeze(-1) drops only the output dimension, so a batch of size 1 still works
    loss = criterion(model(X).squeeze(-1), y)
    loss.backward()

    print("\nGradient norms (output → input):")
    params_reversed = list(reversed(list(model.named_parameters())))
    for name, param in params_reversed:
        if param.grad is not None:
            norm = param.grad.norm().item()
            print(f"  {name:30s}: {norm:.2e}")

# Sigmoid activations: vanishing gradients through many layers
sigmoid_net = nn.Sequential(
    nn.Linear(10, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 1),
)

# ReLU activations: better gradient flow
relu_net = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

X = torch.randn(16, 10)
y = torch.randint(0, 2, (16,)).float()

print("=== Sigmoid network ===")
trace_gradients(sigmoid_net, X, y)

print("\n=== ReLU network ===")
trace_gradients(relu_net, X, y)
# ReLU typically shows more consistent gradient norms across layers; exact values vary with the random initialisation
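
One way to see why depth hurts sigmoid networks: the sigmoid derivative a(1 - a) is at most 0.25, so the activation part of the chain-rule product through n sigmoid layers is at most 0.25^n. The sketch below only illustrates that bound; `sigmoid_chain_bound` is a hypothetical helper, and the weight matrices, which also scale the gradient, are deliberately ignored.

```python
def sigmoid_chain_bound(n_layers: int) -> float:
    """Upper bound on the product of n sigmoid derivatives (each is at most 0.25)."""
    return 0.25 ** n_layers

for n_layers in (1, 3, 5, 10):
    bound = sigmoid_chain_bound(n_layers)
    print(f"{n_layers:2d} sigmoid layers: activation factor <= {bound:.2e}")
```

For an active ReLU unit the corresponding factor is exactly 1, so the activation part of the product does not shrink with depth.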

Interview Answer

"The chain rule states dL/dx = (dL/dg) × (dg/dx) for composed functions. Backpropagation applies this recursively: the gradient at layer L is computed from the gradient at layer L+1 multiplied by the local partial derivative at L. For a linear layer z = x @ W.T + b: backward gives dL/dW = x.T @ δ and dL/dx = δ @ W (pass gradient to previous layer) where δ = dL/dz. For activation functions: sigmoid has gradient a(1-a), ReLU has gradient 1 for z > 0 (0 for z ≤ 0). The sigmoid gradient saturates near 0 at ±infinity — multiplying many such terms causes vanishing gradients in deep sigmoid networks. ReLU avoids this because its gradient is either 0 or 1, not a decaying fraction. PyTorch autograd implements the chain rule automatically by recording operations on a dynamic computation graph and traversing it in reverse during backward()."

