Gradient Descent
How gradient descent minimises a loss function by following the negative gradient โ the core algorithm behind all neural network training.
The Core Idea
Goal: find weights W that minimise loss L(W).
The gradient โL(W) points in the direction of steepest increase.
So we move in the opposite direction โ the negative gradient.
Update rule:
W โ W - ฮฑ ยท โL(W)
where ฮฑ (learning rate) controls the step size.
Intuition: standing on a hilly landscape in fog.
You can only feel the slope under your feet (local gradient).
You step downhill each time.
Eventually (if lucky) you reach a valley (local minimum).Vanilla Gradient Descent in NumPy
import numpy as np
def gradient_descent(
X: np.ndarray,
y: np.ndarray,
lr: float = 0.01,
n_epochs: int = 1000,
) -> tuple[np.ndarray, list[float]]:
"""Gradient descent for linear regression (MSE loss)."""
n, d = X.shape
W = np.zeros(d)
losses = []
for epoch in range(n_epochs):
# Forward pass โ compute predictions and loss
y_hat = X @ W # (n,)
residuals = y_hat - y # (n,)
loss = (residuals ** 2).mean() # MSE
losses.append(loss)
# Backward pass โ compute gradient
grad_W = (2 / n) * (X.T @ residuals) # (d,) โ dL/dW
# Weight update
W = W - lr * grad_W
if epoch % 100 == 0:
print(f"Epoch {epoch:4d}: loss={loss:.6f}")
return W, losses
# Example: predict readmission from 3 features
np.random.seed(42)
n_patients = 500
X = np.column_stack([
np.ones(n_patients), # bias term
np.random.normal(65, 15, n_patients), # age
np.random.normal(2.5, 0.8, n_patients), # INR
np.random.randint(0, 10, n_patients).astype(float), # n_meds
])
true_W = np.array([0.1, 0.005, 0.08, 0.03])
y = X @ true_W + np.random.normal(0, 0.05, n_patients)
W_learned, losses = gradient_descent(X, y, lr=0.001, n_epochs=1000)
print(f"True W: {true_W}")
print(f"Learned W: {W_learned.round(3)}")Loss Landscape
The loss landscape is a high-dimensional surface over weight space.
Good landscapes (convex):
Linear regression (MSE): one global minimum, guaranteed convergence
Logistic regression: convex, but slower to converge
Neural network landscapes (non-convex):
Many local minima, saddle points, flat regions (plateaus)
Saddle point: gradient = 0 but not a minimum
โณ gradient descent stalls here without momentum
Plateau: nearly-zero gradient over a wide region
โณ very slow progress, can seem stuck
Sharp vs flat minima:
Sharp minimum: generalises poorly (small weight perturbation โ large loss increase)
Flat minimum: generalises better (robust to noise)
SGD's noise tends to find flatter minima โ why it often outperforms exact GDimport torch
import matplotlib.pyplot as plt
def rosenbrock(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
"""Classic non-convex test function: minimum at (1, 1)."""
return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
def trace_gradient_descent(
start: tuple[float, float],
lr: float = 0.001,
n_steps: int = 500,
) -> list[tuple[float, float, float]]:
"""Trace GD trajectory on Rosenbrock."""
x = torch.tensor(start[0], requires_grad=True, dtype=torch.float64)
y = torch.tensor(start[1], requires_grad=True, dtype=torch.float64)
trajectory = []
for _ in range(n_steps):
loss = rosenbrock(x, y)
trajectory.append((x.item(), y.item(), loss.item()))
loss.backward()
with torch.no_grad():
x -= lr * x.grad
y -= lr * y.grad
x.grad.zero_()
y.grad.zero_()
return trajectory
path = trace_gradient_descent((-1.5, 1.5), lr=0.001, n_steps=2000)
final_x, final_y, final_loss = path[-1]
print(f"Final position: ({final_x:.4f}, {final_y:.4f}), loss: {final_loss:.6f}")
# True minimum at (1, 1) with loss 0PyTorch Training Step
import torch
import torch.nn as nn
class ClinicalRiskModel(nn.Module):
def __init__(self, n_features: int):
super().__init__()
self.net = nn.Sequential(
nn.Linear(n_features, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, 1),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
model = ClinicalRiskModel(n_features=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()
def gradient_descent_step(
model: nn.Module,
X: torch.Tensor,
y: torch.Tensor,
optimizer: torch.optim.Optimizer,
criterion: nn.Module,
) -> float:
"""One step of gradient descent."""
# 1. Zero gradients (accumulated from previous step)
optimizer.zero_grad()
# 2. Forward pass โ compute predictions
logits = model(X).squeeze()
# 3. Compute loss
loss = criterion(logits, y)
# 4. Backward pass โ compute gradients via autograd
loss.backward()
# 5. Update weights โ W โ W - lr * grad
optimizer.step()
return loss.item()
# Inspect gradients after backward
X_batch = torch.randn(32, 10)
y_batch = torch.randint(0, 2, (32,)).float()
optimizer.zero_grad()
loss = criterion(model(X_batch).squeeze(), y_batch)
loss.backward()
for name, param in model.named_parameters():
if param.grad is not None:
grad_norm = param.grad.norm().item()
print(f"{name:30s}: grad norm = {grad_norm:.6f}")Learning Rate Effect
import torch
def compare_learning_rates(
lrs: list[float],
n_epochs: int = 100,
) -> dict[float, list[float]]:
"""Compare convergence speed across learning rates."""
results = {}
for lr in lrs:
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
criterion = nn.MSELoss()
X = torch.randn(200, 10)
y = torch.randn(200, 1)
epoch_losses = []
for _ in range(n_epochs):
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()
epoch_losses.append(loss.item())
results[lr] = epoch_losses
return results
# Too small: slow convergence
# Too large: oscillates or diverges (loss increases)
# Just right: steady decrease
lrs = [0.0001, 0.001, 0.01, 0.1, 1.0]
history = compare_learning_rates(lrs)
for lr, losses in history.items():
status = "diverged" if losses[-1] > losses[0] else f"final={losses[-1]:.4f}"
print(f"lr={lr:.4f}: {status}")Convergence Criteria
def train_with_convergence_check(
model: nn.Module,
loader,
optimizer,
criterion,
max_epochs: int = 1000,
min_delta: float = 1e-6,
patience: int = 20,
) -> list[float]:
"""Stop training when loss stops improving."""
losses = []
best_loss = float("inf")
no_improve_count = 0
for epoch in range(max_epochs):
epoch_loss = 0.0
n_batches = 0
for X, y in loader:
optimizer.zero_grad()
loss = criterion(model(X).squeeze(), y)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
n_batches += 1
avg_loss = epoch_loss / n_batches
losses.append(avg_loss)
# Check improvement
if avg_loss < best_loss - min_delta:
best_loss = avg_loss
no_improve_count = 0
else:
no_improve_count += 1
if no_improve_count >= patience:
print(f"Converged at epoch {epoch} (no improvement for {patience} epochs)")
break
return lossesInterview Answer
"Gradient descent minimises a loss function by iteratively updating weights in the direction opposite to the gradient: W โ W - ฮฑยทโL(W). The gradient points uphill; moving against it descends the loss surface. The learning rate ฮฑ controls step size โ too small and convergence is slow, too large and it oscillates or diverges. In PyTorch: zero_grad() clears accumulated gradients, loss.backward() computes them via autograd, optimizer.step() applies the update. For neural networks, the loss landscape is non-convex with saddle points and plateaus โ this is why SGD's noise and momentum are essential. Full gradient descent over the entire dataset is computationally impractical for large datasets; mini-batch SGD is used instead. Key insight: the loss landscape's sharp vs flat minima matter for generalisation โ flat minima generalise better, and SGD's stochasticity biases toward them."
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.