Gradient Descent

The Core Idea

Goal: find weights W that minimise loss L(W).

The gradient ∇L(W) points in the direction of steepest increase.
So we move in the opposite direction — the negative gradient.

Update rule:
  W ← W - α · ∇L(W)

where α (learning rate) controls the step size.
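
To make the update rule concrete, here is a minimal sketch on a one-dimensional loss L(w) = (w - 3)^2, whose gradient is 2(w - 3). The function name `descend_1d` and the choice of loss are illustrative assumptions, not part of the article's own code.

```python
def descend_1d(w0: float, lr: float, n_steps: int) -> list[float]:
    """Apply w <- w - lr * dL/dw to L(w) = (w - 3)**2 and record every iterate."""
    w = w0
    history = [w]
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)   # dL/dw for L(w) = (w - 3)^2
        w = w - lr * grad        # step against the gradient
        history.append(w)
    return history

path_1d = descend_1d(w0=0.0, lr=0.1, n_steps=50)
print(f"start={path_1d[0]:.3f}, after 1 step={path_1d[1]:.3f}, final={path_1d[-1]:.6f}")
```

With this learning rate, each step shrinks the distance to the minimum at w = 3 by a factor of 0.8, so the iterates approach 3 without ever overshooting it.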

Intuition: standing on a hilly landscape in fog.
  You can only feel the slope under your feet (local gradient).
  You step downhill each time.
  Eventually (if lucky) you reach a valley (local minimum).

Vanilla Gradient Descent in NumPy

Python

import numpy as np

def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.01,
    n_epochs: int = 1000,
) -> tuple[np.ndarray, list[float]]:
    """Gradient descent for linear regression (MSE loss)."""
    n, d = X.shape
    W = np.zeros(d)
    losses = []
    
    for epoch in range(n_epochs):
        # Forward pass — compute predictions and loss
        y_hat = X @ W                    # (n,)
        residuals = y_hat - y            # (n,)
        loss = (residuals ** 2).mean()   # MSE
        losses.append(loss)
        
        # Backward pass — compute gradient
        grad_W = (2 / n) * (X.T @ residuals)  # (d,) — dL/dW
        
        # Weight update
        W = W - lr * grad_W
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d}: loss={loss:.6f}")
    
    return W, losses

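# Example: predict a readmission risk score from 3 features
np.random.seed(42)
n_patients = 500
raw = np.column_stack([
    np.random.normal(65, 15, n_patients),          # age
    np.random.normal(2.5, 0.8, n_patients),        # INR
    np.random.randint(0, 10, n_patients).astype(float),  # n_meds
])
# Standardise the features. On the raw scale, age (~65) makes the loss so
# steeply curved that lr=0.001 overshoots and the loss blows up.
features = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.column_stack([np.ones(n_patients), features])  # bias term first
true_W = np.array([0.1, 0.005, 0.08, 0.03])            # weights on standardised features
y = X @ true_W + np.random.normal(0, 0.05, n_patients)

W_learned, losses = gradient_descent(X, y, lr=0.1, n_epochs=1000)
print(f"True W:    {true_W}")
print(f"Learned W: {W_learned.round(3)}")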
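
The analytic gradient used in the backward pass, (2/n) · Xᵀ(XW - y), can be sanity-checked against central finite differences. The sketch below is only an illustrative check: the helper names (`mse_loss`, `analytic_grad`, `numerical_grad`) and the small random data are assumptions introduced here.

```python
import numpy as np

def mse_loss(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of the linear model X @ W."""
    return float(((X @ W - y) ** 2).mean())

def analytic_grad(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Closed-form gradient of the MSE: (2 / n) * X^T (X W - y)."""
    n = X.shape[0]
    return (2.0 / n) * (X.T @ (X @ W - y))

def numerical_grad(W: np.ndarray, X: np.ndarray, y: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central finite differences, one coordinate of W at a time."""
    grad = np.zeros_like(W)
    for i in range(W.size):
        step = np.zeros_like(W)
        step[i] = eps
        grad[i] = (mse_loss(W + step, X, y) - mse_loss(W - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 4))
y_chk = rng.normal(size=50)
W_chk = rng.normal(size=4)

max_diff = np.max(np.abs(
    analytic_grad(W_chk, X_chk, y_chk) - numerical_grad(W_chk, X_chk, y_chk)
))
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

A large discrepancy here would point to a mistake in the gradient formula, such as a missing factor of 2/n or a transposed matrix.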

Loss Landscape

The loss landscape is a high-dimensional surface over weight space.

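Good landscapes (convex):
  Linear regression (MSE): convex, so every minimum is global (unique when X has full column rank); GD converges for a small enough learning rate
  Logistic regression: convex too, but with no closed-form solution; it typically needs more iterations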

Neural network landscapes (non-convex):
  Many local minima, saddle points, flat regions (plateaus)
  
  Saddle point: gradient = 0 but not a minimum
    ↳ gradient descent slows to a crawl nearby; noise or momentum helps it escape
  
  Plateau: nearly-zero gradient over a wide region
    ↳ very slow progress, can seem stuck
  
  Sharp vs flat minima:
    Sharp minimum: generalises poorly (small weight perturbation → large loss increase)
    Flat minimum: generalises better (robust to noise)
    SGD's noise tends to steer training toward flatter minima, a common explanation for why it often generalises better than exact GD
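
The saddle-point behaviour can be illustrated on f(x, y) = x^2 - y^2, which has a saddle at the origin. This is a toy sketch: the function `gd_on_saddle`, the starting points, and the step count are assumptions chosen for illustration, and the toy loss is unbounded below, so "escaping" here simply means moving away from the saddle.

```python
def gd_on_saddle(x0: float, y0: float, lr: float = 0.1, n_steps: int = 150) -> tuple[float, float]:
    """Run plain GD on f(x, y) = x**2 - y**2 (saddle point at the origin)."""
    x, y = x0, y0
    for _ in range(n_steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y   # partial derivatives of f
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y

# Start exactly on the x-axis: the y-gradient is zero forever, so GD walks into the saddle.
stuck = gd_on_saddle(1.0, 0.0)
# A tiny perturbation in y grows by a factor (1 + 2 * lr) per step and eventually escapes.
escaped = gd_on_saddle(1.0, 1e-8)

print(f"on the axis:    x={stuck[0]:.2e}, y={stuck[1]:.2e}")
print(f"perturbed by 1e-8: x={escaped[0]:.2e}, y={escaped[1]:.2e}")
```

The perturbed run needs many steps before y becomes noticeable, which is the "slow near a saddle" effect; gradient noise in SGD supplies such perturbations for free.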

Python

import torch

def rosenbrock(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Classic non-convex test function: minimum at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def trace_gradient_descent(
    start: tuple[float, float],
    lr: float = 0.001,
    n_steps: int = 500,
) -> list[tuple[float, float, float]]:
    """Trace GD trajectory on Rosenbrock."""
    x = torch.tensor(start[0], requires_grad=True, dtype=torch.float64)
    y = torch.tensor(start[1], requires_grad=True, dtype=torch.float64)
    
    trajectory = []
    for _ in range(n_steps):
        loss = rosenbrock(x, y)
        trajectory.append((x.item(), y.item(), loss.item()))
        
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
            y -= lr * y.grad
        x.grad.zero_()
        y.grad.zero_()
    
    return trajectory

path = trace_gradient_descent((-1.5, 1.5), lr=0.001, n_steps=2000)
final_x, final_y, final_loss = path[-1]
print(f"Final position: ({final_x:.4f}, {final_y:.4f}), loss: {final_loss:.6f}")
# True minimum at (1, 1) with loss 0

PyTorch Training Step

Python

import torch
import torch.nn as nn

class ClinicalRiskModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ClinicalRiskModel(n_features=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

def gradient_descent_step(
    model: nn.Module,
    X: torch.Tensor,
    y: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
) -> float:
    """One step of gradient descent."""
    # 1. Zero gradients (accumulated from previous step)
    optimizer.zero_grad()
    
    # 2. Forward pass — compute predictions
    logits = model(X).squeeze(-1)  # (batch, 1) -> (batch,); plain squeeze() would also drop a batch of size 1
    
    # 3. Compute loss
    loss = criterion(logits, y)
    
    # 4. Backward pass — compute gradients via autograd
    loss.backward()
    
    # 5. Update weights — W ← W - lr * grad
    optimizer.step()
    
    return loss.item()

# Inspect gradients after backward
X_batch = torch.randn(32, 10)
y_batch = torch.randint(0, 2, (32,)).float()

optimizer.zero_grad()
loss = criterion(model(X_batch).squeeze(-1), y_batch)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"{name:30s}: grad norm = {grad_norm:.6f}")
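
Steps 1 and 5 of the training step can be checked on a toy tensor. The sketch below assumes plain `torch.optim.SGD` with no momentum or weight decay; the tensor `w_toy` and the helper `toy_loss` are hypothetical names introduced only for this check.

```python
import torch

w_toy = torch.tensor([1.0, -2.0], requires_grad=True)

def toy_loss(w: torch.Tensor) -> torch.Tensor:
    return (w ** 2).sum()   # gradient is 2 * w

# backward() adds into .grad instead of overwriting it,
# which is why gradients must be cleared before each step.
toy_loss(w_toy).backward()
first = w_toy.grad.clone()
toy_loss(w_toy).backward()
accumulated = w_toy.grad.clone()
print("first backward: ", first)
print("second backward:", accumulated)

# With gradients cleared, one SGD step matches the manual rule W <- W - lr * grad.
opt = torch.optim.SGD([w_toy], lr=0.1)
opt.zero_grad()
toy_loss(w_toy).backward()
expected = (w_toy - 0.1 * w_toy.grad).detach().clone()
opt.step()
print("step matches manual update:", torch.allclose(w_toy.detach(), expected))
```

Skipping `zero_grad()` would make every update use the sum of all gradients computed so far, which acts like an unintended and growing learning rate.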

Learning Rate Effect

Python

import torch
import torch.nn as nn

def compare_learning_rates(
    lrs: list[float],
    n_epochs: int = 100,
) -> dict[float, list[float]]:
    """Compare convergence speed across learning rates."""
    results = {}

    # Use the same data for every learning rate so the comparison is fair
    X = torch.randn(200, 10)
    y = torch.randn(200, 1)

    for lr in lrs:
        torch.manual_seed(0)  # same initial weights for every run
        model = nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        criterion = nn.MSELoss()
        
        epoch_losses = []
        for _ in range(n_epochs):
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            epoch_losses.append(loss.item())
        
        results[lr] = epoch_losses
    
    return results

# Too small: slow convergence
# Too large: oscillates or diverges (loss increases)
# Just right: steady decrease
lrs = [0.0001, 0.001, 0.01, 0.1, 1.0]
history = compare_learning_rates(lrs)

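for lr, losses in history.items():
    # `not (a <= b)` is also True when the final loss is inf or NaN,
    # which is how a diverged run usually ends
    diverged = not (losses[-1] <= losses[0])
    status = "diverged" if diverged else f"final={losses[-1]:.4f}"
    print(f"lr={lr:.4f}: {status}")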
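
The three regimes can be seen exactly on a quadratic. For L(w) = w^2 the update becomes w ← (1 - 2α)w, so plain GD converges only when α < 1 (in general, when α < 2/λ for curvature λ). The helper `final_abs_weight` below is a hypothetical name used for this sketch.

```python
def final_abs_weight(lr: float, w0: float = 1.0, n_steps: int = 50) -> float:
    """Run GD on L(w) = w**2 (gradient 2w) and return |w| after n_steps."""
    w = w0
    for _ in range(n_steps):
        w = w - lr * 2.0 * w   # equivalent to w <- (1 - 2 * lr) * w
    return abs(w)

for lr in [0.1, 0.9, 1.0, 1.1]:
    print(f"lr={lr}: |w| after 50 steps = {final_abs_weight(lr):.3e}")
```

At lr=0.1 the weight shrinks steadily; at lr=0.9 it flips sign every step but still shrinks; at lr=1.0 it bounces between +1 and -1 forever; at lr=1.1 each step multiplies it by -1.2, so it diverges.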

Convergence Criteria

Python

import torch.nn as nn

def train_with_convergence_check(
    model: nn.Module,
    loader,
    optimizer,
    criterion,
    max_epochs: int = 1000,
    min_delta: float = 1e-6,
    patience: int = 20,
) -> list[float]:
    """Stop training when loss stops improving."""
    losses = []
    best_loss = float("inf")
    no_improve_count = 0
    
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        n_batches = 0
        
        for X, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(X).squeeze(-1), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            n_batches += 1
        
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        
        # Check improvement
        if avg_loss < best_loss - min_delta:
            best_loss = avg_loss
            no_improve_count = 0
        else:
            no_improve_count += 1
        
        if no_improve_count >= patience:
            print(f"Converged at epoch {epoch} (no improvement for {patience} epochs)")
            break
    
    return losses
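
The patience rule is easier to follow when it is separated from the training loop. The sketch below applies the same improvement test to a plain list of losses; `stopping_epoch` and `toy_losses` are hypothetical names, and the numbers are made up for illustration.

```python
from typing import Optional

def stopping_epoch(
    losses: list[float],
    min_delta: float = 1e-6,
    patience: int = 3,
) -> Optional[int]:
    """Return the epoch index at which the patience rule would stop, or None."""
    best_loss = float("inf")
    no_improve_count = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss - min_delta:
            best_loss = loss           # real improvement: reset the counter
            no_improve_count = 0
        else:
            no_improve_count += 1      # flat or worse: count it
        if no_improve_count >= patience:
            return epoch
    return None

toy_losses = [1.0, 0.5, 0.3, 0.3, 0.31, 0.3, 0.29]
print(stopping_epoch(toy_losses, patience=3))   # stops at epoch 5
print(stopping_epoch(toy_losses, patience=4))   # None: epoch 6 improves in time
```

The second call shows the trade-off: a larger patience tolerates a short plateau and catches the later improvement, at the cost of training longer when the plateau is real.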

Interview Answer

"Gradient descent minimises a loss function by iteratively updating weights in the direction opposite to the gradient: W ← W - α·∇L(W). The gradient points uphill; moving against it descends the loss surface. The learning rate α controls step size: too small and convergence is slow, too large and it oscillates or diverges. In PyTorch, zero_grad() clears accumulated gradients, loss.backward() computes them via autograd, and optimizer.step() applies the update. For neural networks, the loss landscape is non-convex with saddle points and plateaus, which is why SGD's noise and momentum help so much in practice. Full-batch gradient descent needs a pass over the entire dataset for every update, which is impractical for large datasets, so mini-batch SGD is used instead. Key insight: sharp versus flat minima matter for generalisation. Flat minima tend to generalise better, and SGD's stochasticity biases training toward them."

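The answer mentions mini-batch SGD as the practical replacement for full-batch gradient descent. A minimal NumPy sketch for linear regression with MSE loss follows; the function `minibatch_sgd`, its defaults, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def minibatch_sgd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.05,
    batch_size: int = 32,
    n_epochs: int = 50,
    seed: int = 0,
) -> np.ndarray:
    """Mini-batch SGD for linear regression (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = X[idx] @ W - y[idx]
            grad = (2.0 / len(idx)) * (X[idx].T @ residuals)  # gradient on this batch only
            W = W - lr * grad
    return W

rng_demo = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(400), rng_demo.normal(size=(400, 3))])
W_true = np.array([0.5, -1.0, 2.0, 0.3])
y_demo = X_demo @ W_true + rng_demo.normal(0, 0.05, 400)

W_sgd = minibatch_sgd(X_demo, y_demo)
print("True W:", W_true)
print("SGD W: ", np.round(W_sgd, 2))
```

Each update touches only 32 rows, so its cost does not grow with the dataset; the price is a noisy gradient, which is the same noise the answer credits with favouring flatter minima.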