Learnixo
Back to blog
AI Systemsintermediate

Gradient Descent

How gradient descent minimises a loss function by following the negative gradient โ€” the core algorithm behind all neural network training.

Asma Hafeez KhanMay 22, 20266 min read
Deep LearningGradient DescentOptimisationTrainingInterview
Share:๐•

The Core Idea

Goal: find weights W that minimise loss L(W).

The gradient โˆ‡L(W) points in the direction of steepest increase.
So we move in the opposite direction โ€” the negative gradient.

Update rule:
  W โ† W - ฮฑ ยท โˆ‡L(W)

where ฮฑ (learning rate) controls the step size.

Intuition: standing on a hilly landscape in fog.
  You can only feel the slope under your feet (local gradient).
  You step downhill each time.
  Eventually (if lucky) you reach a valley (local minimum).

Vanilla Gradient Descent in NumPy

Python
import numpy as np

def gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.01,
    n_epochs: int = 1000,
) -> tuple[np.ndarray, list[float]]:
    """Gradient descent for linear regression (MSE loss)."""
    n, d = X.shape
    W = np.zeros(d)
    losses = []
    
    for epoch in range(n_epochs):
        # Forward pass โ€” compute predictions and loss
        y_hat = X @ W                    # (n,)
        residuals = y_hat - y            # (n,)
        loss = (residuals ** 2).mean()   # MSE
        losses.append(loss)
        
        # Backward pass โ€” compute gradient
        grad_W = (2 / n) * (X.T @ residuals)  # (d,) โ€” dL/dW
        
        # Weight update
        W = W - lr * grad_W
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d}: loss={loss:.6f}")
    
    return W, losses

# Example: predict readmission from 3 features
np.random.seed(42)
n_patients = 500
X = np.column_stack([
    np.ones(n_patients),                          # bias term
    np.random.normal(65, 15, n_patients),          # age
    np.random.normal(2.5, 0.8, n_patients),        # INR
    np.random.randint(0, 10, n_patients).astype(float),  # n_meds
])
true_W = np.array([0.1, 0.005, 0.08, 0.03])
y = X @ true_W + np.random.normal(0, 0.05, n_patients)

W_learned, losses = gradient_descent(X, y, lr=0.001, n_epochs=1000)
print(f"True W:    {true_W}")
print(f"Learned W: {W_learned.round(3)}")

Loss Landscape

The loss landscape is a high-dimensional surface over weight space.

Good landscapes (convex):
  Linear regression (MSE): one global minimum, guaranteed convergence
  Logistic regression: convex, but slower to converge

Neural network landscapes (non-convex):
  Many local minima, saddle points, flat regions (plateaus)
  
  Saddle point: gradient = 0 but not a minimum
    โ†ณ gradient descent stalls here without momentum
  
  Plateau: nearly-zero gradient over a wide region
    โ†ณ very slow progress, can seem stuck
  
  Sharp vs flat minima:
    Sharp minimum: generalises poorly (small weight perturbation โ†’ large loss increase)
    Flat minimum: generalises better (robust to noise)
    SGD's noise tends to find flatter minima โ€” why it often outperforms exact GD
Python
import torch
import matplotlib.pyplot as plt

def rosenbrock(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Classic non-convex test function: minimum at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def trace_gradient_descent(
    start: tuple[float, float],
    lr: float = 0.001,
    n_steps: int = 500,
) -> list[tuple[float, float, float]]:
    """Trace GD trajectory on Rosenbrock."""
    x = torch.tensor(start[0], requires_grad=True, dtype=torch.float64)
    y = torch.tensor(start[1], requires_grad=True, dtype=torch.float64)
    
    trajectory = []
    for _ in range(n_steps):
        loss = rosenbrock(x, y)
        trajectory.append((x.item(), y.item(), loss.item()))
        
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
            y -= lr * y.grad
        x.grad.zero_()
        y.grad.zero_()
    
    return trajectory

path = trace_gradient_descent((-1.5, 1.5), lr=0.001, n_steps=2000)
final_x, final_y, final_loss = path[-1]
print(f"Final position: ({final_x:.4f}, {final_y:.4f}), loss: {final_loss:.6f}")
# True minimum at (1, 1) with loss 0

PyTorch Training Step

Python
import torch
import torch.nn as nn

class ClinicalRiskModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ClinicalRiskModel(n_features=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

def gradient_descent_step(
    model: nn.Module,
    X: torch.Tensor,
    y: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
) -> float:
    """One step of gradient descent."""
    # 1. Zero gradients (accumulated from previous step)
    optimizer.zero_grad()
    
    # 2. Forward pass โ€” compute predictions
    logits = model(X).squeeze()
    
    # 3. Compute loss
    loss = criterion(logits, y)
    
    # 4. Backward pass โ€” compute gradients via autograd
    loss.backward()
    
    # 5. Update weights โ€” W โ† W - lr * grad
    optimizer.step()
    
    return loss.item()

# Inspect gradients after backward
X_batch = torch.randn(32, 10)
y_batch = torch.randint(0, 2, (32,)).float()

optimizer.zero_grad()
loss = criterion(model(X_batch).squeeze(), y_batch)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"{name:30s}: grad norm = {grad_norm:.6f}")

Learning Rate Effect

Python
import torch

def compare_learning_rates(
    lrs: list[float],
    n_epochs: int = 100,
) -> dict[float, list[float]]:
    """Compare convergence speed across learning rates."""
    results = {}
    
    for lr in lrs:
        model = nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        criterion = nn.MSELoss()
        
        X = torch.randn(200, 10)
        y = torch.randn(200, 1)
        
        epoch_losses = []
        for _ in range(n_epochs):
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            epoch_losses.append(loss.item())
        
        results[lr] = epoch_losses
    
    return results

# Too small: slow convergence
# Too large: oscillates or diverges (loss increases)
# Just right: steady decrease
lrs = [0.0001, 0.001, 0.01, 0.1, 1.0]
history = compare_learning_rates(lrs)

for lr, losses in history.items():
    status = "diverged" if losses[-1] > losses[0] else f"final={losses[-1]:.4f}"
    print(f"lr={lr:.4f}: {status}")

Convergence Criteria

Python
def train_with_convergence_check(
    model: nn.Module,
    loader,
    optimizer,
    criterion,
    max_epochs: int = 1000,
    min_delta: float = 1e-6,
    patience: int = 20,
) -> list[float]:
    """Stop training when loss stops improving."""
    losses = []
    best_loss = float("inf")
    no_improve_count = 0
    
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        n_batches = 0
        
        for X, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(X).squeeze(), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            n_batches += 1
        
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        
        # Check improvement
        if avg_loss < best_loss - min_delta:
            best_loss = avg_loss
            no_improve_count = 0
        else:
            no_improve_count += 1
        
        if no_improve_count >= patience:
            print(f"Converged at epoch {epoch} (no improvement for {patience} epochs)")
            break
    
    return losses

Interview Answer

"Gradient descent minimises a loss function by iteratively updating weights in the direction opposite to the gradient: W โ† W - ฮฑยทโˆ‡L(W). The gradient points uphill; moving against it descends the loss surface. The learning rate ฮฑ controls step size โ€” too small and convergence is slow, too large and it oscillates or diverges. In PyTorch: zero_grad() clears accumulated gradients, loss.backward() computes them via autograd, optimizer.step() applies the update. For neural networks, the loss landscape is non-convex with saddle points and plateaus โ€” this is why SGD's noise and momentum are essential. Full gradient descent over the entire dataset is computationally impractical for large datasets; mini-batch SGD is used instead. Key insight: the loss landscape's sharp vs flat minima matter for generalisation โ€” flat minima generalise better, and SGD's stochasticity biases toward them."

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:๐•

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.