Deep Learning for AI Interviews · Lesson 31 of 56

Network Capacity and Expressiveness

What Capacity Means

Network capacity = how complex a function the network can represent.

Higher capacity:
  ✓ Can fit more complex patterns in the data
  ✗ Can also memorise noise (overfit)
  ✗ More parameters → more training data needed

Lower capacity:
  ✓ Resistant to memorisation (regularised by architecture)
  ✗ May be unable to capture true patterns (underfit)

Capacity is controlled by:
  - Number of parameters (set by width and depth)
  - Architecture type (MLP vs CNN vs Transformer)
  - Regularisation (Dropout and weight decay reduce effective capacity)
  - Data augmentation (leaves capacity unchanged but enlarges the effective dataset it is matched against)
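
To make the link between width, depth and parameter count concrete, here is a minimal sketch that counts weights and biases in a fully connected network. The helper name `mlp_param_count` is hypothetical and the layer sizes are only illustrative; it assumes plain linear layers with biases and nothing else (no normalisation layers, no embeddings).

```python
def mlp_param_count(n_inputs: int, hidden_dims: list[int], n_outputs: int = 1) -> int:
    """Count weights and biases in a fully connected network (illustrative helper)."""
    dims = [n_inputs, *hidden_dims, n_outputs]
    # Each linear layer has in_d * out_d weights plus out_d biases.
    return sum(in_d * out_d + out_d for in_d, out_d in zip(dims[:-1], dims[1:]))


# Compare a few architectures on 20 input features.
for hidden in ([64, 32], [128, 64, 32], [256, 256, 256]):
    print(f"{str(hidden):>18}: {mlp_param_count(20, hidden):,} parameters")
```

Because the weights between two hidden layers scale with the product of their widths, widening a network tends to add parameters faster than deepening it by one layer of the same width.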

Diagnosing Capacity Problems

Python

import numpy as np

def diagnose_capacity(
    train_losses: list[float],
    val_losses: list[float],
    threshold_underfit: float = 0.1,
    threshold_overfit: float = 0.05,
) -> str:
    """Diagnose underfitting or overfitting from loss curves."""
    final_train = train_losses[-1]
    final_val   = val_losses[-1]
    min_val     = min(val_losses)
    
    # High training loss → underfitting
    if final_train > threshold_underfit:
        return f"UNDERFITTING: train_loss={final_train:.4f} is high. Increase capacity or train longer."
    
    # Val loss much higher than train loss → overfitting
    gap = final_val - final_train
    if gap > threshold_overfit:
        return f"OVERFITTING: gap={gap:.4f} (val - train). Reduce capacity, add regularisation, or get more data."
    
    # Val loss stopped improving → early stopping point
    if final_val > min_val * 1.1:
        return f"TRAINING TOO LONG: best val_loss={min_val:.4f} was at epoch {val_losses.index(min_val)+1}."
    
    return f"WELL-FIT: train={final_train:.4f}, val={final_val:.4f}, gap={gap:.4f}"

# Simulate underfitting scenario
np.random.seed(42)
n_epochs = 50
underfit_train  = [0.8 - 0.005 * i for i in range(n_epochs)]
underfit_val    = [0.82 - 0.004 * i for i in range(n_epochs)]
print(diagnose_capacity(underfit_train, underfit_val))

# Simulate overfitting scenario
overfit_train = [1.0 - 0.02 * i for i in range(n_epochs)]
overfit_val   = [0.8 - 0.01 * i + 0.015 * max(0, i - 20) for i in range(n_epochs)]
print(diagnose_capacity(overfit_train, overfit_val))

Measuring Effective Capacity: VC Dimension

VC dimension (Vapnik-Chervonenkis dimension):
  The size of the largest set of points that the model class can shatter,
  i.e. classify correctly under every possible assignment of binary labels.
  
  Linear classifier in d dimensions (with bias): VC dim = d + 1
  Single hidden layer with h neurons: VC dim grows roughly with the number of weights, about h × d
  Deep ReLU network: VC dim = O(W × L × log W) where W=params, L=depth
  
Practical implication:
  A common rule of thumb asks for roughly 10× the VC dimension in
  training samples. Formal PAC bounds also grow with VC dimension,
  but they are worst-case guarantees rather than a fixed 10× factor.
  
  Deep networks often generalise with far fewer samples than these
  bounds suggest because:
  - SGD has an implicit regularisation effect
  - Architecture encodes useful inductive biases
  - Training reaches only a small part of the function class that the VC bound covers
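
Shattering is easier to see on a tiny case. The sketch below checks, for a linear classifier in 2 dimensions, whether every labelling of a point set can be separated by a line. It uses the classic perceptron update as the separability test, and it treats "no convergence within `max_epochs`" as "not separable", which is a heuristic assumption that happens to be safe for these small integer-valued examples. The function names and the two point sets are hypothetical choices for illustration.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a separating line is found within max_epochs."""
    w1 = w2 = b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A point is misclassified (or on the boundary) if y * score <= 0.
            if y * (w1 * x1 + w2 * x2 + b) <= 0:
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # assumed non-separable (heuristic cut-off)


def shatters(points):
    """True if every +1/-1 labelling of the points is linearly separable."""
    return all(
        linearly_separable(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )


triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print("3 points shattered:", shatters(triangle))
print("4 points shattered:", shatters(square))
```

The three points are shattered, while the four corners of the square are not, because the XOR labelling has no separating line. That matches the d + 1 = 3 figure for a linear classifier in 2 dimensions, although this check only demonstrates one particular 4-point set rather than proving that no such set exists.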

Capacity vs Dataset Size

Python

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

def capacity_experiment(
    n_samples: int,
    n_features: int = 20,
    hidden_dims: list[int] | None = None,
    n_epochs: int = 100,
) -> dict:
    """Train a model and return train/val metrics."""
    hidden_dims = hidden_dims or [64, 32]
    
    # Synthetic dataset with known structure
    torch.manual_seed(42)
    X = torch.randn(n_samples, n_features)
    true_w = torch.randn(n_features)
    y = (X @ true_w + 0.5 * torch.randn(n_samples)).sigmoid().round()
    
    split = int(0.8 * n_samples)
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]
    
    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
    
    # Build model
    dims = [n_features] + hidden_dims + [1]
    layers = []
    for in_d, out_d in zip(dims[:-1], dims[1:]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU()])
    layers = layers[:-1]  # remove last ReLU
    model = nn.Sequential(*layers)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()
    
    for _ in range(n_epochs):
        model.train()
        for Xb, yb in train_loader:
            optimizer.zero_grad()
            # squeeze(-1) drops only the output dimension, so a final batch of size 1 keeps shape (1,)
            criterion(model(Xb).squeeze(-1), yb).backward()
            optimizer.step()
    
    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(X_train).squeeze(-1), y_train).item()
        val_loss   = criterion(model(X_val).squeeze(-1), y_val).item()
    
    n_params = sum(p.numel() for p in model.parameters())
    return {"n_params": n_params, "train_loss": train_loss, "val_loss": val_loss, "gap": val_loss - train_loss}

# Vary dataset size for the same architecture
print(f"{'n_samples':>10} {'n_params':>8} {'train':>8} {'val':>8} {'gap':>8} {'diagnosis':>15}")
for n in [200, 500, 1000, 5000]:
    result = capacity_experiment(n_samples=n, hidden_dims=[128, 64, 32])
    gap = result["gap"]
    diag = "overfit" if gap > 0.05 else "well-fit"
    print(f"{n:>10} {result['n_params']:>8,} {result['train_loss']:>8.4f} {result['val_loss']:>8.4f} {gap:>8.4f} {diag:>15}")

Regularisation as Capacity Control

Python

import torch
import torch.nn as nn

# Same architecture, different effective capacity through regularisation

base_arch = [64, 32]
n_features = 20

def make_model(dropout: float, weight_decay: float) -> tuple:
    dims = [n_features] + base_arch + [1]
    layers = []
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU(), nn.Dropout(dropout)])
    layers.append(nn.Linear(base_arch[-1], 1))
    model = nn.Sequential(*layers)
    
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,
        weight_decay=weight_decay,  # AdamW applies decoupled weight decay, not an L2 term in the loss
    )
    return model, optimizer

# Low regularisation → high effective capacity (risk of overfit)
m_low,  opt_low  = make_model(dropout=0.0, weight_decay=0.0)
# High regularisation → lower effective capacity (more conservative)
m_high, opt_high = make_model(dropout=0.5, weight_decay=1e-3)

print("Low regularisation: large effective capacity")
print("High regularisation: smaller effective capacity (more conservative)")
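
To see why a weight penalty acts as capacity control, it can help to look at a single weight. The sketch below is a toy illustration, not the AdamW update used above: it assumes plain gradient descent on a one-dimensional loss `(w - target)^2 + penalty * w^2`, whose minimiser is `target / (1 + penalty)`. The function name `fit_scalar` and all constants are hypothetical.

```python
def fit_scalar(target: float, penalty: float, lr: float = 0.1, steps: int = 200) -> float:
    """Minimise (w - target)^2 + penalty * w^2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the data term plus gradient of the L2 penalty.
        grad = 2 * (w - target) + 2 * penalty * w
        w -= lr * grad
    return w


for penalty in (0.0, 0.1, 1.0):
    w = fit_scalar(target=3.0, penalty=penalty)
    closed_form = 3.0 / (1 + penalty)
    print(f"penalty={penalty:<4} fitted w={w:.4f}  closed form={closed_form:.4f}")
```

A larger penalty pulls the fitted weight further towards zero, so the model can no longer reach every value the unpenalised model could, which is the sense in which regularisation lowers effective capacity.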

Scaling Rules

When to add capacity:
  1. Gap between train and val loss is small, but both losses are high → underfit
     → Add more layers or wider layers
  
  2. Model performs well on easy examples but struggles with hard ones
     → Increase capacity
  
  3. Dataset is large (>100K samples) and complex
     → Larger models tend to keep improving as the amount of data grows

When to reduce capacity / add regularisation:
  1. Train loss much lower than val loss → overfit
     → Increase Dropout, add weight decay, or reduce architecture size
  
  2. Dataset is small (<10K samples)
     → Prefer smaller models or pre-trained models with few trainable parameters
  
  3. Features have known structure that a simpler model respects
     → Use the right inductive bias (e.g., linear model for linear relationships)
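
The rules above can be sketched as a small helper. This is only an illustration: the function name `capacity_advice` is hypothetical, the loss thresholds (0.1 and 0.05) are assumed values that depend on the loss scale of the task, and the dataset-size cut-offs are the rough 10K and 100K figures from the lists above.

```python
def capacity_advice(
    train_loss: float,
    val_loss: float,
    n_samples: int,
    high_loss: float = 0.1,  # assumed threshold for "loss is high"
    max_gap: float = 0.05,   # assumed threshold for "val much higher than train"
) -> list[str]:
    """Turn the scaling rules into a list of suggestions (illustrative only)."""
    advice = []
    gap = val_loss - train_loss
    if gap > max_gap:
        advice.append("overfit: add Dropout or weight decay, or shrink the architecture")
    elif train_loss > high_loss:
        advice.append("underfit: add layers or widen existing ones")
    if n_samples < 10_000:
        advice.append("small dataset: prefer a smaller or pre-trained model")
    elif n_samples > 100_000:
        advice.append("large dataset: a larger model is worth trying")
    return advice


print(capacity_advice(train_loss=0.60, val_loss=0.62, n_samples=200_000))
print(capacity_advice(train_loss=0.02, val_loss=0.40, n_samples=2_000))
```

The rule about known feature structure is left out on purpose, since choosing the right inductive bias depends on domain knowledge rather than on loss values.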

Interview Answer

"Network capacity refers to the complexity of functions a network can represent, and it is primarily controlled by parameter count, which width and depth determine. To diagnose capacity: if train loss is high, the model is underfitting (increase capacity); if train loss is much lower than val loss, it is overfitting (reduce capacity or add regularisation). Regularisation reduces effective capacity without changing the architecture: Dropout randomly disables neurons during training, and weight decay penalises large weights. Data augmentation helps from the other side by increasing effective dataset size. The key rule is that capacity should be matched to dataset size: a 50M-parameter model trained from scratch on 1,000 samples will tend to memorise, while a 10K-parameter model on 1M samples of a complex task will tend to underfit. In clinical settings with limited labelled data, start conservative (small architecture, strong regularisation) and verify on holdout data before adding capacity. Fine-tuning a pre-trained model is often better than training a large-capacity model from scratch."
