Deep Learning for AI Interviews · Lesson 31 of 56
Network Capacity and Expressiveness
What Capacity Means
Network capacity = how complex a function the network can represent.
Higher capacity:
✓ Can fit more complex patterns in the data
✗ Can also memorise noise (overfit)
✗ More parameters → more training data needed
Lower capacity:
✓ Resistant to memorisation (regularised by architecture)
✗ May be unable to capture true patterns (underfit)
Capacity is controlled by:
- Number of parameters (grows with both width and depth; roughly depth × width² for an MLP)
- Architecture type (MLP vs CNN vs Transformer)
- Regularisation (Dropout, weight decay reduce effective capacity)
- Data augmentation (artificially increases effective dataset size)

Diagnosing Capacity Problems
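Before reading loss curves, it helps to know how many parameters a candidate architecture actually has. The sketch below counts them in closed form for a fully connected network; `mlp_param_count` and the layer sizes are hypothetical, chosen only for illustration.

```python
def mlp_param_count(dims: list[int]) -> int:
    """Weights plus biases of a fully connected network with layer sizes `dims`."""
    return sum(in_d * out_d + out_d for in_d, out_d in zip(dims[:-1], dims[1:]))


# Illustrative architectures (assumed, not taken from the lesson):
base = mlp_param_count([20, 64, 64, 1])            # two hidden layers of width 64
wider = mlp_param_count([20, 128, 128, 1])         # same depth, double the width
deeper = mlp_param_count([20, 64, 64, 64, 64, 1])  # same width, double the hidden layers

print(f"base={base:,}  wider={wider:,}  deeper={deeper:,}")
print(f"wider/base={wider / base:.2f}  deeper/base={deeper / base:.2f}")
```

In this small example, doubling the width adds more parameters than doubling the number of hidden layers, because the hidden-to-hidden weight matrices grow with the square of the width.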
def diagnose_capacity(
    train_losses: list[float],
    val_losses: list[float],
    threshold_underfit: float = 0.1,
    threshold_overfit: float = 0.05,
) -> str:
    """Diagnose underfitting or overfitting from loss curves."""
    final_train = train_losses[-1]
    final_val = val_losses[-1]
    min_val = min(val_losses)

    # High training loss → underfitting
    if final_train > threshold_underfit:
        return f"UNDERFITTING: train_loss={final_train:.4f} is high. Increase capacity or train longer."

    # Val loss much higher than train loss → overfitting
    gap = final_val - final_train
    if gap > threshold_overfit:
        return f"OVERFITTING: gap={gap:.4f} (val - train). Reduce capacity, add regularisation, or get more data."

    # Val loss rose more than 10% above its best value → early stopping point was missed
    if final_val > min_val * 1.1:
        return f"TRAINING TOO LONG: best val_loss={min_val:.4f} was at epoch {val_losses.index(min_val)+1}."

    return f"WELL-FIT: train={final_train:.4f}, val={final_val:.4f}, gap={gap:.4f}"


# Simulate underfitting scenario (the curves are deterministic, so no seed is needed)
n_epochs = 50
underfit_train = [0.8 - 0.005 * i for i in range(n_epochs)]
underfit_val = [0.82 - 0.004 * i for i in range(n_epochs)]
print(diagnose_capacity(underfit_train, underfit_val))

# Simulate overfitting scenario: val loss turns upward after epoch 20
overfit_train = [1.0 - 0.02 * i for i in range(n_epochs)]
overfit_val = [0.8 - 0.01 * i + 0.015 * max(0, i - 20) for i in range(n_epochs)]
print(diagnose_capacity(overfit_train, overfit_val))

Measuring Effective Capacity: VC Dimension
VC dimension (Vapnik-Chervonenkis dimension):
The largest number of points the model can shatter, i.e. there is some set
of that many points for which the model can realise every possible labelling.
Linear classifier in d dimensions: VC dim = d + 1
Single hidden layer with h neurons: VC dim ≈ O(h × d), i.e. on the order of the number of weights
Deep ReLU network: VC dim ≈ O(W × L × log W) where W = number of parameters, L = depth
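Shattering can be checked by brute force in small cases. The sketch below, assuming NumPy is available, shows that a linear classifier in d = 2 dimensions shatters d + 1 = 3 non-collinear points. The points and the helper `shatters` are hypothetical, and the check establishes only the lower bound: it does not show that no set of 4 points can be shattered.

```python
import itertools

import numpy as np

# Three non-collinear points in the plane (d = 2, so d + 1 = 3 points).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Append a constant 1 so the bias is learned as an extra weight.
A = np.hstack([points, np.ones((len(points), 1))])


def shatters(A: np.ndarray) -> bool:
    """Check every labelling for a square, invertible design matrix A.

    Returns True if sign(A @ w) reproduces all 2^n labellings for some w.
    """
    for labels in itertools.product([-1.0, 1.0], repeat=len(A)):
        y = np.array(labels)
        # A is invertible for affinely independent points, so the targets
        # can be hit exactly and the signs must match.
        w = np.linalg.solve(A, y)
        if not np.array_equal(np.sign(A @ w), y):
            return False
    return True


print(shatters(A))
```

All 8 labellings of the three points are realised, which is what "VC dimension at least 3" means for this model class.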
Practical implication:
PAC-style bounds ask for a number of training samples that grows
linearly with the VC dimension; a common rule of thumb is roughly 10×.
But deep networks often generalise with far fewer samples
because:
- SGD has an implicit regularisation effect
- Architecture encodes useful inductive biases
- The functions training actually reaches form a much smaller class than the VC bound assumes

Capacity vs Dataset Size
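To see how far typical datasets sit from the VC rule of thumb, the arithmetic can be run for the 20 → 128 → 64 → 32 → 1 MLP used in this section's experiment. This is an order-of-magnitude sketch only: `vc_rule_of_thumb` is a hypothetical helper, and it uses the rough estimate VC ≈ W × L, ignoring constants and log factors.

```python
def vc_rule_of_thumb(n_params: int, n_layers: int, factor: int = 10) -> tuple[int, int]:
    """Return (rough VC estimate, suggested sample count) assuming VC ~ W * L."""
    vc_estimate = n_params * n_layers
    return vc_estimate, factor * vc_estimate


# Layer sizes of the MLP: 20 input features, hidden layers 128/64/32, 1 output.
dims = [20, 128, 64, 32, 1]
# Weights plus biases of each linear layer.
n_params = sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))
vc, n_needed = vc_rule_of_thumb(n_params, n_layers=len(dims) - 1)

print(f"params={n_params:,}  VC~{vc:,}  samples~{n_needed:,}")
```

The suggested sample count lands in the hundreds of thousands, while the experiment trains on at most a few thousand points. That gap is the reason the VC bound is treated as a loose worst case rather than a planning tool.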
from typing import Optional

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader


def capacity_experiment(
    n_samples: int,
    n_features: int = 20,
    hidden_dims: Optional[list[int]] = None,
    n_epochs: int = 100,
) -> dict:
    """Train a model and return train/val metrics."""
    hidden_dims = hidden_dims or [64, 32]

    # Synthetic dataset with known structure: a noisy linear decision rule
    torch.manual_seed(42)
    X = torch.randn(n_samples, n_features)
    true_w = torch.randn(n_features)
    y = (X @ true_w + 0.5 * torch.randn(n_samples)).sigmoid().round()

    # 80/20 train/validation split
    split = int(0.8 * n_samples)
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]
    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

    # Build model
    dims = [n_features] + hidden_dims + [1]
    layers = []
    for in_d, out_d in zip(dims[:-1], dims[1:]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU()])
    layers = layers[:-1]  # remove last ReLU so the model outputs raw logits
    model = nn.Sequential(*layers)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()

    for _ in range(n_epochs):
        model.train()
        for Xb, yb in train_loader:
            optimizer.zero_grad()
            # squeeze(-1) keeps the batch dimension even for a batch of size 1
            criterion(model(Xb).squeeze(-1), yb).backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(X_train).squeeze(-1), y_train).item()
        val_loss = criterion(model(X_val).squeeze(-1), y_val).item()

    n_params = sum(p.numel() for p in model.parameters())
    return {"n_params": n_params, "train_loss": train_loss, "val_loss": val_loss, "gap": val_loss - train_loss}


# Vary dataset size for the same architecture
print(f"{'n_samples':>10} {'n_params':>8} {'train':>8} {'val':>8} {'gap':>8} {'diagnosis':>15}")
for n in [200, 500, 1000, 5000]:
    result = capacity_experiment(n_samples=n, hidden_dims=[128, 64, 32])
    gap = result["gap"]
    diag = "overfit" if gap > 0.05 else "well-fit"
    print(f"{n:>10} {result['n_params']:>8,} {result['train_loss']:>8.4f} {result['val_loss']:>8.4f} {gap:>8.4f} {diag:>15}")

Regularisation as Capacity Control
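The two regularisers used in this section act in different ways, and both can be observed directly. The following is a minimal sketch, assuming PyTorch is installed; the tensor sizes and hyperparameters are arbitrary illustration values. It shows that `nn.Dropout` is active only in training mode, and that AdamW's weight decay shrinks weights multiplicatively even when the gradient is zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout: zeroes entries in train mode, acts as the identity in eval mode.
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)
drop.train()
train_out = drop(x)  # entries are either 0 or scaled to 1 / (1 - p) = 2
drop.eval()
eval_out = drop(x)   # unchanged
print(f"train mode: fraction zeroed = {(train_out == 0).float().mean().item():.2f}")
print(f"eval mode:  unchanged = {torch.equal(eval_out, x)}")

# Weight decay in AdamW: with a zero gradient, the only change per step is
# the multiplicative shrink w <- w * (1 - lr * weight_decay).
w = nn.Parameter(torch.ones(4))
opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.5)
w.grad = torch.zeros_like(w)
opt.step()
print(w.data)  # each weight shrinks from 1.0 to 1 - 0.1 * 0.5 = 0.95
```

Because dropout is switched off by `model.eval()`, a loss measured in training mode is not directly comparable with a validation loss measured in eval mode.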
import torch
import torch.nn as nn

# Same architecture, different effective capacity through regularisation
base_arch = [64, 32]
n_features = 20


def make_model(dropout: float, weight_decay: float) -> tuple[nn.Sequential, torch.optim.Optimizer]:
    dims = [n_features] + base_arch + [1]
    layers = []
    # Hidden layers: Linear → ReLU → Dropout
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU(), nn.Dropout(dropout)])
    # Output layer: raw logit, no activation or dropout
    layers.append(nn.Linear(base_arch[-1], 1))
    model = nn.Sequential(*layers)
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,
        weight_decay=weight_decay,  # decoupled weight decay (shrinks weights each step, similar in effect to L2)
    )
    return model, optimizer


# Low regularisation → high effective capacity (risk of overfit)
m_low, opt_low = make_model(dropout=0.0, weight_decay=0.0)
# High regularisation → lower effective capacity (more conservative)
m_high, opt_high = make_model(dropout=0.5, weight_decay=1e-3)
print("Low regularisation: large effective capacity")
print("High regularisation: smaller effective capacity (more conservative)")

Scaling Rules
When to add capacity:
1. The gap between train and val loss is small, but both losses are high → underfit
→ Add more layers or wider layers
2. Model performs well on easy examples but struggles with hard ones
→ Increase capacity
3. Dataset is large (>100K samples) and complex
→ Larger models consistently improve with more data
When to reduce capacity / add regularisation:
1. Train loss much lower than val loss → overfit
→ Increase Dropout, add weight decay, or reduce architecture size
2. Dataset is small (<10K samples)
→ Prefer smaller models or pre-trained models with few trainable parameters
3. Features have known structure that a simpler model respects
→ Use the right inductive bias (e.g., linear model for linear relationships)

Interview Answer
"Network capacity refers to the complexity of functions a network can represent, primarily controlled by parameter count (width × depth). Diagnosing capacity: if train loss is high → underfitting (increase capacity); if train loss is much lower than val loss → overfitting (reduce capacity or add regularisation). Regularisation reduces effective capacity without changing architecture: Dropout randomly disables neurons, weight decay penalises large weights (L2), and data augmentation effectively increases dataset size. The key rule: capacity should be matched to dataset size — a 50M-parameter model on 1,000 samples will memorise; a 10K-parameter model on 1M samples will underfit. In clinical settings with limited labelled data, start conservative (small architecture, strong regularisation) and verify on holdout data before adding capacity. Fine-tuning a pre-trained model is often better than training large capacity from scratch."