Network Capacity and Expressivity
What capacity means, how to measure it, signs of too little or too much capacity, and how to tune architecture size to your dataset.
What Capacity Means
Network capacity = how complex a function the network can represent.
Higher capacity:
→ Can fit more complex patterns in the data
→ Can also memorise noise (overfit)
→ More parameters → more training data needed
Lower capacity:
→ Resistant to memorisation (regularised by architecture)
→ May be unable to capture true patterns (underfit)
Capacity is controlled by:
- Number of parameters (width × depth)
- Architecture type (MLP vs CNN vs Transformer)
- Regularisation (Dropout, weight decay reduce effective capacity)
- Data augmentation (artificially increases effective dataset size)
Diagnosing Capacity Problems
import torch
import torch.nn as nn
import numpy as np
def diagnose_capacity(
    train_losses: list[float],
    val_losses: list[float],
    threshold_underfit: float = 0.1,
    threshold_overfit: float = 0.05,
) -> str:
    """Diagnose underfitting or overfitting from loss curves."""
    final_train = train_losses[-1]
    final_val = val_losses[-1]
    min_val = min(val_losses)
    # High training loss → underfitting
    if final_train > threshold_underfit:
        return f"UNDERFITTING: train_loss={final_train:.4f} is high. Increase capacity or train longer."
    # Val loss much higher than train loss → overfitting
    gap = final_val - final_train
    if gap > threshold_overfit:
        return f"OVERFITTING: gap={gap:.4f} (val - train). Reduce capacity, add regularisation, or get more data."
    # Val loss stopped improving → early stopping point
    if final_val > min_val * 1.1:
        return f"TRAINING TOO LONG: best val_loss={min_val:.4f} was at epoch {val_losses.index(min_val)+1}."
    return f"WELL-FIT: train={final_train:.4f}, val={final_val:.4f}, gap={gap:.4f}"
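The third check in diagnose_capacity flags a run that trained past its best validation loss. One common way to act on that signal during training is patience-based early stopping. The following is a minimal sketch and not part of the original code: `early_stopping_epoch`, `patience`, and `min_delta` are illustrative names, and the validation curve is the same synthetic overfitting curve this section simulates.

```python
def early_stopping_epoch(
    val_losses: list[float],
    patience: int = 5,
    min_delta: float = 0.0,
) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience window
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here
            return best_epoch, epoch
    # Ran out of epochs before patience was exhausted
    return best_epoch, len(val_losses)


# Synthetic curve: validation loss falls for 21 epochs, then rises again
val_curve = [0.8 - 0.01 * i + 0.015 * max(0, i - 20) for i in range(50)]
best, stop = early_stopping_epoch(val_curve, patience=5)
print(f"best epoch={best}, would stop at epoch={stop}")
```

In a real training loop the model weights from the best epoch would be checkpointed and restored at the stop point; that part is omitted here.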
# Simulate underfitting scenario
np.random.seed(42)
n_epochs = 50
underfit_train = [0.8 - 0.005 * i for i in range(n_epochs)]
underfit_val = [0.82 - 0.004 * i for i in range(n_epochs)]
print(diagnose_capacity(underfit_train, underfit_val))
# Simulate overfitting scenario
overfit_train = [1.0 - 0.02 * i for i in range(n_epochs)]
overfit_val = [0.8 - 0.01 * i + 0.015 * max(0, i - 20) for i in range(n_epochs)]
print(diagnose_capacity(overfit_train, overfit_val))
Measuring Effective Capacity: VC Dimension
VC dimension (Vapnik-Chervonenkis dimension):
The size of the largest set of points that the model can shatter (fit
correctly under every possible assignment of labels).
Linear classifier in d dimensions: VC dim = d + 1
Single hidden layer with h neurons and d inputs: VC dim ≈ O(h × d)
Deep network: VC dim ≈ O(W × L), up to log factors, where W = number of parameters and L = depth
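To get a feel for the magnitudes involved, the scalings above can be evaluated for a concrete fully connected network. This is a rough sketch only: `mlp_param_count` and `rough_vc_scalings` are hypothetical helper names, and the results are order-of-magnitude figures with constants and log factors dropped, not exact VC dimensions.

```python
def mlp_param_count(dims: list[int]) -> int:
    """Weights plus biases of a fully connected network with layer sizes `dims`."""
    return sum(in_d * out_d + out_d for in_d, out_d in zip(dims[:-1], dims[1:]))


def rough_vc_scalings(n_features: int, hidden_dims: list[int]) -> dict:
    """Evaluate the d + 1 and W x L scalings for an MLP with one output unit."""
    dims = [n_features] + hidden_dims + [1]
    n_params = mlp_param_count(dims)  # W
    n_layers = len(dims) - 1          # L, counted as the number of linear layers
    return {
        "linear_classifier": n_features + 1,  # d + 1
        "n_params": n_params,
        "n_layers": n_layers,
        "deep_network_order": n_params * n_layers,  # W x L, log factors ignored
    }


for hidden in ([64, 32], [128, 64, 32]):
    print(hidden, rough_vc_scalings(n_features=20, hidden_dims=hidden))
```

Even the smaller of these two networks has a W × L figure in the thousands, which is why the sample-size question that follows matters.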
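Practical implication:
A common rule of thumb is to have roughly 10× your VC dimension in
training samples. It is motivated by PAC learning bounds, in which the
number of samples needed grows with the VC dimension; it is a heuristic rather than a strict guarantee.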
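The link between sample size and VC dimension can be made concrete with the classical VC generalisation bound for classification error: with probability at least 1 − δ, test error ≤ training error + sqrt((D (ln(2N / D) + 1) + ln(4 / δ)) / N), for N samples and VC dimension D with D much smaller than N. The sketch below evaluates that gap term at a few sample-to-VC ratios; `vc_generalisation_gap` is a hypothetical helper name and the value D = 100 is chosen purely for illustration.

```python
import math


def vc_generalisation_gap(n_samples: int, vc_dim: int, delta: float = 0.05) -> float:
    """Upper bound on (test error - training error) from the classical VC bound.

    Holds with probability at least 1 - delta and is only meaningful
    when n_samples is much larger than vc_dim.
    """
    if n_samples <= vc_dim:
        raise ValueError("the bound assumes n_samples > vc_dim")
    complexity = vc_dim * (math.log(2 * n_samples / vc_dim) + 1)
    confidence = math.log(4 / delta)
    return math.sqrt((complexity + confidence) / n_samples)


vc_dim = 100  # illustrative value, not taken from any particular model
for ratio in (10, 100, 1000):
    n = ratio * vc_dim
    print(f"N = {ratio:>4} x D: gap bound = {vc_generalisation_gap(n, vc_dim):.3f}")
```

The bound tightens only slowly as the ratio grows and is quite loose at small ratios, which is part of why worst-case bounds like this one are a poor predictor of how networks behave in practice.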
But deep networks often generalise with far fewer samples
because:
- SGD has an implicit regularisation effect
- Architecture encodes useful inductive biases
- The actual function class is much smaller than the VC bound suggests
Capacity vs Dataset Size
from typing import Optional

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader


def capacity_experiment(
    n_samples: int,
    n_features: int = 20,
    hidden_dims: Optional[list[int]] = None,
    n_epochs: int = 100,
) -> dict:
    """Train a model and return train/val metrics."""
    hidden_dims = hidden_dims or [64, 32]
    # Synthetic dataset with known structure: noisy linear signal, binarised
    torch.manual_seed(42)
    X = torch.randn(n_samples, n_features)
    true_w = torch.randn(n_features)
    y = (X @ true_w + 0.5 * torch.randn(n_samples)).sigmoid().round()
    # 80/20 train/validation split
    split = int(0.8 * n_samples)
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]
    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
    # Build model
    dims = [n_features] + hidden_dims + [1]
    layers = []
    for in_d, out_d in zip(dims[:-1], dims[1:]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU()])
    layers = layers[:-1]  # remove last ReLU so the model outputs raw logits
    model = nn.Sequential(*layers)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(n_epochs):
        model.train()
        for Xb, yb in train_loader:
            optimizer.zero_grad()
            # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its shape
            criterion(model(Xb).squeeze(-1), yb).backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(X_train).squeeze(-1), y_train).item()
        val_loss = criterion(model(X_val).squeeze(-1), y_val).item()
    n_params = sum(p.numel() for p in model.parameters())
    return {"n_params": n_params, "train_loss": train_loss, "val_loss": val_loss, "gap": val_loss - train_loss}


# Vary dataset size for the same architecture
print(f"{'n_samples':>10} {'n_params':>8} {'train':>8} {'val':>8} {'gap':>8} {'diagnosis':>15}")
for n in [200, 500, 1000, 5000]:
    result = capacity_experiment(n_samples=n, hidden_dims=[128, 64, 32])
    gap = result["gap"]
    diag = "overfit" if gap > 0.05 else "well-fit"
    print(f"{n:>10} {result['n_params']:>8,} {result['train_loss']:>8.4f} {result['val_loss']:>8.4f} {gap:>8.4f} {diag:>15}")
Regularisation as Capacity Control
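Dropout and weight decay are the two regularisers this section uses to shrink effective capacity. Before the PyTorch version, the plain-Python sketch below spells out what each one does to a layer's activations or to a single weight. The function names are illustrative and not from any library. The dropout shown is the "inverted" form, which rescales surviving activations by 1 / (1 - p) during training, and the weight-decay step assumes plain SGD, where an L2 penalty and weight decay coincide.

```python
import random


def inverted_dropout(
    activations: list[float],
    p: float,
    training: bool,
    rng: random.Random,
) -> list[float]:
    """Zero each activation with probability p while training; identity otherwise."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Survivors are scaled up so the expected activation is unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def sgd_step_with_weight_decay(w: float, grad: float, lr: float, weight_decay: float) -> float:
    """One SGD step where the L2 penalty contributes weight_decay * w to the gradient."""
    return w - lr * (grad + weight_decay * w)


rng = random.Random(0)
acts = [1.0] * 8
train_out = inverted_dropout(acts, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(acts, p=0.5, training=False, rng=rng)
print("train mode:", train_out)  # each entry is either 0.0 or 2.0
print("eval mode: ", eval_out)   # unchanged

# With a zero data gradient, weight decay alone pulls the weight towards zero
print(sgd_step_with_weight_decay(w=1.0, grad=0.0, lr=0.1, weight_decay=0.01))
```

With adaptive optimisers such as AdamW the decay is applied separately from the gradient-based update, so it is no longer identical to adding an L2 penalty to the loss.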
import torch
import torch.nn as nn

# Same architecture, different effective capacity through regularisation
base_arch = [64, 32]
n_features = 20


def make_model(dropout: float, weight_decay: float) -> tuple[nn.Sequential, torch.optim.AdamW]:
    """Build the MLP and an AdamW optimiser with the given regularisation strength."""
    dims = [n_features] + base_arch + [1]
    layers = []
    # Hidden layers: Linear -> ReLU -> Dropout
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([nn.Linear(in_d, out_d), nn.ReLU(), nn.Dropout(dropout)])
    # Output layer: a single logit, no activation or dropout
    layers.append(nn.Linear(base_arch[-1], 1))
    model = nn.Sequential(*layers)
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,
        weight_decay=weight_decay,  # decoupled weight decay; similar in effect to L2 regularisation
    )
    return model, optimizer


# Low regularisation → high effective capacity (risk of overfit)
m_low, opt_low = make_model(dropout=0.0, weight_decay=0.0)
# High regularisation → lower effective capacity (more conservative)
m_high, opt_high = make_model(dropout=0.5, weight_decay=1e-3)
print("Low regularisation: large effective capacity")
print("High regularisation: smaller effective capacity (more conservative)")
Scaling Rules
When to add capacity:
1. Train/val gap is small, but both losses are high → underfit
→ Add more layers or wider layers
2. Model performs well on easy examples but struggles with hard ones
→ Increase capacity
3. Dataset is large (>100K samples) and complex
→ Larger models tend to keep improving with more data
When to reduce capacity / add regularisation:
1. Train loss much lower than val loss → overfit
→ Increase Dropout, add weight decay, or reduce architecture size
2. Dataset is small (<10K samples)
→ Prefer smaller models or pre-trained models with few trainable parameters
3. Features have known structure that a simpler model respects
→ Use the right inductive bias (e.g., linear model for linear relationships)
Interview Answer
"Network capacity refers to the complexity of functions a network can represent, primarily controlled by parameter count (width × depth). Diagnosing capacity: if train loss is high → underfitting (increase capacity); if train loss is much lower than val loss → overfitting (reduce capacity or add regularisation). Regularisation reduces effective capacity without changing architecture: Dropout randomly disables neurons, weight decay penalises large weights (L2), and data augmentation effectively increases dataset size. The key rule is that capacity should be matched to dataset size: a 50M-parameter model on 1,000 samples will likely memorise; a 10K-parameter model on 1M samples will likely underfit. In clinical settings with limited labelled data, start conservative (small architecture, strong regularisation) and verify on holdout data before adding capacity. Fine-tuning a pre-trained model is often better than training a large-capacity model from scratch."