ResNet and VGG
The VGG design philosophy, why ResNet's skip connections solved the degradation problem, and how these architectures shaped modern deep learning.
VGG: Depth Through Simplicity
VGG (Simonyan & Zisserman, 2014):
Key insight: use ONLY 3×3 convolutions, stacked deeply.
Why 3×3 only?
Two 3×3 convs have the same receptive field as one 5×5 conv
but fewer parameters (2×9 = 18 vs 25) and more non-linearity.
Three 3×3 convs ≈ one 7×7 conv.
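The receptive-field and parameter argument above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming stride-1 convolutions with equal input and output channel counts and no biases; the helper names and the channel width of 64 are chosen only for illustration.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stride-1 convs with the same kernel size."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (no biases) for `num_layers` convs with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


channels = 64  # hypothetical channel width, chosen only for illustration
for n, big in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(n)
    small_params = conv_weights(3, channels, n)
    big_params = conv_weights(big, channels)
    print(f"{n} x 3x3: receptive field {rf}x{rf}, {small_params:,} weights "
          f"vs one {big}x{big}: {big_params:,} weights")
```

With one channel in and out, the counts reduce to the 18 vs 25 and 27 vs 49 figures; the saving scales with the square of the channel width.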
Architecture:
VGG16: 13 conv layers + 3 FC layers = 138M parameters
VGG19: 16 conv layers + 3 FC layers = 143M parameters
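The VGG16 total can be reproduced by counting layer by layer. The sketch below assumes the standard 16-layer configuration, a 224×224 RGB input, and 1000 output classes; `VGG16_CFG` and `vgg16_param_counts` are illustrative names, not part of any library.

```python
# VGG16 conv configuration: numbers are output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_param_counts(num_classes: int = 1000, input_size: int = 224):
    """Return (conv_params, fc_params) for VGG16, counting weights and biases."""
    conv_params, in_ch, size = 0, 3, input_size
    for item in VGG16_CFG:
        if item == "M":
            size //= 2  # each max-pool halves the spatial resolution
        else:
            conv_params += 3 * 3 * in_ch * item + item  # 3x3 weights + bias
            in_ch = item
    flat = in_ch * size * size  # 512 * 7 * 7 = 25088 for a 224x224 input
    fc_params = 0
    for fan_in, fan_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        fc_params += fan_in * fan_out + fan_out
    return conv_params, fc_params


conv_params, fc_params = vgg16_param_counts()
total = conv_params + fc_params
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {total:,}")
print(f"FC share of parameters: {fc_params / total:.1%}")
```

The count shows that the 13 convolutional layers hold only a small fraction of the weights; most of the model sits in the three fully connected layers, the first of which takes the flattened 7×7×512 feature map as input.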
Problems with VGG:
1. 138M parameters — very slow to train and large to deploy
2. FC layers are massive (25088×4096 + 4096×4096 + 4096×1000 ≈ 124M params)
3. No skip connections → cannot go much deeper without degradation
4. Accuracy degrades with more layers beyond ~19
The Degradation Problem
Adding more layers to a deep network should not hurt training accuracy.
At worst, extra layers could learn identity mappings.
Empirically (He et al., 2015, on CIFAR-10): a 56-layer plain network had HIGHER training error than a 20-layer one.
This is not overfitting — training error was worse. This is degradation.
Cause: in a very deep network without residual connections, the signal and its gradient
must pass through every layer. The optimiser struggles to make the extra layers behave
as identity mappings (fitting H(x) = x with a stack of non-linear layers is surprisingly hard).
ResNet's insight: residual learning.
Instead of learning H(x) directly, learn F(x) = H(x) - x
Then output = F(x) + x
Learning F(x) = 0 (identity) is easy: just push the residual branch's weights towards zero.
ResNet: Residual Blocks
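The claim that a residual unit reaches the identity mapping simply by zeroing its weights can be checked directly. This is a minimal sketch, not a real ResNet block: `TinyResidual` is a hypothetical fully connected unit computing x + F(x) with no ReLU after the addition. Real ResNet blocks apply a ReLU after the sum, so the identity would be exact only for non-negative inputs.

```python
import torch
import torch.nn as nn


class TinyResidual(nn.Module):
    """Hypothetical minimal residual unit: output = x + F(x), no final ReLU."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)


torch.manual_seed(0)
block = TinyResidual(16)
plain = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# Zero every weight and bias in both networks.
with torch.no_grad():
    for p in list(block.parameters()) + list(plain.parameters()):
        p.zero_()

x = torch.randn(4, 16)
print("residual output equals input:", torch.equal(block(x), x))
print("plain output equals input:   ", torch.equal(plain(x), x))
```

With all weights at zero the residual unit passes its input through unchanged, while the plain stack collapses to zeros. To act as an identity, the plain stack would have to find a specific non-trivial weight setting.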
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """ResNet basic block for ResNet-18/34."""
    expansion = 1

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
    ):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Shortcut connection: adjust dimensions if stride or channels change
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)  # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.relu(out + residual)  # skip connection: F(x) + x
        return out


class BottleneckBlock(nn.Module):
    """ResNet bottleneck block for ResNet-50/101/152."""
    expansion = 4

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        bottleneck = out_channels // 4
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)  # 1×1 compress
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride, padding=1, bias=False)  # 3×3
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)  # 1×1 expand
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        ) if stride != 1 or in_channels != out_channels else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + residual)


# Test
x = torch.randn(4, 64, 56, 56)
basic = BasicBlock(64, 64)
bottleneck = BottleneckBlock(64, 256, stride=1)
print(f"BasicBlock: {x.shape} → {basic(x).shape}")
print(f"BottleneckBlock: {x.shape} → {bottleneck(x).shape}")
Why Skip Connections Work
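With a skip connection, each block contributes a factor of (1 + dF/dx) to the backward pass instead of dF/dx alone. The toy check below makes this concrete with autograd. It is a sketch under strong assumptions: every block is a scalar map F(x) = w·x with the same hypothetical weight `w`, and there are no nonlinearities or normalisation layers.

```python
import torch


def input_gradient(n_blocks: int, w: float, residual: bool) -> float:
    """d(output)/d(input) for a chain of scalar blocks with F(x) = w * x."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(n_blocks):
        f = w * h                        # toy residual branch, dF/dh = w
        h = h + f if residual else f     # with or without the skip connection
    h.backward()
    return x.grad.item()


w = 0.1  # hypothetical small local derivative dF/dx
plain_grad = input_gradient(10, w, residual=False)     # analytically w ** 10
residual_grad = input_gradient(10, w, residual=True)   # analytically (1 + w) ** 10
print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3e}")
```

In the plain chain the gradient is a product of ten small factors and shrinks towards zero. In the residual chain each factor is at least the identity term, so the gradient reaching the input stays of order one.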
import torch
import torch.nn as nn

# Gradient analysis: in a residual network, dL/dx at any layer
# always includes the identity term from the skip connection:
#
# x_{l+1} = x_l + F(x_l, W_l)
# dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l)
#
# The "+1" ensures gradient has a direct path regardless of dF/dx_l.
# In contrast, without skip: dL/dx_l = dL/dx_{l+1} × dF/dx_l
# If dF/dx_l is small → vanishing gradient.


def check_residual_gradient(n_blocks: int) -> None:
    """Compare gradient norm at first layer: residual vs plain network."""

    class PlainBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.net(x)

    class ResBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.net(x)  # residual

    for name, BlockClass in [("Plain", PlainBlock), ("Residual", ResBlock)]:
        blocks = nn.Sequential(*[BlockClass(64) for _ in range(n_blocks)], nn.Linear(64, 1))
        X = torch.randn(32, 64)
        y = torch.randint(0, 2, (32,)).float()
        nn.BCEWithLogitsLoss()(blocks(X).squeeze(), y).backward()
        first_grad = list(blocks.parameters())[0].grad.norm().item()
        print(f"{name} ({n_blocks} blocks): first-layer grad norm = {first_grad:.4f}")


check_residual_gradient(n_blocks=10)
check_residual_gradient(n_blocks=20)
ResNet vs VGG Comparison
Architecture | Params | Top-1 (ImageNet) | Depth | Skip connections
--------------|---------|------------------|-------|------------------
VGG16 | 138M | 71.5% | 16 | No
VGG19 | 143M | 72.4% | 19 | No
ResNet-18 | 11M | 69.8% | 18 | Yes
ResNet-34 | 21M | 73.3% | 34 | Yes
ResNet-50 | 25M | 76.0% | 50 | Yes (bottleneck)
ResNet-101 | 44M | 77.4% | 101 | Yes (bottleneck)
ResNet-152 | 60M | 78.3% | 152 | Yes (bottleneck)
Key: ResNet-50 achieves better accuracy than VGG19 with 5.7× fewer parameters.
Using Pretrained ResNets
import torchvision.models as models
import torch.nn as nn

# ResNet-50 for 2-class medical classification.
# `pretrained=True` is deprecated since torchvision 0.13; the `weights` enum
# below selects the same ImageNet weights that `pretrained=True` used to load.
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the head
n_features = resnet50.fc.in_features  # 2048
resnet50.fc = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(n_features, 2),
)

# Parameter count
total = sum(p.numel() for p in resnet50.parameters())
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Total: {total:,}, Trainable: {trainable:,}")

# For fine-tuning: freeze early layers (the stem plus the first two stages).
# startswith avoids accidentally matching e.g. "layer3.0.conv1".
for name, param in resnet50.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        param.requires_grad = False

trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable after freezing early layers: {trainable:,}")
Interview Answer
"VGG popularised using only 3×3 convolutions stacked deeply: two 3×3 convs have the same receptive field as one 5×5 with fewer parameters and more non-linearity. However, plain networks like VGG suffer from degradation: performance stops improving (or gets worse) beyond ~19 layers, because very deep plain stacks are hard to optimise, even when the extra layers only need to learn identity mappings. ResNet (He et al., 2015) solved this with residual connections: output = F(x) + x. The shortcut allows gradients to flow directly backward through the addition: dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l). The '+1' provides a gradient highway regardless of what F learns. Learning F(x) = 0 (identity) is trivial: just push the residual branch's weights to zero. This allowed training of 152-layer networks and beyond. ResNet-50 achieves better ImageNet accuracy than VGG-19 with 5.7× fewer parameters. ResNet's residual block design became the template for modern architectures: DenseNet, EfficientNet, and Transformers all build on this principle."