ResNet and VGG

The VGG design philosophy, why ResNet's skip connections solved the degradation problem, and how these architectures shaped modern deep learning.

Asma Hafeez Khan | May 22, 2026 | 6 min read
Tags: Deep Learning, ResNet, VGG, Skip Connections, Architecture, Interview

VGG: Depth Through Simplicity

VGG (Simonyan & Zisserman, 2014):
  Key insight: use ONLY 3×3 convolutions, stacked deeply.
  
  Why 3×3 only?
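    Two stacked 3×3 convs cover the same 5×5 receptive field as one 5×5 conv
    but use fewer weights per input/output channel pair (2×9 = 18 vs 25) and add an extra non-linearity.
    Three stacked 3×3 convs cover the 7×7 receptive field of one 7×7 conv (3×9 = 27 vs 49 weights).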
  
  Architecture:
    VGG16: 13 conv layers + 3 FC layers = 138M parameters
    VGG19: 16 conv layers + 3 FC layers = 143M parameters
    
  Problems with VGG:
    1. 138M parameters (VGG16): slow to train and large to deploy
    2. FC layers dominate: 25088×4096 + 4096×4096 + 4096×1000 ≈ 124M params, about 90% of VGG16
    3. No skip connections → plain stacks get harder to optimise as depth grows
    4. Accuracy saturates at around 19 layers, so adding depth stops paying off
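
The parameter arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch rather than anything from the VGG code base: the helper names (`stacked_conv_weights`, `stacked_receptive_field`, `vgg16_fc_params`) are illustrative, the channel count of 64 is an arbitrary choice, and the FC sizes assume the standard VGG16 head (a 512×7×7 feature map feeding 4096, 4096, and 1000 units).

```python
def stacked_conv_weights(n_layers: int, kernel: int, channels: int) -> int:
    """Weights in n stacked kernel x kernel convs with `channels` in and out (no bias)."""
    return n_layers * kernel * kernel * channels * channels


def stacked_receptive_field(n_layers: int, kernel: int) -> int:
    """Receptive field of n stacked stride-1 convs: each layer adds (kernel - 1)."""
    return 1 + n_layers * (kernel - 1)


def vgg16_fc_params() -> int:
    """Weights plus biases of the three FC layers (assumed standard VGG16 head)."""
    sizes = [512 * 7 * 7, 4096, 4096, 1000]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Compare stacks of 3x3 convs with the single large kernel they replace.
for n_small, big in [(2, 5), (3, 7)]:
    print(
        f"{n_small} x 3x3: field {stacked_receptive_field(n_small, 3)}, "
        f"weights {stacked_conv_weights(n_small, 3, 64):,} | "
        f"1 x {big}x{big}: field {stacked_receptive_field(1, big)}, "
        f"weights {stacked_conv_weights(1, big, 64):,}"
    )

print(f"VGG16 FC parameters: {vgg16_fc_params():,}")
```

The receptive fields match in both comparisons while the stacked version needs fewer weights, and the FC count shows where most of VGG16's parameters sit.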

The Degradation Problem

In principle, adding layers to a deep network should not hurt training accuracy:
at worst, the extra layers could learn identity mappings and reproduce the shallower network.

Empirically, He et al. found that a 56-layer plain network had HIGHER training error than a 20-layer one on CIFAR-10.
This is not overfitting, because the training error itself was worse. This is degradation.

Likely cause: in a very deep plain network (no residual connections), every signal and gradient
must pass through every layer, and the optimiser struggles to make a stack of non-linear layers
approximate an identity mapping, i.e. H(x) = x.

ResNet's insight: residual learning.
  Instead of learning the target mapping H(x) directly, learn the residual F(x) = H(x) - x
  Then output = F(x) + x
  Learning the identity is now easy: drive the weights of F towards zero, so F(x) = 0 and output = x.
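
A toy, pure-Python sketch of why the residual form makes the identity easy to represent. The scalar-weight "layers" below are hypothetical stand-ins for real conv layers, chosen only so the effect is visible: with every weight at zero, the residual stack returns its input unchanged, while the plain stack collapses to zero.

```python
def relu(v: float) -> float:
    return max(0.0, v)


def plain_layer(x: list[float], w: float) -> list[float]:
    """Plain layer: output = F(x), with the toy choice F(x) = relu(w * x)."""
    return [relu(w * xi) for xi in x]


def residual_layer(x: list[float], w: float) -> list[float]:
    """Residual layer: output = F(x) + x, using the same F."""
    return [relu(w * xi) + xi for xi in x]


def run_stack(layer, x: list[float], weights: list[float]) -> list[float]:
    """Apply one layer per weight, feeding each output into the next layer."""
    for w in weights:
        x = layer(x, w)
    return x


x = [0.5, -1.0, 2.0]
zero_weights = [0.0] * 50  # 50 layers, every weight set to zero

print("plain   :", run_stack(plain_layer, x, zero_weights))
print("residual:", run_stack(residual_layer, x, zero_weights))
```

In the residual stack, "do nothing" is the all-zero solution; in the plain stack, the same behaviour would require each layer to reproduce its input exactly.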

ResNet: Residual Blocks

Python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block for ResNet-18/34."""
    expansion = 1
    
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
    ):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        
        # Shortcut connection: adjust dimensions if stride or channels change
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)   # shortcut path
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        out = self.relu(out + residual)   # skip connection: F(x) + x
        return out

class BottleneckBlock(nn.Module):
    """ResNet bottleneck block for ResNet-50/101/152."""
    expansion = 4
    
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        bottleneck = out_channels // 4
        
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)   # 1×1 compress
        self.bn1   = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride, padding=1, bias=False)  # 3×3
        self.bn2   = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)  # 1×1 expand
        self.bn3   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        ) if stride != 1 or in_channels != out_channels else nn.Identity()
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        
        return self.relu(out + residual)

# Test
x = torch.randn(4, 64, 56, 56)
basic = BasicBlock(64, 64)
bottleneck = BottleneckBlock(64, 256, stride=1)

print(f"BasicBlock: {x.shape} → {basic(x).shape}")
print(f"BottleneckBlock: {x.shape} → {bottleneck(x).shape}")
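
To see why the deeper ResNets use the bottleneck design, it helps to count conv weights. The sketch below is plain arithmetic and not part of the blocks above: `conv_weights` and the two block helpers are illustrative names, biases and BatchNorm parameters are ignored, shortcut projections are left out, and the 256-channel width is only an example.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a bias-free 2D conv."""
    return c_in * c_out * kernel * kernel


def basic_block_weights(channels: int) -> int:
    """Two 3x3 convs at full width, as in the basic block."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_block_weights(channels: int, reduction: int = 4) -> int:
    """1x1 compress, 3x3 at reduced width, 1x1 expand."""
    mid = channels // reduction
    return (
        conv_weights(channels, mid, 1)
        + conv_weights(mid, mid, 3)
        + conv_weights(mid, channels, 1)
    )


channels = 256
basic = basic_block_weights(channels)
bottleneck = bottleneck_block_weights(channels)
print(f"basic block at {channels} channels:      {basic:,} weights")
print(f"bottleneck block at {channels} channels: {bottleneck:,} weights")
print(f"ratio: {basic / bottleneck:.1f}x")
```

At the same output width, the expensive 3×3 conv runs on a quarter of the channels, which is what keeps 50+ layer networks affordable.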

Why Skip Connections Work

Python
import torch
import torch.nn as nn

# Gradient analysis: in a residual network, dL/dx at any layer
# always includes the identity term from the skip connection:
#
# x_{l+1} = x_l + F(x_l, W_l)
# dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l)
#
# The "+1" gives the gradient a direct path regardless of dF/dx_l.
# In contrast, without the skip (x_{l+1} = F(x_l, W_l)): dL/dx_l = dL/dx_{l+1} × dF/dx_l
# If dF/dx_l is small at every layer → vanishing gradient.

def check_residual_gradient(n_blocks: int) -> None:
    """Compare gradient norm at first layer: residual vs plain network."""
    
    class PlainBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        
        def forward(self, x):
            return self.net(x)
    
    class ResBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        
        def forward(self, x):
            return x + self.net(x)   # residual
    
    torch.manual_seed(0)                      # reproducible weights and data
    X = torch.randn(32, 64)                   # same batch for both networks
    y = torch.randint(0, 2, (32,)).float()

    for name, BlockClass in [("Plain", PlainBlock), ("Residual", ResBlock)]:
        blocks = nn.Sequential(*[BlockClass(64) for _ in range(n_blocks)], nn.Linear(64, 1))
        loss = nn.BCEWithLogitsLoss()(blocks(X).squeeze(-1), y)
        loss.backward()
        # The first parameter is the weight of the first Linear layer in the first block
        first_grad = next(blocks.parameters()).grad.norm().item()
        print(f"{name} ({n_blocks} blocks): first-layer grad norm = {first_grad:.4f}")

check_residual_gradient(n_blocks=10)
check_residual_gradient(n_blocks=20)
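
The product structure behind this can also be seen in a scalar toy model. The sketch below assumes every layer has the same local derivative dF/dx (a simplification: real Jacobians are matrices and vary per layer) and multiplies the per-layer factors from the gradient analysis. `backprop_factor` is an illustrative helper, not a PyTorch function.

```python
def backprop_factor(local_grad: float, n_layers: int, residual: bool) -> float:
    """Factor by which dL/dx at the output is scaled on its way back to layer 0.

    Plain:    product of dF/dx over all layers.
    Residual: product of (1 + dF/dx) over all layers.
    """
    factor = 1.0
    for _ in range(n_layers):
        factor *= (1.0 + local_grad) if residual else local_grad
    return factor


for n in (10, 20):
    plain = backprop_factor(0.1, n, residual=False)
    res = backprop_factor(0.1, n, residual=True)
    print(f"{n} layers: plain factor = {plain:.3e}, residual factor = {res:.3f}")
```

With a small local derivative the plain factor shrinks geometrically with depth, while the residual factor stays at or above 1. The residual factor can also grow, so this only illustrates that a small dF/dx no longer shrinks the gradient.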

ResNet vs VGG Comparison

Architecture  | Params  | Top-1 (ImageNet) | Depth | Skip connections
--------------|---------|------------------|-------|------------------
VGG16         | 138M    | 71.5%            | 16    | No
VGG19         | 143M    | 72.4%            | 19    | No
ResNet-18     | 11M     | 69.8%            | 18    | Yes
ResNet-34     | 21M     | 73.3%            | 34    | Yes
ResNet-50     | 25M     | 76.0%            | 50    | Yes (bottleneck)
ResNet-101    | 44M     | 77.4%            | 101   | Yes (bottleneck)
ResNet-152    | 60M     | 78.3%            | 152   | Yes (bottleneck)

Key: ResNet-50 achieves better accuracy than VGG19 with 5.7× fewer parameters.
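
The parameter ratio quoted above follows directly from the table. A small sketch using the rounded table values (in millions of parameters and percent top-1), with `param_ratio` as an illustrative helper:

```python
# Values copied from the comparison table (rounded).
params_millions = {"VGG16": 138, "VGG19": 143, "ResNet-50": 25, "ResNet-152": 60}
top1_percent = {"VGG16": 71.5, "VGG19": 72.4, "ResNet-50": 76.0, "ResNet-152": 78.3}


def param_ratio(bigger: str, smaller: str) -> float:
    """How many times more parameters `bigger` has than `smaller`."""
    return params_millions[bigger] / params_millions[smaller]


ratio = param_ratio("VGG19", "ResNet-50")
gain = top1_percent["ResNet-50"] - top1_percent["VGG19"]
print(f"VGG19 has {ratio:.1f}x the parameters of ResNet-50")
print(f"ResNet-50 top-1 advantage: {gain:.1f} points")
```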

Using Pretrained ResNets

Python
import torchvision.models as models
import torch.nn as nn

# ResNet-50 for 2-class medical classification
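# `pretrained=True` is deprecated since torchvision 0.13; pass a weights enum instead
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)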

# Replace the head
n_features = resnet50.fc.in_features   # 2048
resnet50.fc = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(n_features, 2),
)

# Parameter count
total    = sum(p.numel() for p in resnet50.parameters())
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Total: {total:,}, Trainable: {trainable:,}")

# For fine-tuning: freeze the stem (conv1, bn1) and the first two stages
frozen_prefixes = ("conv1.", "bn1.", "layer1.", "layer2.")
for name, param in resnet50.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False

trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable after freezing early layers: {trainable:,}")

Interview Answer

"VGG popularised stacking only 3×3 convolutions: two 3×3 convs cover the same receptive field as one 5×5 with fewer parameters and an extra non-linearity. However, plain networks like VGG run into the degradation problem: accuracy saturates around 19 layers, and much deeper plain networks show higher training error, because optimising a deep stack of non-linear layers to approximate identity mappings is hard. ResNet (He et al., 2015) solved this with residual connections: output = F(x) + x. The shortcut lets gradients flow directly backward through the addition: dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l). The '+1' provides a gradient highway regardless of what F learns, and learning the identity becomes trivial: drive the weights of F to zero. This made 152-layer networks and beyond trainable. ResNet-50 achieves better ImageNet accuracy than VGG-19 with roughly 5.7× fewer parameters. The residual block became a template for modern architectures: DenseNet, EfficientNet, and Transformers all rely on skip connections."
