Learnixo

ResNet, VGG, and Skip Connections

VGG: Depth Through Simplicity

VGG (Simonyan & Zisserman, 2014):
  Key insight: use ONLY 3×3 convolutions, stacked deeply.
  
  Why 3×3 only?
    Two stacked 3×3 convs have the same receptive field as one 5×5 conv
    but fewer weights per channel pair (2×9 = 18 vs 25) and an extra non-linearity.
    Three stacked 3×3 convs cover the same receptive field as one 7×7 conv (27 vs 49 weights).
  
  Architecture:
    VGG16: 13 conv layers + 3 FC layers = 138M parameters
    VGG19: 16 conv layers + 3 FC layers = 143M parameters
    
  Problems with VGG:
    1. 138M parameters: slow to train and large to deploy
    2. FC layers dominate (25088×4096 + 4096×4096 + 4096×1000 ≈ 124M params,
       about 90% of VGG16)
    3. No skip connections → plain stacks get harder to optimise as depth grows
    4. Accuracy saturates around 16-19 layers; the VGG authors saw no gain beyond 19
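
The parameter arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch: the channel width `C = 256` and the helper names are illustrative assumptions, biases are ignored for the conv comparison, and the FC shapes assume the standard 224×224 ImageNet input (a 512×7×7 feature map before the first FC layer).

```python
def conv_params(kernel: int, channels: int) -> int:
    """Weights in one kernel x kernel conv with `channels` in and out (bias ignored)."""
    return kernel * kernel * channels * channels


def stacked_receptive_field(n_layers: int, kernel: int = 3) -> int:
    """Receptive field of n stacked stride-1 convs that share one kernel size."""
    return 1 + n_layers * (kernel - 1)


C = 256  # hypothetical channel width, chosen only for illustration

for n_stacked, big_kernel in [(2, 5), (3, 7)]:
    stacked = n_stacked * conv_params(3, C)
    single = conv_params(big_kernel, C)
    rf = stacked_receptive_field(n_stacked)
    print(
        f"{n_stacked} x (3x3): receptive field {rf}x{rf}, {stacked:,} weights | "
        f"one {big_kernel}x{big_kernel}: {single:,} weights"
    )

# VGG16 classifier head: (inputs, outputs) of the three FC layers
fc_shapes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = [n_in * n_out + n_out for n_in, n_out in fc_shapes]  # weights + biases
print(f"VGG16 FC parameters: {sum(fc_params):,}")
```

Most of the FC cost sits in the first layer, which flattens the 512×7×7 feature map into 4096 units.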

The Degradation Problem

Adding more layers to a deep network should not hurt training accuracy.
At worst, extra layers could learn identity mappings.

Empirically (He et al., 2015, on CIFAR-10): a 56-layer plain network had HIGHER training error than a 20-layer one.
This is not overfitting: the training error itself was worse. This is degradation.

Cause: this is an optimisation difficulty, not a lack of capacity. In a very deep
network without residual connections, signal and gradient must pass through every
layer, and the optimiser struggles to make a stack of non-linear layers
approximate the identity (output = x), even though such a solution exists.

ResNet's insight: residual learning.
  Instead of learning H(x) directly, learn F(x) = H(x) - x
  Then output = F(x) + x
  Learning F(x) = 0 (identity) is easy: just push the residual branch's weights toward zero.
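
A small PyTorch sketch of that last point. `TinyResidual` is a hypothetical toy block (linear layers instead of convolutions, no batch norm), used only to show that zeroing the residual branch leaves an exact identity mapping, while zeroing the same layer in a plain block wipes out the input.

```python
import torch
import torch.nn as nn


class TinyResidual(nn.Module):
    """Hypothetical minimal residual block: output = F(x) + x."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + x


def zero_last_linear(branch: nn.Sequential) -> None:
    """Set the final linear layer to zero so the branch outputs 0 for any input."""
    with torch.no_grad():
        branch[-1].weight.zero_()
        branch[-1].bias.zero_()


torch.manual_seed(0)
x = torch.randn(8, 16)

# Residual block: F(x) = 0, so the block passes x through unchanged.
res_block = TinyResidual(16)
zero_last_linear(res_block.f)
is_identity = torch.allclose(res_block(x), x)

# Plain block with the same zeroed layer: the output is all zeros.
plain_block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
zero_last_linear(plain_block)
plain_out = plain_block(x)

print(f"Residual block is identity: {is_identity}")
print(f"Plain block output is all zeros: {bool((plain_out == 0).all())}")
```

For the plain block, reaching the identity would instead require the weights to reproduce x through the ReLU, which is a much more specific target than "all zeros".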

ResNet: Residual Blocks

Python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block for ResNet-18/34."""
    expansion = 1
    
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
    ):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        
        # Shortcut connection: adjust dimensions if stride or channels change
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)   # shortcut path
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        out = self.relu(out + residual)   # skip connection: F(x) + x
        return out

class BottleneckBlock(nn.Module):
    """ResNet bottleneck block for ResNet-50/101/152."""
    expansion = 4
    
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        bottleneck = out_channels // 4
        
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)   # 1×1 compress
        self.bn1   = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride, padding=1, bias=False)  # 3×3
        self.bn2   = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)  # 1×1 expand
        self.bn3   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        ) if stride != 1 or in_channels != out_channels else nn.Identity()
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.shortcut(x)
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        
        return self.relu(out + residual)

# Test
x = torch.randn(4, 64, 56, 56)
basic = BasicBlock(64, 64)
bottleneck = BottleneckBlock(64, 256, stride=1)

print(f"BasicBlock: {x.shape} → {basic(x).shape}")
print(f"BottleneckBlock: {x.shape} → {bottleneck(x).shape}")
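
Why the bottleneck design is used for the deeper variants can be seen from a weight count. The sketch below is a rough comparison under stated assumptions: both blocks take and return 256 channels, batch-norm parameters and shortcut projections are ignored, and the helper names are hypothetical.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a bias-free conv layer."""
    return c_in * c_out * kernel * kernel


def basic_block_weights(channels: int) -> int:
    """Two 3x3 convs at full width, as in BasicBlock."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_weights(channels: int, reduction: int = 4) -> int:
    """1x1 compress, 3x3 at reduced width, 1x1 expand, as in BottleneckBlock."""
    mid = channels // reduction
    return (
        conv_weights(channels, mid, 1)
        + conv_weights(mid, mid, 3)
        + conv_weights(mid, channels, 1)
    )


width = 256  # illustrative block width
basic_count = basic_block_weights(width)
bottleneck_count = bottleneck_weights(width)

print(f"Two 3x3 convs at {width} channels: {basic_count:,} weights")
print(f"Bottleneck at {width} channels:    {bottleneck_count:,} weights")
print(f"Ratio: {basic_count / bottleneck_count:.1f}x")
```

The 3×3 convolution is the expensive part, so running it at a quarter of the width is what makes 50+ layer networks affordable.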

Why Skip Connections Work

Python
import torch
import torch.nn as nn

# Gradient analysis: in a residual network, dL/dx at any layer
# always includes the identity term from the skip connection:
#
# x_{l+1} = x_l + F(x_l, W_l)
# dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l)
#
# The "+1" ensures gradient has a direct path regardless of dF/dx_l.
# In contrast, without skip: dL/dx_l = dL/dx_{l+1} × dF/dx_l
# If dF/dx_l is small → vanishing gradient.

def check_residual_gradient(n_blocks: int) -> None:
    """Compare gradient norm at first layer: residual vs plain network."""
    
    class PlainBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        
        def forward(self, x):
            return self.net(x)
    
    class ResBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        
        def forward(self, x):
            return x + self.net(x)   # residual
    
    for name, BlockClass in [("Plain", PlainBlock), ("Residual", ResBlock)]:
        blocks = nn.Sequential(*[BlockClass(64) for _ in range(n_blocks)], nn.Linear(64, 1))
        X = torch.randn(32, 64)
        y = torch.randint(0, 2, (32,)).float()
        nn.BCEWithLogitsLoss()(blocks(X).squeeze(), y).backward()
        first_grad = list(blocks.parameters())[0].grad.norm().item()
        print(f"{name} ({n_blocks} blocks): first-layer grad norm = {first_grad:.4f}")

check_residual_gradient(n_blocks=10)
check_residual_gradient(n_blocks=20)
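
The chain-rule argument in the comments above can also be illustrated with a scalar toy calculation. This is a deliberate simplification: real layers have Jacobian matrices rather than a single number, and the value 0.1 for every dF/dx_l is an assumption picked only to make the effect visible.

```python
def backprop_factor(local_derivs: list[float], residual: bool) -> float:
    """Multiply per-layer factors from the last layer back to the first.

    Plain network:    each layer contributes dF/dx_l.
    Residual network: each layer contributes (1 + dF/dx_l).
    """
    factor = 1.0
    for d in local_derivs:
        factor *= (1.0 + d) if residual else d
    return factor


depth = 20
local_derivs = [0.1] * depth  # assumed small dF/dx_l at every layer

plain_factor = backprop_factor(local_derivs, residual=False)
residual_factor = backprop_factor(local_derivs, residual=True)

print(f"Plain, {depth} layers:    gradient scaled by {plain_factor:.3e}")
print(f"Residual, {depth} layers: gradient scaled by {residual_factor:.3e}")
```

With small local derivatives the plain product collapses toward zero, while the residual product stays at or above 1 because every factor contains the identity term.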

ResNet vs VGG Comparison

Architecture  | Params  | Top-1 (ImageNet) | Depth | Skip connections
--------------|---------|------------------|-------|------------------
VGG16         | 138M    | 71.5%            | 16    | No
VGG19         | 143M    | 72.4%            | 19    | No
ResNet-18     | 11M     | 69.8%            | 18    | Yes
ResNet-34     | 21M     | 73.3%            | 34    | Yes
ResNet-50     | 25M     | 76.0%            | 50    | Yes (bottleneck)
ResNet-101    | 44M     | 77.4%            | 101   | Yes (bottleneck)
ResNet-152    | 60M     | 78.3%            | 152   | Yes (bottleneck)

Key: ResNet-50 achieves better accuracy than VGG19 with 5.7× fewer parameters.

Using Pretrained ResNets

Python
import torchvision.models as models
import torch.nn as nn

# ResNet-50 for 2-class medical classification
# (torchvision >= 0.13 uses `weights=`; `pretrained=True` is deprecated)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Replace the head
n_features = resnet50.fc.in_features   # 2048
resnet50.fc = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(n_features, 2),
)

# Parameter count
total    = sum(p.numel() for p in resnet50.parameters())
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Total: {total:,}, Trainable: {trainable:,}")

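# For fine-tuning: freeze the early layers (stem + first two stages).
# startswith avoids also matching e.g. "layer3.0.conv1".
frozen_prefixes = ("conv1", "bn1", "layer1", "layer2")
for name, param in resnet50.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False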

trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable after freezing early layers: {trainable:,}")

Interview Answer

"VGG popularised stacking only 3×3 convolutions: two 3×3 convs have the same receptive field as one 5×5 with fewer parameters and more non-linearity. However, plain networks like VGG hit a depth limit: VGG's accuracy saturates around 19 layers, and much deeper plain networks show degradation, where even training error rises, because optimising a deep stack of layers to approximate identity mappings is hard. ResNet (He et al., 2015) solved this with residual connections: output = F(x) + x. The shortcut lets gradients flow directly backward through the addition: dL/dx_l = dL/dx_{l+1} × (1 + dF/dx_l). The '+1' provides a gradient highway regardless of what F learns. Learning F(x) = 0 (identity) is easy: just push the residual branch's weights to zero. This allowed training of 152-layer networks and beyond. ResNet-50 achieves better ImageNet accuracy than VGG-19 with 5.7× fewer parameters. Skip connections became a standard ingredient of modern architectures: DenseNet (which concatenates features instead of adding them), EfficientNet, and Transformers all build on this principle."