Deep Learning for AI Interviews · Lesson 14 of 56
Filters, Pooling, and Receptive Fields
The Convolution Filter
A filter (kernel) is a small weight matrix that slides over the input.
At each position, it computes the dot product with the overlapping patch.
3×3 filter applied to a 5×5 input:

Input:
1 2 3 4 5
3 4 5 6 7
5 6 7 8 9
7 8 9 1 2
9 1 2 3 4

Filter (edge detector):
-1 0 1
-1 0 1
-1 0 1

Output at top-left position:
(1×-1 + 2×0 + 3×1) + (3×-1 + 4×0 + 5×1) + (5×-1 + 6×0 + 7×1) = 2 + 2 + 2 = 6
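The hand computation above can be checked with a short pure-Python sketch. The helper name `correlate2d_valid` is made up for this illustration, and it assumes no padding and stride 1. It is not how PyTorch implements convolution, only the same sliding dot product written out explicitly.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and collect the dot products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Dot product between the kernel and the patch currently under it
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [5, 6, 7, 8, 9],
    [7, 8, 9, 1, 2],
    [9, 1, 2, 3, 4],
]
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = correlate2d_valid(image, vertical_edge)
print(feature_map[0][0])  # 6, the top-left value computed by hand
for row in feature_map:
    print(row)
# [6, 6, 6]
# [6, -3, -3]
# [-3, -3, -3]
```

The negative responses appear where the values in the input drop from 9 back to 1, so a patch runs from bright on the left to dark on the right, the opposite of the pattern this filter responds to.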
The filter "detects" vertical edges (left-dark, right-bright patterns).
High positive response = pattern matches at this location.
Stride and Padding
Stride: how many pixels the filter moves each step
stride=1 (default): windows overlap heavily, output size ≈ input size
stride=2: skips every other position, roughly halves each spatial dimension
stride > 1 can be used for downsampling (an alternative to pooling)
Padding: add zeros around the input border
padding=0 (valid): output shrinks with each conv layer
padding=1 (same): output = input size (for 3×3 filter with stride=1)
Output size formula:
H_out = floor((H_in + 2×padding - kernel_size) / stride) + 1
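The formula above translates directly into a small helper. This is a minimal sketch: the function name `conv_output_size` is hypothetical, and it assumes a square kernel, symmetric zero padding and no dilation.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    # Integer floor division implements the floor() in the formula
    return (size_in + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(224, kernel_size=3, stride=1, padding=1)
valid = conv_output_size(224, kernel_size=3, stride=1, padding=0)
halved = conv_output_size(224, kernel_size=3, stride=2, padding=1)
print(same, valid, halved)  # 224 222 112
```

The same helper applies to height and width independently, so a non-square input only needs two calls.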
Examples:
H_in=224, kernel=3, padding=1, stride=1: floor((224+2-3)/1)+1 = 224 (same)
H_in=224, kernel=3, padding=0, stride=1: floor((224-3)/1)+1 = 222 (shrinks)
H_in=224, kernel=3, padding=1, stride=2: floor((224+2-3)/2)+1 = 112 (halved)
PyTorch Conv2d Parameters
import torch
import torch.nn as nn
# nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters → 64 feature maps
    kernel_size=3,    # 3×3 filter
    stride=1,         # move 1 pixel at a time
    padding=1,        # pad to maintain spatial size
    bias=True,        # one bias per output channel
)
# Parameter count
n_weights = 64 * 3 * 3 * 3  # out_channels × in_channels × kernel_h × kernel_w = 1,728
n_biases = 64
total = n_weights + n_biases
print(f"Conv parameters: {total:,}")  # 1,728 + 64 = 1,792
# Forward pass shape
x = torch.randn(8, 3, 224, 224) # batch=8, channels=3, height=224, width=224
y = conv(x)
print(f"Output shape: {y.shape}") # (8, 64, 224, 224)
# With stride=2 (downsampling)
conv_stride = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
y2 = conv_stride(x)
print(f"Stride=2 output: {y2.shape}") # (8, 64, 112, 112)
Pooling
Pooling reduces spatial dimensions by replacing each local window with a single summary value (its maximum or its average).
MaxPool2d (most common):
Takes the maximum value in each window
Keeps the strongest feature activation
Gives local translation invariance: the feature only needs to appear somewhere in the window
AveragePool2d:
Takes the average value in each window
Smoother spatial representation
Used for: global average pooling before classification head
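The two window operations above differ only in how each window is reduced. A pure-Python sketch on a made-up 4×4 activation map shows the difference. The names `pool2d` and `mean` are illustrative, and the sketch assumes non-overlapping windows (stride equal to the window size).

```python
def pool2d(feature_map, window, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping window×window block."""
    rows = len(feature_map) // window
    cols = len(feature_map[0]) // window
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            # Gather the values covered by this window
            block = [
                feature_map[r * window + i][c * window + j]
                for i in range(window)
                for j in range(window)
            ]
            pooled_row.append(reduce_fn(block))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical activations, chosen so each 2×2 block is easy to check by eye
activations = [
    [1, 3, 0, 2],
    [5, 2, 1, 0],
    [0, 0, 4, 4],
    [2, 6, 4, 4],
]

max_pooled = pool2d(activations, window=2, reduce_fn=max)
avg_pooled = pool2d(activations, window=2, reduce_fn=mean)
print(max_pooled)  # [[5, 2], [6, 4]]
print(avg_pooled)  # [[2.75, 0.75], [2.0, 4.0]]
```

In the bottom-left block a single strong activation (6) survives max pooling unchanged, while average pooling dilutes it to 2.0.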
Global average pooling (nn.AdaptiveAvgPool2d(1) in PyTorch):
Pools the ENTIRE feature map to a single value per channel
Eliminates fixed-size constraint — works on any input size
Standard in modern architectures (ResNet, EfficientNet)
# MaxPool2d: 2×2 window, stride 2 (default stride = kernel_size)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 56, 56)
y = maxpool(x)
print(f"After MaxPool: {y.shape}") # (8, 64, 28, 28)
# GlobalAveragePool: each 28×28 map → single value
gap = nn.AdaptiveAvgPool2d(output_size=1)
y2 = gap(y)
print(f"After GAP: {y2.shape}") # (8, 64, 1, 1)
# Flatten for classifier
flat = y2.flatten(start_dim=1)
print(f"After flatten: {flat.shape}") # (8, 64)
# The combination: GAP → Flatten → Linear
# is the standard CNN classification head in modern architectures
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),  # 10 classes
)
Feature Hierarchy
What filters learn at each depth (from visualisation research):
Layer 1 (shallow):
Edges (horizontal, vertical, diagonal)
Colour gradients
Simple textures
Layer 2:
Corners, T-junctions
Simple textures (stripes, grids)
Curves
Layer 3:
Complex textures
Object parts (eye shapes, wheel shapes, leaf shapes)
Orientation-specific patterns
Layer 4+:
Object prototypes
Semantic groupings
Increasingly abstract "concepts"
For medical imaging:
Early layers: pixel-level gradients
Mid layers: anatomical boundaries, tissue textures
Late layers: pathological patterns (lesion shapes, abnormal densities)
Receptive Field
Each feature map cell "sees" a limited region of the original input.
This is the receptive field.
3×3 conv, stride 1:
Layer 1: 3×3 receptive field
Layer 2 (another 3×3): 5×5 receptive field
Layer 3 (another 3×3): 7×7 receptive field
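The 3 → 5 → 7 progression above follows a simple recurrence: each layer widens the receptive field by (kernel_size - 1) times the spacing between neighbouring cells of its input map. The sketch below uses that standard recurrence. The function name `receptive_fields` and the stride handling are additions for illustration, not something the lesson defines.

```python
def receptive_fields(layers):
    """Receptive field size after each layer.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # spacing, in input pixels, between neighbouring cells of the current map
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


three_convs = receptive_fields([(3, 1), (3, 1), (3, 1)])
print(three_convs)  # [3, 5, 7]

single_5x5 = receptive_fields([(5, 1)])
print(single_5x5)  # [5]

strided = receptive_fields([(3, 2), (3, 1)])
print(strided)  # [3, 7]
```

The last call shows why strided layers matter: after a stride-2 convolution, the next 3×3 layer adds 4 input pixels to the receptive field instead of 2.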
Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution
but fewer weights (2 × 3×3 = 18 vs 5×5 = 25 per input/output channel pair) and an extra non-linearity between them.
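The 18 vs 25 comparison scales with channel width. As a rough sketch, assuming a hypothetical layer with 64 input and 64 output channels and ignoring biases:

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Weight count of one convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical width, identical for input and output of every layer

two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)
print(two_3x3, one_5x5)   # 73728 102400
print(two_3x3 / one_5x5)  # 0.72
```

The ratio stays at 18/25 whatever the channel count, as long as the stacked layers keep the same width.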
This is why modern CNNs use stacks of 3×3 filters rather than large filters.
Interview Answer
"CNN filters are small weight matrices (typically 3×3) that slide over input feature maps computing dot products — each filter detects a specific spatial pattern wherever it appears. Stride controls the step size (stride=2 halves the spatial dimensions); padding maintains the spatial size (padding=1 for 3×3 filters). MaxPooling takes the maximum in each window, providing spatial invariance and downsampling. Global Average Pooling converts each feature map to a single value, giving the classification head a fixed-size input regardless of image dimensions. Deeper layers have larger effective receptive fields — early layers detect edges, deeper layers detect object parts and semantic concepts. Modern CNNs stack 3×3 filters because two 3×3s have the same receptive field as one 5×5 with fewer parameters and more non-linearity."