CNN Filters and Pooling
How convolutional filters detect features, stride and padding, max vs average pooling, and the feature map hierarchy from edges to objects.
The Convolution Filter
A filter (kernel) is a small weight matrix that slides over the input.
At each position, it computes the dot product with the overlapping patch.
3×3 filter applied to a 5×5 input:
Input:        Filter (edge detector):   Output at top-left position:
1 2 3 4 5     -1 0 1                    (1×-1 + 2×0 + 3×1 +
3 4 5 6 7     -1 0 1                     3×-1 + 4×0 + 5×1 +
5 6 7 8 9     -1 0 1                     5×-1 + 6×0 + 7×1) = 6
7 8 9 1 2
9 1 2 3 4
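The sliding dot product can be sketched in plain Python. This is a minimal illustration, not how a framework implements it: the helper name `conv2d_valid` is hypothetical, it assumes no padding and stride 1, and, like deep learning libraries, it does not flip the kernel (strictly speaking a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch under it
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [5, 6, 7, 8, 9],
    [7, 8, 9, 1, 2],
    [9, 1, 2, 3, 4],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [6, 6, 6]
# [6, -3, -3]
# [-3, -3, -3]
```

The top-left entry matches the hand calculation. The response turns negative wherever the window covers the drop from 9 to 1, because there the left column is brighter than the right one.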
The filter "detects" vertical edges (left-dark, right-bright patterns).
High positive response = pattern matches at this location.
Stride and Padding
Stride: how many pixels the filter moves each step
stride=1 (default): windows overlap heavily, output size ≈ input size
stride=2: skips every other position, roughly halving each spatial dimension
stride > 1 is used for downsampling (an alternative to pooling)
Padding: add zeros around the input border
padding=0 (valid): output shrinks with each conv layer
padding=1 (same): output = input size (for 3×3 filter with stride=1)
Output size formula:
H_out = floor((H_in + 2×padding - kernel_size) / stride) + 1
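The formula translates directly into a small helper, which can be handy for checking layer shapes before building a model. This is a sketch under the assumption of square inputs and kernels; the function name `conv_output_size` is hypothetical and not part of any library.

```python
def conv_output_size(h_in, kernel_size, stride=1, padding=0):
    """H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1"""
    return (h_in + 2 * padding - kernel_size) // stride + 1


same_size = conv_output_size(32, kernel_size=5, padding=2)   # 5x5 kernel needs padding=2
shrunk = conv_output_size(28, kernel_size=5)                 # no padding
floored = conv_output_size(8, kernel_size=3, stride=2)       # (8 - 3) / 2 is rounded down

print(same_size, shrunk, floored)  # 32 24 3
```

The last call shows the effect of the floor: when the stride does not divide evenly, the trailing positions that cannot hold a full window are dropped.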
Examples:
H_in=224, kernel=3, padding=1, stride=1: floor((224+2-3)/1)+1 = 224 (same)
H_in=224, kernel=3, padding=0, stride=1: floor((224-3)/1)+1 = 222 (shrinks)
H_in=224, kernel=3, padding=1, stride=2: floor((224+2-3)/2)+1 = 112 (halved)
PyTorch Conv2d Parameters
import torch
import torch.nn as nn
# nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters → 64 feature maps
    kernel_size=3,    # 3×3 filter
    stride=1,         # move 1 pixel at a time
    padding=1,        # pad to maintain spatial size
    bias=True,        # one bias per output channel
)
# Parameter count
n_weights = 64 * 3 * 3 * 3  # out_channels × in_channels × kernel_h × kernel_w = 1,728
n_biases = 64
total = n_weights + n_biases
print(f"Conv parameters: {total:,}")  # 1,728 + 64 = 1,792
# Forward pass shape
x = torch.randn(8, 3, 224, 224) # batch=8, channels=3, height=224, width=224
y = conv(x)
print(f"Output shape: {y.shape}") # (8, 64, 224, 224)
# With stride=2 (downsampling)
conv_stride = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
y2 = conv_stride(x)
print(f"Stride=2 output: {y2.shape}") # (8, 64, 112, 112)
Pooling
Pooling reduces spatial dimensions by summarising each window with a single value, keeping either the strongest or the average activation.
MaxPool2d (most common):
Takes the maximum value in each window
Keeps the strongest feature activation
Gives spatial invariance: the feature just needs to appear somewhere in the window
AveragePool2d:
Takes the average value in each window
Smoother spatial representation
Used for: global average pooling before classification head
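To make the difference between the two window operations concrete, here is a plain-Python sketch. The helper names `pool2d` and `mean` are hypothetical, and the sketch assumes non-overlapping windows (stride equal to the window size) on a single channel.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 4],
    [2, 0, 3, 8],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)

print(max_pooled)  # [[7, 2], [2, 9]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 6.0]]
```

Max pooling keeps the single strongest activation in each window, while average pooling blends all four values, so an isolated strong response is diluted.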
GlobalAveragePool (AdaptiveAvgPool):
Pools the ENTIRE feature map to a single value per channel
Eliminates the fixed-size constraint: works on any input size
Standard in modern architectures (ResNet, EfficientNet)
# MaxPool2d: 2×2 window, stride 2 (default stride = kernel_size)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 56, 56)
y = maxpool(x)
print(f"After MaxPool: {y.shape}") # (8, 64, 28, 28)
# GlobalAveragePool: each 28×28 map → single value
gap = nn.AdaptiveAvgPool2d(output_size=1)
y2 = gap(y)
print(f"After GAP: {y2.shape}") # (8, 64, 1, 1)
# Flatten for classifier
flat = y2.flatten(start_dim=1)
print(f"After flatten: {flat.shape}") # (8, 64)
# The combination: GAP → Flatten → Linear
# is the standard CNN classification head in modern architectures
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),  # 10 classes
)
Feature Hierarchy
What filters learn at each depth (from visualisation research):
Layer 1 (shallow):
Edges (horizontal, vertical, diagonal)
Colour gradients
Simple textures
Layer 2:
Corners, T-junctions
Simple textures (stripes, grids)
Curves
Layer 3:
Complex textures
Object parts (eye shapes, wheel shapes, leaf shapes)
Orientation-specific patterns
Layer 4+:
Object prototypes
Semantic groupings
Increasingly abstract "concepts"
For medical imaging:
Early layers: pixel-level gradients
Mid layers: anatomical boundaries, tissue textures
Late layers: pathological patterns (lesion shapes, abnormal densities)
Receptive Field
Each feature map cell "sees" a limited region of the original input.
This is the receptive field.
3×3 conv, stride 1:
Layer 1: 3×3 receptive field
Layer 2 (another 3×3): 5×5 receptive field
Layer 3 (another 3×3): 7×7 receptive field
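The growth listed above follows a simple recurrence: each layer adds (kernel_size - 1) times the cumulative stride of the layers before it. The sketch below applies that recurrence; the function name `receptive_field` and the (kernel_size, stride) tuple format are assumptions made for illustration, and dilation is ignored.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from input to output.

    Returns the receptive field side length after each layer.
    """
    rf = 1    # a raw input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent cells of the current map
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


stacked = receptive_field([(3, 1), (3, 1), (3, 1)])
with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])

print(stacked)    # [3, 5, 7]
print(with_pool)  # [3, 4, 8]
```

The second call inserts a 2×2, stride-2 pooling layer between two 3×3 convolutions: after downsampling, each step in the feature map covers two input pixels, so the receptive field grows faster.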
Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution
but fewer weights per input-output channel pair (2 × 3×3 = 18 vs 5×5 = 25) and an extra non-linearity between them.
This is why modern CNNs use stacks of 3×3 filters rather than large filters.
Interview Answer
"CNN filters are small weight matrices (typically 3×3) that slide over input feature maps computing dot products; each filter detects a specific spatial pattern wherever it appears. Stride controls the step size (stride=2 halves the spatial dimensions); padding maintains the spatial size (padding=1 for 3×3 filters). MaxPooling takes the maximum in each window, providing spatial invariance and downsampling. Global Average Pooling converts each feature map to a single value, giving the classification head a fixed-size input regardless of image dimensions. Deeper layers have larger effective receptive fields: early layers detect edges, deeper layers detect object parts and semantic concepts. Modern CNNs stack 3×3 filters because two 3×3s have the same receptive field as one 5×5 with fewer parameters and more non-linearity."