Learnixo

Deep Learning for AI Interviews · Lesson 14 of 56

Filters, Pooling, and Receptive Fields

The Convolution Filter

A filter (kernel) is a small weight matrix that slides over the input.
At each position, it computes the dot product with the overlapping patch.

3×3 filter applied to a 5×5 input:

Input:          Filter (edge detector):   Output at top-left position:
1 2 3 4 5       -1  0  1                (1×-1 + 2×0 + 3×1 +
3 4 5 6 7       -1  0  1                 3×-1 + 4×0 + 5×1 +
5 6 7 8 9       -1  0  1                 5×-1 + 6×0 + 7×1) = 6
7 8 9 1 2
9 1 2 3 4

The filter "detects" vertical edges (left-dark, right-bright patterns).
High positive response = pattern matches at this location.
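
The worked example above can be reproduced with a minimal pure-Python sketch. The helper name `conv2d_valid` is an illustrative assumption, not a library function. It assumes stride 1 and no padding, and like deep learning frameworks it does not flip the kernel (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the output map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch whose top-left corner is (i, j)
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [5, 6, 7, 8, 9],
    [7, 8, 9, 1, 2],
    [9, 1, 2, 3, 4],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [6, 6, 6]
# [6, -3, -3]
# [-3, -3, -3]
```

The top-left value matches the hand calculation (6). The negative responses appear where the window includes the drop from 9 to 1, a bright-to-dark transition with the opposite polarity to the one this filter responds to.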

Stride and Padding

Stride: how many pixels the filter moves at each step
  stride=1 (default): adjacent windows overlap heavily, output size ≈ input size
  stride=2: the filter skips every other position, roughly halving height and width
  strided convolution is a learned alternative to pooling for downsampling

Padding: add zeros around the input border
  padding=0 (valid): output shrinks with each conv layer
  padding=1 (same): output = input size (for 3×3 filter with stride=1)

Output size formula:
  H_out = floor((H_in + 2×padding - kernel_size) / stride) + 1

Examples:
  H_in=224, kernel=3, padding=1, stride=1: floor((224+2-3)/1)+1 = 224 (same)
  H_in=224, kernel=3, padding=0, stride=1: floor((224-3)/1)+1 = 222 (shrinks)
  H_in=224, kernel=3, padding=1, stride=2: floor((224+2-3)/2)+1 = 112 (halved)
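
The formula above translates directly into a small helper, sketched here under the assumption of a square kernel and no dilation. The function name `conv_output_size` is hypothetical and chosen only for illustration.

```python
def conv_output_size(h_in, kernel_size, padding=0, stride=1):
    """Output height (or width) of a convolution along one dimension."""
    # Integer floor division implements the floor() in the formula
    return (h_in + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(224, kernel_size=3, padding=1, stride=1))  # 224 (same)
print(conv_output_size(224, kernel_size=3, padding=0, stride=1))  # 222 (shrinks)
print(conv_output_size(224, kernel_size=3, padding=1, stride=2))  # 112 (halved)
print(conv_output_size(7, kernel_size=3, padding=1, stride=2))    # 4 (odd input: rounds up, not exactly half)
```

The last call shows why "stride=2 halves the output" is only exact for even input sizes.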

PyTorch Conv2d Parameters

Python
import torch
import torch.nn as nn

# nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=64,   # 64 different filters → 64 feature maps
    kernel_size=3,     # 3×3 filter
    stride=1,          # move 1 pixel at a time
    padding=1,         # pad to maintain spatial size
    bias=True,         # one bias per output channel
)

# Parameter count
n_weights = 64 * 3 * 3 * 3   # out_channels × in_channels × kernel_h × kernel_w = 1,728
n_biases  = 64
total = n_weights + n_biases
print(f"Conv parameters: {total:,}")   # 1,728 + 64 = 1,792

# Forward pass shape
x = torch.randn(8, 3, 224, 224)   # batch=8, channels=3, height=224, width=224
y = conv(x)
print(f"Output shape: {y.shape}")  # (8, 64, 224, 224)

# With stride=2 (downsampling)
conv_stride = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
y2 = conv_stride(x)
print(f"Stride=2 output: {y2.shape}")  # (8, 64, 112, 112)
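
The parameter arithmetic generalises to any layer width. The following is a minimal sketch in plain Python, assuming a square kernel and a standard (non-grouped) convolution. `conv2d_param_count` is a hypothetical helper, not part of PyTorch.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters of a square-kernel 2D convolution (no grouping)."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels if bias else 0
    return weights + biases


first_layer = conv2d_param_count(3, 64, 3)      # 1,728 weights + 64 biases
deeper_layer = conv2d_param_count(64, 128, 3)   # 73,728 weights + 128 biases
print(first_layer)    # 1792
print(deeper_layer)   # 73856
```

Note that the count does not depend on the input height or width: the same filters are reused at every spatial position.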

Pooling

Pooling downsamples each feature map by summarising small windows, shrinking height and width while keeping the channel count unchanged.

MaxPool2d (most common):
  Takes the maximum value in each window
  Keeps the strongest feature activation
  Gives invariance to small shifts: the feature only needs to appear somewhere in the window

AvgPool2d:
  Takes the average value in each window
  Smoother spatial representation
  Used for: global average pooling before classification head

Global average pooling (nn.AdaptiveAvgPool2d with output_size=1):
  Pools the ENTIRE feature map to a single value per channel
  Removes the fixed input-size constraint: the classifier receives one value per channel for any input size
  Standard in modern architectures (ResNet, EfficientNet)

Python
import torch
import torch.nn as nn

# MaxPool2d: 2×2 window, stride 2 (default stride = kernel_size)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 56, 56)
y = maxpool(x)
print(f"After MaxPool: {y.shape}")   # (8, 64, 28, 28)

# GlobalAveragePool: each 28×28 map → single value
gap = nn.AdaptiveAvgPool2d(output_size=1)
y2 = gap(y)
print(f"After GAP: {y2.shape}")      # (8, 64, 1, 1)

# Flatten for classifier
flat = y2.flatten(start_dim=1)
print(f"After flatten: {flat.shape}")   # (8, 64)

# The combination: GAP → Flatten → Linear
# is the standard CNN classification head in modern architectures
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),  # 10 classes
)
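
To make the window arithmetic concrete, here is a minimal pure-Python sketch of what max pooling and global average pooling compute on a single small feature map. The helper names `max_pool_2x2` and `global_average_pool` are illustrative assumptions, and the input is assumed to have even height and width.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            # Non-overlapping 2x2 window with top-left corner at (i, j)
            window = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def global_average_pool(feature_map):
    """Collapse a whole 2D feature map to its mean value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool_2x2(feature_map)
print(pooled)                        # [[4, 2], [2, 8]]
print(global_average_pool(pooled))   # 4.0
```

Moving the 4 in the top-left window to any other cell of that window leaves the pooled output unchanged, which is the small-shift invariance described above.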

Feature Hierarchy

What filters tend to learn at each depth (from visualisation research; the layer boundaries are approximate):

Layer 1 (shallow):
  Edges (horizontal, vertical, diagonal)
  Colour gradients
  Simple textures

Layer 2:
  Corners, T-junctions
  Simple textures (stripes, grids)
  Curves

Layer 3:
  Complex textures
  Object parts (eye shapes, wheel shapes, leaf shapes)
  Orientation-specific patterns

Layer 4+:
  Object prototypes
  Semantic groupings
  Increasingly abstract "concepts"

For medical imaging:
  Early layers: pixel-level gradients
  Mid layers: anatomical boundaries, tissue textures
  Late layers: pathological patterns (lesion shapes, abnormal densities)

Receptive Field

Each feature map cell "sees" a limited region of the original input.
This is the receptive field.

3×3 conv, stride 1:
  Layer 1: 3×3 receptive field
  Layer 2 (another 3×3): 5×5 receptive field
  Layer 3 (another 3×3): 7×7 receptive field

Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution
but use fewer weights (2 × 3×3 = 18 vs 5×5 = 25 per input-output channel pair) and add an extra non-linearity between them.

This is why modern CNNs use stacks of 3×3 filters rather than large filters.
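
The receptive-field growth and the parameter comparison above can be checked with a short sketch. It assumes the standard recurrence in which each layer adds (kernel_size - 1) times the product of all earlier strides. The function name `receptive_field` and the channel count of 64 are hypothetical choices for illustration.

```python
def receptive_field(layers):
    """Receptive field size after each layer.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1   # jump = distance in input pixels between adjacent cells
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # [3, 5, 7]
print(receptive_field([(5, 1)]))                   # [5]
print(receptive_field([(3, 2), (3, 1)]))           # [3, 7]

# Weight comparison for C input and C output channels (biases ignored)
channels = 64
stacked_3x3 = 2 * (3 * 3 * channels * channels)
single_5x5 = 5 * 5 * channels * channels
print(stacked_3x3, single_5x5)                     # 73728 102400
```

The third call shows that a stride greater than 1 makes the receptive field grow faster in later layers, which is one reason early downsampling is common.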

Interview Answer

"CNN filters are small weight matrices (typically 3×3) that slide over input feature maps computing dot products — each filter detects a specific spatial pattern wherever it appears. Stride controls the step size (stride=2 halves the spatial dimensions); padding maintains the spatial size (padding=1 for 3×3 filters). MaxPooling takes the maximum in each window, providing spatial invariance and downsampling. Global Average Pooling converts each feature map to a single value, giving the classification head a fixed-size input regardless of image dimensions. Deeper layers have larger effective receptive fields — early layers detect edges, deeper layers detect object parts and semantic concepts. Modern CNNs stack 3×3 filters because two 3×3s have the same receptive field as one 5×5 with fewer parameters and more non-linearity."