CNN Filters and Pooling

How convolutional filters detect features, stride and padding, max vs average pooling, and the feature map hierarchy from edges to objects.

Asma Hafeez Khan | May 21, 2026 | 5 min read

The Convolution Filter

A filter (kernel) is a small weight matrix that slides over the input.
At each position, it computes the dot product with the overlapping patch.

3×3 filter applied to a 5×5 input:

Input:          Filter (edge detector):   Output at top-left position:
1 2 3 4 5       -1  0  1                (1×(-1) + 2×0 + 3×1 +
3 4 5 6 7       -1  0  1                 3×(-1) + 4×0 + 5×1 +
5 6 7 8 9       -1  0  1                 5×(-1) + 6×0 + 7×1) = 6
7 8 9 1 2
9 1 2 3 4

The filter "detects" vertical edges (left-dark, right-bright patterns).
High positive response = pattern matches at this location.
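
The sliding dot product described above can be sketched in plain Python. This is a minimal illustration, assuming no padding and stride 1; the helper name `conv2d_valid` is hypothetical and not part of any library. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D list (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the patch whose top-left corner is (i, j)
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [5, 6, 7, 8, 9],
    [7, 8, 9, 1, 2],
    [9, 1, 2, 3, 4],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [6, 6, 6]
# [6, -3, -3]
# [-3, -3, -3]
```

The top-left value matches the hand calculation (6). The negative responses appear where the values drop from left to right, which is the opposite of the pattern this filter looks for.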

Stride and Padding

Stride: how many pixels the filter moves at each step
  stride=1 (default): neighbouring windows overlap heavily; with "same" padding the output size equals the input size
  stride=2: the filter visits every other position, roughly halving each spatial dimension
  A strided convolution can be used for downsampling (alternative to pooling)

Padding: zeros added around the input border
  padding=0 (valid): the output shrinks by kernel_size - 1 with each stride-1 conv layer
  padding=1 (same): output size = input size (for a 3×3 filter with stride=1)

Output size formula:
  H_out = floor((H_in + 2×padding - kernel_size) / stride) + 1

Examples:
  H_in=224, kernel=3, padding=1, stride=1: floor((224+2-3)/1)+1 = 224 (same)
  H_in=224, kernel=3, padding=0, stride=1: floor((224-3)/1)+1 = 222 (shrinks)
  H_in=224, kernel=3, padding=1, stride=2: floor((224+2-3)/2)+1 = 112 (halved)
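
The output size formula is easy to check in code. The sketch below is a direct transcription of it; the function name `conv_output_size` is a hypothetical helper, and integer floor division stands in for `floor(...)`.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0):
    """H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1"""
    return (size_in + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(224, kernel_size=3, stride=1, padding=1)
valid = conv_output_size(224, kernel_size=3, stride=1, padding=0)
halved = conv_output_size(224, kernel_size=3, stride=2, padding=1)

print(same, valid, halved)  # 224 222 112
```

The same function applies to the width, and to pooling layers, since they use the same window arithmetic.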

PyTorch Conv2d Parameters

Python
import torch
import torch.nn as nn

# nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=64,   # 64 different filters -> 64 feature maps
    kernel_size=3,     # 3×3 filter
    stride=1,          # move 1 pixel at a time
    padding=1,         # pad to maintain spatial size
    bias=True,         # one bias per output channel
)

# Parameter count
n_weights = 64 * 3 * 3 * 3   # out_channels * in_channels * kernel_h * kernel_w
n_biases  = 64
total = n_weights + n_biases
print(f"Conv parameters: {total:,}")   # 1,728 + 64 = 1,792

# Forward pass shape
x = torch.randn(8, 3, 224, 224)   # batch=8, channels=3, height=224, width=224
y = conv(x)
print(f"Output shape: {y.shape}")  # (8, 64, 224, 224)

# With stride=2 (downsampling)
conv_stride = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
y2 = conv_stride(x)
print(f"Stride=2 output: {y2.shape}")  # (8, 64, 112, 112)

Pooling

Pooling reduces spatial dimensions while preserving the most important features.

MaxPool2d (most common):
  Takes the maximum value in each window
  Keeps the strongest feature activation
  Gives tolerance to small shifts: the feature only needs to appear somewhere in the window

AvgPool2d (average pooling):
  Takes the average value in each window
  Smoother spatial representation
  Used for: global average pooling before the classification head

Global average pooling (AdaptiveAvgPool2d with output_size=1):
  Pools the ENTIRE feature map to a single value per channel
  Removes the fixed-input-size constraint: the head works on any input size
  Standard in modern architectures (ResNet, EfficientNet)

Python
# MaxPool2d: 2×2 window, stride 2 (default stride = kernel_size)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 56, 56)
y = maxpool(x)
print(f"After MaxPool: {y.shape}")   # (8, 64, 28, 28)

# GlobalAveragePool: each 28×28 map -> single value
gap = nn.AdaptiveAvgPool2d(output_size=1)
y2 = gap(y)
print(f"After GAP: {y2.shape}")      # (8, 64, 1, 1)

# Flatten for classifier
flat = y2.flatten(start_dim=1)
print(f"After flatten: {flat.shape}")   # (8, 64)

# The combination: GAP -> Flatten -> Linear
# is the standard CNN classification head in modern architectures
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),  # 10 classes
)
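
To see what the pooling windows actually compute, here is a small pure-Python sketch on a single 4×4 feature map. The helper `pool2d` and the example values are made up for illustration; in practice the PyTorch layers above do this work.

```python
def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D list of numbers."""
    if mode == "max":
        reduce_fn = max
    else:
        reduce_fn = lambda values: sum(values) / len(values)

    pooled = []
    for i in range(0, len(feature_map) - window + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - window + 1, stride):
            # Collect the values inside the current window
            patch = [
                feature_map[i + di][j + dj]
                for di in range(window)
                for dj in range(window)
            ]
            row.append(reduce_fn(patch))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
# Global average pooling: one number for the whole map
global_avg = sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))

print(max_pooled)   # [[4, 2], [2, 8]]
print(avg_pooled)   # [[2.5, 1.0], [0.75, 6.5]]
print(global_avg)   # 2.6875
```

Max pooling keeps only the strongest activation in each window, while average pooling blends the window into one value. Global average pooling is the limiting case where the window covers the whole map.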

Feature Hierarchy

What filters learn at each depth (from visualisation research):

Layer 1 (shallow):
  Edges (horizontal, vertical, diagonal)
  Colour gradients
  Simple textures

Layer 2:
  Corners, T-junctions
  Regular patterns (stripes, grids)
  Curves

Layer 3:
  Complex textures
  Object parts (eye shapes, wheel shapes, leaf shapes)
  Orientation-specific patterns

Layer 4+:
  Object prototypes
  Semantic groupings
  Increasingly abstract "concepts"

For medical imaging:
  Early layers: pixel-level gradients
  Mid layers: anatomical boundaries, tissue textures
  Late layers: pathological patterns (lesion shapes, abnormal densities)

Receptive Field

Each feature map cell "sees" a limited region of the original input.
This is the receptive field.

3×3 conv, stride 1:
  Layer 1: 3×3 receptive field
  Layer 2 (another 3×3): 5×5 receptive field
  Layer 3 (another 3×3): 7×7 receptive field

Two 3×3 convolutions have the same receptive field as one 5×5 convolution
but fewer weights per input-output channel pair (2 × 3×3 = 18 vs 5×5 = 25)
and more non-linearity, since an activation follows each layer.

This is why modern CNNs use stacks of 3×3 filters rather than large filters.
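
The receptive field growth can be reproduced with a short recurrence: each layer adds (kernel_size - 1) times the cumulative stride of the layers before it. The sketch below assumes a plain stack of convolutions without dilation; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field of a conv stack given as (kernel_size, stride) pairs."""
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


for depth in (1, 2, 3):
    print(depth, receptive_field([(3, 1)] * depth))
# 1 3
# 2 5
# 3 7

# Weights per input-output channel pair
stacked_3x3 = 2 * 3 * 3   # two 3×3 layers
single_5x5 = 5 * 5        # one 5×5 layer
print(stacked_3x3, single_5x5)  # 18 25
```

Using a stride greater than 1 in an early layer makes later layers grow the receptive field faster, because `jump` multiplies every subsequent increment.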

Interview Answer

"CNN filters are small weight matrices (typically 3×3) that slide over input feature maps computing dot products; each filter detects a specific spatial pattern wherever it appears. Stride controls the step size (stride=2 halves the spatial dimensions); padding maintains the spatial size (padding=1 for 3×3 filters). MaxPooling takes the maximum in each window, providing spatial invariance and downsampling. Global Average Pooling converts each feature map to a single value, giving the classification head a fixed-size input regardless of image dimensions. Deeper layers have larger effective receptive fields: early layers detect edges, deeper layers detect object parts and semantic concepts. Modern CNNs stack 3×3 filters because two 3×3s have the same receptive field as one 5×5 with fewer parameters and more non-linearity."
