Deep Learning for AI Interviews · Lesson 13 of 56
What is a CNN and Why Is It Used for Images?
Why Not Just Fully Connected?
A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values (50,176 pixels × 3 colour channels).
Fully connecting these to a first hidden layer of 1,024 neurons requires:
Parameters: 150,528 × 1,024 ≈ 154M weights, just for the first layer!
It also ignores spatial structure: the network has no built-in knowledge
that neighbouring pixels are related.
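The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `dense_layer_params` is made up for illustration, and the storage figure assumes 4-byte float32 parameters.

```python
def dense_layer_params(n_inputs: int, n_outputs: int, bias: bool = True) -> int:
    """Number of trainable parameters in one fully connected layer."""
    return n_inputs * n_outputs + (n_outputs if bias else 0)


height, width, channels = 224, 224, 3
n_inputs = height * width * channels            # one input per pixel per channel

weights_only = dense_layer_params(n_inputs, 1024, bias=False)
with_bias = dense_layer_params(n_inputs, 1024)  # adds one bias per neuron

print(f"Input values:    {n_inputs:,}")
print(f"Weights:         {weights_only:,}")
print(f"Weights + bias:  {with_bias:,}")
# Assumption: parameters stored as float32 (4 bytes each)
print(f"float32 storage: {with_bias * 4 / 1e6:.0f} MB")
```

The storage line is a reminder that the cost is not only the parameter count: the weights, their gradients, and any optimizer state all scale with it.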
A CNN exploits two properties of images:
1. Local connectivity: nearby pixels are strongly related; distant pixels much less so
2. Translation invariance: a cat is a cat whether it appears in the top-left or the bottom-right
The Convolution Operation
A filter (kernel) slides across the input image.
At each position, it computes the dot product with the patch it covers.
Kernel (3×3 horizontal-edge detector):
[[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]]
Sliding this kernel across an image:
At each position: sum of (kernel × image_patch)
Produces a feature map: high values where this pattern exists
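The sliding dot product described above can be tried on a toy input. The sketch below assumes a made-up 6×6 grayscale image (dark top half, bright bottom half) and uses `torch.nn.functional.conv2d`, which, like `nn.Conv2d`, applies the kernel without flipping it (strictly speaking, cross-correlation).

```python
import torch
import torch.nn.functional as F

# The 3×3 kernel from the text, reshaped to (out_channels, in_channels, kH, kW)
kernel = torch.tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]]).reshape(1, 1, 3, 3)

# Toy image (an assumption for illustration): rows 0-2 are 0, rows 3-5 are 1
image = torch.zeros(1, 1, 6, 6)
image[:, :, 3:, :] = 1.0

# Slide the kernel over every 3×3 patch; no padding, so the output is 4×4
feature_map = F.conv2d(image, kernel)

# The same value computed by hand for the patch whose top-left corner is (1, 0)
patch = image[0, 0, 1:4, 0:3]
manual_value = (patch * kernel[0, 0]).sum()

print(feature_map[0, 0])
print(manual_value)
```

The feature map is zero where the patch is uniform and positive only in the rows where the kernel straddles the dark-to-bright boundary, which is what "high values where this pattern exists" means in practice.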
Key properties:
Local: kernel only sees a 3×3 patch at once
Shared weights: the SAME kernel is applied at every position
Equivariant: if the pattern shifts, the high activation shifts too
Key Components of a CNN
Conv layer: applies n_filters kernels → n_filters feature maps
MaxPool: downsamples (reduces spatial size), takes max in each window
ReLU: non-linearity (element-wise)
BatchNorm: normalises each channel of the feature maps using statistics computed over the batch
Flatten: converts spatial feature maps to a vector
Fully connected: final classification layer
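One way to see what each component in the list above does is to pass a tensor through them and print the shape after every step. This is an illustrative sketch only: the batch size, the 32×32 input, the 16 filters, and the 10 output classes are arbitrary choices, not values from the lesson.

```python
import torch
import torch.nn as nn

# (batch, channels, height, width); all sizes here are arbitrary assumptions
x = torch.randn(4, 3, 32, 32)

layers = [
    ("Conv2d",      nn.Conv2d(3, 16, kernel_size=3, padding=1)),  # 3 -> 16 feature maps
    ("BatchNorm2d", nn.BatchNorm2d(16)),                          # shape unchanged
    ("ReLU",        nn.ReLU()),                                   # shape unchanged
    ("MaxPool2d",   nn.MaxPool2d(kernel_size=2, stride=2)),       # halves height and width
    ("Flatten",     nn.Flatten()),                                # 16 * 16 * 16 = 4096 values
    ("Linear",      nn.Linear(16 * 16 * 16, 10)),                 # class scores
]

shapes = []
for name, layer in layers:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(f"{name:<12} -> {tuple(x.shape)}")
```

Only the convolution changes the channel count, only the pooling changes the spatial size, and `Flatten` is where the spatial layout is given up in exchange for a plain vector.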
Typical structure:
[Conv → BN → ReLU → MaxPool] × k blocks (feature extraction)
[Flatten → Linear → ReLU → Dropout] × m (classification head)
PyTorch CNN Implementation
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Simple CNN for binary image classification."""

    def __init__(self, in_channels: int = 3, n_classes: int = 2):
        super().__init__()
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1: (B, 3, 224, 224) → (B, 32, 112, 112)
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Block 2: (B, 32, 112, 112) → (B, 64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Block 3: (B, 64, 56, 56) → (B, 128, 28, 28)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        # Global average pooling: (B, 128, 28, 28) → (B, 128, 1, 1)
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # Classification head: Flatten turns (B, 128, 1, 1) into (B, 128)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)      # convolutional feature maps
        x = self.global_pool(x)   # one average per channel
        x = self.classifier(x)    # raw class scores (logits)
        return x


# Test
model = SimpleCNN(in_channels=3, n_classes=2)
x = torch.randn(8, 3, 224, 224)  # batch of 8 RGB images
output = model(x)
print(f"Output shape: {output.shape}")  # torch.Size([8, 2])

# Parameter count
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")  # 102,082, far fewer than the fully connected layer
How Convolution Reduces Parameters
Fully connected (150K input → 1024):
Parameters: 150,528 × 1,024 = 154M
CNN first layer (3-channel input, 32 filters, 3×3 kernel):
Parameters: 32 × (3 × 3 × 3 + 1) = 896 ← 172,000× fewer
Why fewer? Weight sharing:
The same 3×3×3 kernel (27 weights) is applied at ALL 224×224 = 50,176 positions
Instead of 50,176 × 27 unique weights, we have just 27 shared weights
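These counts can be confirmed directly from a PyTorch layer. In the sketch below the convolution is instantiated with the same settings as the first layer discussed above, while the fully connected and "unshared" counts are computed arithmetically so that no 154M-weight layer has to be allocated.

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)

conv_weights = conv.weight.numel()   # 32 filters × 3 channels × 3 × 3
conv_biases = conv.bias.numel()      # one bias per filter
conv_params = conv_weights + conv_biases

# Weights one filter would need if every output position had its own kernel
unshared_per_filter = 224 * 224 * (3 * 3 * 3)

# Fully connected first layer: weights only, computed without building the layer
fc_params = 224 * 224 * 3 * 1024

print(f"Conv weight shape:       {tuple(conv.weight.shape)}")
print(f"Conv parameters:         {conv_params:,}")
print(f"Unshared, single filter: {unshared_per_filter:,}")
print(f"FC weights:              {fc_params:,}")
print(f"FC / conv ratio:         {fc_params / conv_params:,.0f}x")
```

Note that the convolution's parameter count does not depend on the image size at all: the same 896 parameters would be used for a 64×64 or a 1024×1024 input.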
This sharing forces the network to learn features that work at any position:
an "edge detector" works everywhere, not just in the top-left corner.
CNN vs Fully Connected Comparison
Aspect            | Fully Connected                  | CNN
------------------|----------------------------------|----------------------------------------------
Parameters        | Massive (154M in the first layer)| Efficient (896 in the first conv layer)
Spatial structure | Ignored                          | Exploited via local receptive fields
Translation inv.  | No                               | Approximate (via pooling)
Good for images   | No                               | Yes
Good for tabular  | Yes                              | No
Memory            | Very high                        | Much lower
SOTA on images    | No                               | Competitive (ResNet, EfficientNet, ConvNeXt), alongside transformers such as ViT
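The "Translation inv." row can be made concrete with a toy experiment. The sketch below uses assumptions chosen for illustration: an all-ones 3×3 kernel acting as a crude blob detector, a 12×12 blank image, and two helper functions (`make_image`, `peak_location`) that are not part of any library. The convolution output moves with the pattern (equivariance), while global max pooling returns the same value for both positions (invariance).

```python
import torch
import torch.nn.functional as F

kernel = torch.ones(1, 1, 3, 3)   # hypothetical 3×3 "blob" detector


def make_image(row: int, col: int) -> torch.Tensor:
    """12×12 blank image with a 3×3 bright patch whose top-left corner is (row, col)."""
    img = torch.zeros(1, 1, 12, 12)
    img[:, :, row:row + 3, col:col + 3] = 1.0
    return img


def peak_location(fmap: torch.Tensor) -> tuple:
    """(row, col) of the largest activation in a (1, 1, H, W) feature map."""
    flat_idx = fmap.flatten().argmax().item()
    return divmod(flat_idx, fmap.shape[-1])


fmap_a = F.conv2d(make_image(2, 2), kernel, padding=1)
fmap_b = F.conv2d(make_image(6, 5), kernel, padding=1)   # patch shifted by (4, 3)

# Equivariance: the peak activation shifts by the same offset as the patch
loc_a, loc_b = peak_location(fmap_a), peak_location(fmap_b)

# Invariance: global max pooling keeps the strength and discards the location
pooled_a = F.adaptive_max_pool2d(fmap_a, 1).item()
pooled_b = F.adaptive_max_pool2d(fmap_b, 1).item()

print(f"Peak locations: {loc_a} and {loc_b}")
print(f"Pooled values:  {pooled_a} and {pooled_b}")
```

The invariance is exact here only because the pooling is global and the patch stays away from the image border. The small local pooling windows in a typical CNN tolerate shifts of a few pixels, which is why the table calls the invariance approximate.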
For medical imaging (X-ray, histology, fundus photos):
CNNs (and Vision Transformers) are the standard approach.
Interview Answer
"CNNs use convolutional layers: a small kernel (typically 3×3) slides across the input and computes a dot product at every position. Because of weight sharing, the same kernel is applied everywhere, which gives translation equivariance (a feature is detected wherever it appears) and drastically reduces parameters: 32 kernels × (3×3×3 weights + 1 bias) = 896 parameters, versus 154M for a fully connected first layer on a 224×224 RGB image. The architecture stacks convolution + batch norm + activation + pooling blocks to progressively extract features, from edges to objects. CNNs dominated computer vision until Vision Transformers (ViTs) became competitive, and modern convolutional architectures such as ConvNeXt remain strong while borrowing design ideas from transformers. For medical imaging (X-ray, ECG as image, histology), fine-tuning a pre-trained ResNet or EfficientNet on the clinical task is a standard approach."