Convolutional Neural Networks: Introduction
What CNNs are, why convolution works for images, the key components, and how they compare to fully connected networks.
Why Not Just Fully Connected?
A 224Ć224 RGB image has 224 Ć 224 Ć 3 = 150,528 pixels.
Fully connected to first hidden layer of 1024 neurons:
Parameters: 150,528 Ć 1,024 = 154M ā just for the first layer!
This ignores spatial structure: the network has no knowledge
that neighbouring pixels are related.
CNN exploits two properties of images:
1. Local connectivity: nearby pixels are related; distant pixels are not
2. Translation invariance: a cat is a cat whether in top-left or bottom-rightThe Convolution Operation
A filter (kernel) slides across the input image.
At each position, it computes the dot product with the patch it covers.
Kernel (3Ć3 edge detector):
[[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]]
Sliding this kernel across an image:
At each position: sum of (kernel Ć image_patch)
Produces a feature map: high values where this pattern exists
Key properties:
Local: kernel only sees a 3Ć3 patch at once
Shared weights: the SAME kernel is applied at every position
Equivariant: if the pattern shifts, the high activation shifts tooKey Components of a CNN
Conv layer: applies n_filters kernels ā n_filters feature maps
MaxPool: downsamples (reduces spatial size), takes max in each window
ReLU: non-linearity (element-wise)
BatchNorm: normalises feature maps across the batch
Flatten: converts spatial feature maps to a vector
Fully connected: final classification layer
Typical structure:
[Conv ā BN ā ReLU ā MaxPool] Ć k layers (feature extraction)
[Flatten ā Linear ā ReLU ā Dropout] Ć m (classification head)PyTorch CNN Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
"""Simple CNN for binary image classification."""
def __init__(self, in_channels: int = 3, n_classes: int = 2):
super().__init__()
# Feature extraction
self.features = nn.Sequential(
# Block 1: (B, 3, 224, 224) ā (B, 32, 112, 112)
nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2),
# Block 2: (B, 32, 112, 112) ā (B, 64, 56, 56)
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2),
# Block 3: (B, 64, 56, 56) ā (B, 128, 28, 28)
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2, 2),
)
# Global average pooling: (B, 128, 28, 28) ā (B, 128)
self.global_pool = nn.AdaptiveAvgPool2d(1)
# Classification head
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(64, n_classes),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.features(x)
x = self.global_pool(x)
x = self.classifier(x)
return x
# Test
model = SimpleCNN(in_channels=3, n_classes=2)
x = torch.randn(8, 3, 224, 224) # batch of 8 RGB images
output = model(x)
print(f"Output shape: {output.shape}") # (8, 2)
# Parameter count
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}") # ~200K ā much less than fully connected!How Convolution Reduces Parameters
Fully connected (150K input ā 1024):
Parameters: 150,528 Ć 1,024 = 154M
CNN first layer (3-channel input, 32 filters, 3Ć3 kernel):
Parameters: 32 Ć (3 Ć 3 Ć 3 + 1) = 896 ā 172,000Ć fewer
Why fewer? Weight sharing:
The same 3Ć3Ć3 kernel (27 weights) is applied at ALL 224Ć224 = 50,176 positions
Instead of 50,176 Ć 27 unique weights, we have just 27 shared weights
This sharing forces the network to learn position-invariant features:
"Edge detector" works everywhere, not just in the top-left corner.CNN vs Fully Connected Comparison
| Fully Connected | CNN
----------------|--------------------|-----------------------
Parameters | Massive (150M+) | Efficient (1M for same task)
Spatial structure | Ignored | Exploited via local receptive fields
Translation inv.| No | Approximate (via pooling)
Good for images | No | Yes
Good for tabular| Yes | No
Memory | Very high | Much lower
SOTA on images | No | ResNet, ViT, EfficientNet
For medical imaging (X-ray, histology, fundus photos):
CNNs (and Vision Transformers) are the standard approach.Interview Answer
"CNNs use convolutional layers ā a small kernel (typically 3Ć3) slides across the input and computes dot products at every position. Weight sharing: the same kernel is applied everywhere, giving translation equivariance (features detected regardless of position) and drastically reducing parameters (32 kernels Ć 3Ć3Ć3 weights = 896 parameters vs 154M for a fully connected first layer on an image). The architecture alternates convolution + activation + batch norm + pooling blocks to progressively extract features from edges to objects. CNNs dominated computer vision until Vision Transformers (ViTs) competed, but even modern hybrid architectures (ConvNeXt) use convolutional ideas. For medical imaging (X-ray, ECG as image, histology), a pre-trained ResNet or EfficientNet fine-tuned on the clinical task is the standard approach."
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.