Deep Learning for AI Interviews · Lesson 13 of 56

What is a CNN and Why Is It Used for Images?

Why Not Just Fully Connected?

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values (50,176 pixels × 3 colour channels).

Fully connected to first hidden layer of 1024 neurons:
  Parameters: 150,528 × 1,024 = 154M — just for the first layer!
  
  This ignores spatial structure: the network has no knowledge
  that neighbouring pixels are related.
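
The arithmetic above can be checked in a few lines. This is only a sketch: the variable names are illustrative, and the memory figure assumes 32-bit floats, which the lesson does not state.

```python
# Cost of a fully connected first layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
hidden_units = 1024

n_pixels = height * width              # spatial positions
n_inputs = n_pixels * channels         # values fed into the layer
n_weights = n_inputs * hidden_units    # one weight per (input, neuron) pair
n_biases = hidden_units                # one bias per neuron

# Assumption: every parameter is stored as a 32-bit float (4 bytes).
megabytes = (n_weights + n_biases) * 4 / 1e6

print(f"pixels:  {n_pixels:,}")
print(f"inputs:  {n_inputs:,}")
print(f"weights: {n_weights:,}")
print(f"memory:  {megabytes:.0f} MB")
```

Under that assumption the first layer alone needs over 600 MB for its parameters, before any gradients or optimiser state.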

CNN exploits two properties of images:
  1. Local connectivity: nearby pixels are strongly correlated; distant pixels much less so
  2. Translation invariance: a cat is a cat whether in top-left or bottom-right

The Convolution Operation

A filter (kernel) slides across the input image.
At each position, it computes the dot product with the patch it covers.

Kernel (3×3 horizontal-edge detector):
  [[-1, -1, -1],
   [ 0,  0,  0],
   [ 1,  1,  1]]

Sliding this kernel across an image:
  At each position: sum of (kernel × image_patch)
  Produces a feature map: high values where this pattern exists

Key properties:
  Local: kernel only sees a 3×3 patch at once
  Shared weights: the SAME kernel is applied at every position
  Equivariant: if the pattern shifts, the high activation shifts too
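
The sliding dot product and the equivariance property can be sketched in plain Python. The helper names `correlate2d` and `edge_image` and the 6×6 toy image are assumptions made for illustration. No padding and stride 1 are assumed, and the kernel is not flipped, which matches what deep learning libraries call "convolution".

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch with top-left corner (i, j).
            total = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


def edge_image(edge_row, size=6):
    """Toy image (hypothetical): 0 above `edge_row`, 1 from `edge_row` downwards."""
    return [[1 if r >= edge_row else 0] * size for r in range(size)]


kernel = [[-1, -1, -1],
          [ 0,  0,  0],
          [ 1,  1,  1]]

fmap = correlate2d(edge_image(edge_row=3), kernel)
fmap_shifted = correlate2d(edge_image(edge_row=4), kernel)

for row in fmap:
    print(row)          # rows 1 and 2 are all 3s (the edge), rows 0 and 3 are all 0s

# Equivariance: moving the edge down one row moves the response down one row.
print(fmap_shifted[1:] == fmap[:-1])   # True
```

The same nine weights produce every entry of the feature map, which is the weight sharing listed above.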

Key Components of a CNN

Conv layer:   applies n_filters kernels → n_filters feature maps
MaxPool:      downsamples (reduces spatial size), takes max in each window
ReLU:         non-linearity (element-wise)
BatchNorm:    normalises each channel using mean/variance computed over the batch (and spatial positions)
Flatten:      converts spatial feature maps to a vector
Fully connected: final classification layer

Typical structure:
  [Conv → BN → ReLU → MaxPool] × k layers   (feature extraction)
  Flatten → [Linear → ReLU → Dropout] × m → Linear   (classification head)
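
To see what each component does to a tensor's shape, one block and a small head can be traced layer by layer. This is a sketch with assumed toy sizes (a batch of 4 RGB images at 32×32, 8 filters), not the lesson's model.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 3, 32, 32)   # assumed toy batch: (B, C, H, W)

layers = [
    ("conv", nn.Conv2d(3, 8, kernel_size=3, padding=1)),   # padding=1 keeps H and W
    ("bn", nn.BatchNorm2d(8)),                             # shape unchanged
    ("relu", nn.ReLU()),                                   # shape unchanged
    ("maxpool", nn.MaxPool2d(kernel_size=2, stride=2)),    # halves H and W
    ("flatten", nn.Flatten()),                             # (B, C*H*W)
    ("linear", nn.Linear(8 * 16 * 16, 2)),                 # class scores
]

shapes = {}
h = x
for name, layer in layers:
    h = layer(h)
    shapes[name] = tuple(h.shape)
    print(f"{name:8s} -> {shapes[name]}")
```

Only the convolution changes the channel count and only the pooling changes the spatial size; BatchNorm and ReLU leave the shape untouched.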

PyTorch CNN Implementation

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Simple CNN for binary image classification."""
    
    def __init__(self, in_channels: int = 3, n_classes: int = 2):
        super().__init__()
        
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1: (B, 3, 224, 224) → (B, 32, 112, 112)
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2: (B, 32, 112, 112) → (B, 64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3: (B, 64, 56, 56) → (B, 128, 28, 28)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        # Global average pooling: (B, 128, 28, 28) → (B, 128, 1, 1); Flatten below gives (B, 128)
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.global_pool(x)
        x = self.classifier(x)
        return x


# Test
model = SimpleCNN(in_channels=3, n_classes=2)
x = torch.randn(8, 3, 224, 224)    # batch of 8 RGB images
output = model(x)
print(f"Output shape: {output.shape}")   # torch.Size([8, 2])

# Parameter count
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")       # 102,082 (~100K), far fewer than fully connected
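
The parameter count of SimpleCNN can be cross-checked by hand from the layer shapes. The helper functions below are illustrative, not part of PyTorch. They count only learnable parameters: BatchNorm running statistics are buffers, so `model.parameters()` does not include them.

```python
def conv_params(c_in, c_out, k):
    # Each filter has c_in * k * k weights plus one bias.
    return c_out * (c_in * k * k + 1)

def bn_params(channels):
    # Learnable scale and shift per channel.
    return 2 * channels

def linear_params(n_in, n_out):
    # Each output unit has n_in weights plus one bias.
    return n_out * (n_in + 1)

breakdown = {
    "conv1": conv_params(3, 32, 3),
    "bn1": bn_params(32),
    "conv2": conv_params(32, 64, 3),
    "bn2": bn_params(64),
    "conv3": conv_params(64, 128, 3),
    "bn3": bn_params(128),
    "fc1": linear_params(128, 64),
    "fc2": linear_params(64, 2),
}

for name, count in breakdown.items():
    print(f"{name:6s} {count:>7,}")

total = sum(breakdown.values())
print(f"total  {total:>7,}")   # 102,082
```

Most of the parameters sit in the second and third convolutions, where the channel counts are largest; ReLU, MaxPool, global average pooling, Flatten and Dropout contribute none.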

How Convolution Reduces Parameters

Fully connected (150K input → 1024):
  Parameters: 150,528 × 1,024 = 154M

CNN first layer (3-channel input, 32 filters, 3×3 kernel):
  Parameters: 32 × (3 × 3 × 3 + 1) = 896   ← 172,000× fewer

Why fewer? Weight sharing:
  Each 3×3×3 kernel (27 weights) is applied at ALL 224×224 = 50,176 positions (with padding=1)
  Instead of 50,176 × 27 unique weights per filter, we have just 27 shared weights per filter
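
The effect of sharing can be made concrete by comparing three ways of connecting the same input. The function names are hypothetical. The "unshared" case is an imagined layer with a separate kernel at every position, assuming stride 1 and same-padding so that there is one output position per pixel. Biases are left out to keep the comparison simple.

```python
def fc_weights(h, w, c_in, hidden):
    # Every input value connects to every hidden unit.
    return h * w * c_in * hidden

def conv_weights(c_in, c_out, k):
    # One k x k x c_in kernel per output channel, reused at every position.
    return c_out * c_in * k * k

def unshared_weights(h, w, c_in, c_out, k):
    # Hypothetical layer: local 3x3 windows, but a different kernel at each position.
    return h * w * conv_weights(c_in, c_out, k)

for side in (32, 224, 512):
    print(
        f"{side:>3}x{side:<3} "
        f"fc={fc_weights(side, side, 3, 1024):>13,} "
        f"unshared={unshared_weights(side, side, 3, 32, 3):>13,} "
        f"conv={conv_weights(3, 32, 3):,}"
    )
```

The convolution's weight count stays at 864 whatever the image size, while the other two grow with the number of pixels.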

This sharing forces the network to learn position-invariant features:
  "Edge detector" works everywhere, not just in the top-left corner.

CNN vs Fully Connected Comparison

                  | Fully Connected    | CNN
------------------|--------------------|---------------------------------------
Parameters        | Massive (150M+)    | Far fewer (896 in the first conv layer)
Spatial structure | Ignored            | Exploited via local receptive fields
Translation inv.  | No                 | Approximate (via pooling)
Good for images   | No                 | Yes
Good for tabular  | Yes                | No
Parameter memory  | Very high          | Much lower
SOTA on images    | No                 | ResNet, EfficientNet, ConvNeXt (ViT is a Transformer, not a CNN)

For medical imaging (X-ray, histology, fundus photos):
  CNNs (and Vision Transformers) are the standard approach.

Interview Answer

"CNNs use convolutional layers: a small kernel (typically 3×3) slides across the input and computes a dot product at every position. Because the same kernel is applied everywhere (weight sharing), a feature is detected wherever it appears (translation equivariance), and the parameter count drops drastically: 32 kernels × (3×3×3 weights + 1 bias) = 896 parameters, versus 154M for a fully connected first layer on a 224×224 RGB image. The architecture stacks convolution + batch norm + activation + pooling blocks to progressively extract features, from edges up to objects. CNNs dominated computer vision until Vision Transformers (ViTs) became competitive, and modern architectures such as ConvNeXt still build on convolutions. For medical imaging (X-ray, ECG as image, histology), a pre-trained ResNet or EfficientNet fine-tuned on the clinical task is a standard approach."
