Learnixo

Convolutional Neural Networks: Introduction

What CNNs are, why convolution works for images, the key components, and how they compare to fully connected networks.

Asma Hafeez Khan · May 21, 2026 · 4 min read
Tags: Deep Learning, CNN, Convolutional Networks, Computer Vision, Interview

Why Not Just Fully Connected?

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values (50,176 pixels × 3 colour channels).

Fully connecting it to a first hidden layer of 1,024 neurons:
  Parameters: 150,528 × 1,024 ≈ 154M weights, just for the first layer!
  
  This ignores spatial structure: the network has no knowledge
  that neighbouring pixels are related.

CNNs exploit two properties of images:
  1. Locality: nearby pixels are strongly related; distant pixels much less so
  2. Translation invariance: a cat is a cat whether it appears top-left or bottom-right

The Convolution Operation

A filter (kernel) slides across the input image.
At each position, it computes the dot product with the patch it covers.

Kernel (3×3 horizontal edge detector):
  [[-1, -1, -1],
   [ 0,  0,  0],
   [ 1,  1,  1]]

Sliding this kernel across an image:
  At each position: sum of the element-wise products (kernel × image_patch)
  Produces a feature map: high values where this pattern exists
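
The sliding step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `conv2d_valid` and the toy image are made up for this example. Strictly speaking it computes a cross-correlation (the kernel is not flipped), which is also what deep learning frameworks call convolution.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the patch whose top-left corner is (i, j)
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * image[i + di][j + dj]
            row.append(total)
        out.append(row)
    return out


kernel = [[-1, -1, -1],
          [ 0,  0,  0],
          [ 1,  1,  1]]

# Toy 6x5 image (hypothetical): three dark rows (0) above three bright rows (1)
image = [[0] * 5 for _ in range(3)] + [[1] * 5 for _ in range(3)]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [0, 0, 0]
# [3, 3, 3]
# [3, 3, 3]
# [0, 0, 0]
```

The feature map is non-zero only in the rows where the window straddles the dark-to-bright boundary, which is the "high values where this pattern exists" behaviour described above.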

Key properties:
  Local: the kernel only sees a 3×3 patch at a time
  Shared weights: the SAME kernel is applied at every position
  Equivariant: if the pattern shifts, the high activation shifts too
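
Equivariance is easiest to see in one dimension. The sketch below uses a hypothetical 1D analogue of the edge kernel, `[-1, 0, 1]`, and two made-up signals containing the same step at different positions; none of these names come from a library.

```python
def conv1d_valid(signal, kernel):
    """1D sliding dot product, no padding, stride 1."""
    k = len(kernel)
    return [sum(kernel[d] * signal[i + d] for d in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [-1, 0, 1]                    # the same 3 weights are reused at every position

signal  = [0, 0, 1, 1, 1, 1, 1, 1]     # step at index 2
shifted = [0, 0, 0, 0, 1, 1, 1, 1]     # same step, moved 2 places to the right

resp = conv1d_valid(signal, kernel)
resp_shifted = conv1d_valid(shifted, kernel)

print(resp)          # [1, 1, 0, 0, 0, 0]
print(resp_shifted)  # [0, 0, 1, 1, 0, 0]
```

Shifting the input by two positions shifts the response by two positions, with no change to the kernel. That is equivariance: the detector does not need to be relearned for each location.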

Key Components of a CNN

Conv layer:   applies n_filters kernels → n_filters feature maps
MaxPool:      downsamples (reduces spatial size), takes max in each window
ReLU:         non-linearity (element-wise)
BatchNorm:    normalises each channel using statistics computed over the batch
Flatten:      converts spatial feature maps to a vector
Fully connected: final classification layer
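
To make ReLU and MaxPool concrete, here is a small pure-Python sketch applied to a made-up 4×4 feature map. It assumes a 2×2 window with stride 2 and an input with even height and width; the helper names are illustrative only.

```python
def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]


fmap = [[ 1, -2,  0,  4],
        [-3,  5, -1,  2],
        [ 0,  0,  7, -6],
        [-8,  1,  3,  3]]

activated = relu(fmap)          # negatives clipped to 0
pooled = max_pool_2x2(activated)

print(pooled)   # [[5, 4], [1, 7]]
```

Pooling halves each spatial dimension (4×4 → 2×2) and keeps only the strongest activation in each window, so small shifts of a feature inside a window leave the output unchanged.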

Typical structure:
  [Conv → BN → ReLU → MaxPool] × k layers   (feature extraction)
  [Flatten → Linear → ReLU → Dropout] × m   (classification head)
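
It helps to track how the spatial size changes through the feature-extraction blocks. The sketch below uses the standard output-size formula, floor((n + 2p - k) / s) + 1, and assumes k = 3 blocks, each with a 3×3 convolution (padding 1, stride 1) followed by 2×2 max pooling (stride 2) on a 224×224 input. The helper `out_size` and the channel widths are assumptions chosen for illustration.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes = []
for out_channels in (32, 64, 128):                          # assumed channel widths
    size = out_size(size, kernel=3, stride=1, padding=1)    # 3x3 conv, padding 1: size unchanged
    size = out_size(size, kernel=2, stride=2)               # 2x2 pool, stride 2: size halved
    sizes.append(size)
    print(f"after block with {out_channels} filters: {out_channels} x {size} x {size}")

print(sizes)   # [112, 56, 28]
```

With padding 1, a 3×3 convolution preserves height and width, so all the downsampling comes from the pooling layers.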

PyTorch CNN Implementation

Python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Simple CNN for binary image classification."""
    
    def __init__(self, in_channels: int = 3, n_classes: int = 2):
        super().__init__()
        
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1: (B, 3, 224, 224) → (B, 32, 112, 112)
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2: (B, 32, 112, 112) → (B, 64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3: (B, 64, 56, 56) → (B, 128, 28, 28)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        # Global average pooling: (B, 128, 28, 28) → (B, 128, 1, 1); Flatten in the head gives (B, 128)
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.global_pool(x)
        x = self.classifier(x)
        return x


# Test
model = SimpleCNN(in_channels=3, n_classes=2)
x = torch.randn(8, 3, 224, 224)    # batch of 8 RGB images
output = model(x)
print(f"Output shape: {output.shape}")   # torch.Size([8, 2])

# Parameter count
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")       # 102,082 (~100K), far fewer than the 154M fully connected layer
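
As a cross-check, the parameter count can be worked out by hand, layer by layer. This sketch assumes PyTorch's defaults (bias terms enabled on Conv2d and Linear, one learnable scale and one shift per channel in BatchNorm2d). The helper functions are illustrative and not part of PyTorch.

```python
def conv_params(c_in, c_out, k):
    """k x k kernel over c_in channels, plus one bias, for each output filter."""
    return c_out * (c_in * k * k + 1)


def bn_params(channels):
    """Learnable scale and shift per channel (running statistics are not parameters)."""
    return 2 * channels


def linear_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_out * (n_in + 1)


layers = {
    "conv1": conv_params(3, 32, 3),     # 896
    "bn1": bn_params(32),               # 64
    "conv2": conv_params(32, 64, 3),    # 18,496
    "bn2": bn_params(64),               # 128
    "conv3": conv_params(64, 128, 3),   # 73,856
    "bn3": bn_params(128),              # 256
    "fc1": linear_params(128, 64),      # 8,256
    "fc2": linear_params(64, 2),        # 130
}

total = sum(layers.values())
print(f"Total: {total:,}")   # Total: 102,082
```

Most of the parameters sit in the third convolution, because the cost of a conv layer grows with the product of its input and output channels.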

How Convolution Reduces Parameters

Fully connected (150,528 inputs → 1,024 neurons):
  Parameters: 150,528 × 1,024 ≈ 154M

CNN first layer (3-channel input, 32 filters, 3×3 kernel):
  Parameters: 32 × (3 × 3 × 3 + 1) = 896   ← about 172,000× fewer

Why fewer? Weight sharing:
  The same 3×3×3 kernel (27 weights) is applied at ALL 224×224 = 50,176 positions
  Instead of 50,176 × 27 ≈ 1.35M unique weights per filter, we have just 27 shared weights
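
The arithmetic above can be reproduced in a few lines. This is a plain calculation using the sizes quoted in this section; the variable names are only illustrative.

```python
h, w, c = 224, 224, 3          # input height, width, channels
hidden = 1024                  # neurons in the fully connected layer
n_filters, k = 32, 3           # conv layer: 32 filters of size 3x3

fc_weights = h * w * c * hidden                   # every input connected to every neuron
conv_params = n_filters * (k * k * c + 1)         # 27 weights + 1 bias per filter
unshared_per_filter = (h * w) * (k * k * c)       # if each position had its own kernel

print(f"{fc_weights:,}")                # 154,140,672
print(f"{conv_params:,}")               # 896
print(f"{fc_weights // conv_params:,}") # 172,032
print(f"{unshared_per_filter:,}")       # 1,354,752
```

The last figure shows what local connectivity alone would cost without sharing: about 1.35M weights per filter, against the 27 that a shared kernel needs.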

This sharing forces the network to learn features that work at any position:
  an "edge detector" fires wherever the edge appears, not just in the top-left corner.

CNN vs Fully Connected Comparison

                  | Fully Connected              | CNN
------------------|------------------------------|--------------------------------------
Parameters        | Massive (154M, first layer)  | Efficient (896, first conv layer)
Spatial structure | Ignored                      | Exploited via local receptive fields
Translation inv.  | No                           | Approximate (via pooling)
Good for images   | No                           | Yes
Good for tabular  | Yes                          | No
Parameter memory  | Very high                    | Much lower
SOTA on images    | No                           | ResNet, EfficientNet, ConvNeXt

For medical imaging (X-ray, histology, fundus photos):
  CNNs (and Vision Transformers) are the standard approach.

Interview Answer

"CNNs use convolutional layers: a small kernel (typically 3×3) slides across the input and computes a dot product at every position. Because the same kernel is applied everywhere (weight sharing), the layer is translation-equivariant (a feature is detected wherever it appears) and needs drastically fewer parameters: 32 kernels × (3×3×3 weights + 1 bias) = 896 parameters, versus 154M for a fully connected first layer on a 224×224 RGB image. The architecture stacks convolution + batch norm + activation + pooling blocks to progressively extract features, from edges to objects. CNNs dominated computer vision until Vision Transformers (ViTs) became competitive, and modern ConvNets such as ConvNeXt show that convolutional designs remain strong. For medical imaging (X-ray, ECG as image, histology), a pre-trained ResNet or EfficientNet fine-tuned on the clinical task is a standard approach."
