CNN — Interview Q&A

Q1: How does a convolutional layer work and why is it better than a fully connected layer for images?

Answer: A convolutional layer slides a small weight matrix (kernel) over the input, computing a dot product at each position. This gives two key properties: (1) Local connectivity: each output depends on a small local region of the input, not all pixels; (2) Parameter sharing: the same kernel is reused at every spatial position. For a 224×224 RGB image, a fully connected layer would need 224×224×3 × hidden_size ≈ 150K × hidden_size weights, about 9.6M for 64 hidden units. A 3×3 conv with 64 filters needs only 3×3×3×64 = 1,728 weights (1,792 with biases), roughly 5,000× fewer. Parameter sharing also makes convolution translation-equivariant: a kernel that detects a feature in the top-left detects the same feature in the bottom-right.

Python

import torch
import torch.nn as nn

# Parameter comparison for 224×224 RGB → 64 features
fc_layer   = nn.Linear(224*224*3, 64)   # 9.6M params
conv_layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 1,792 params

print(f"FC layer:   {sum(p.numel() for p in fc_layer.parameters()):,} parameters")
print(f"Conv layer: {sum(p.numel() for p in conv_layer.parameters()):,} parameters")

# The outputs differ in shape: the FC layer returns 64 values per image, while the
# conv layer returns a 64-channel 224×224 feature map.
# The conv layer does this with roughly 5,000× fewer parameters.
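
To make the sliding-window description concrete, here is a minimal pure-Python sketch of a single-channel convolution with no padding and stride 1 (strictly a cross-correlation, which is what deep-learning frameworks compute). The helper name `conv2d_valid` and the toy image and kernel are illustrative assumptions, not part of any library.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Local connectivity: only a kh x kw patch contributes to this output.
            # Parameter sharing: the same `kernel` is used at every (i, j).
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# A vertical edge: left half dark (0), right half bright (1)
image = [[0, 0, 1, 1] for _ in range(4)]
# Horizontal-difference kernel that responds to dark-to-bright transitions
kernel = [[-1, 1],
          [-1, 1]]

response = conv2d_valid(image, kernel)
for row in response:
    print(row)
```

Every output row shows the same response at the edge location, because the same four weights are applied at every position of the image.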

Q2: What is the purpose of MaxPooling and Global Average Pooling?

Answer: MaxPooling (typically 2×2, stride 2) halves spatial dimensions while keeping the strongest feature activation in each window. It provides approximate spatial invariance — a feature just needs to appear somewhere in the 2×2 region. It also reduces computational cost for subsequent layers. Global Average Pooling (GAP) replaces the final feature map (e.g., 7×7×512) with a single value per channel (512,) by averaging all spatial positions. GAP has two advantages: (1) No fixed-size constraint — the model can process any input resolution; (2) Fewer parameters than FC layers (no weights needed), acting as a natural regulariser. Modern architectures (ResNet, EfficientNet) use GAP; VGG used large FC layers at the end, which is now considered obsolete.

Python

import torch
import torch.nn as nn

# MaxPool: 2× spatial downsampling
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 14, 14)
print(f"After MaxPool: {maxpool(x).shape}")   # (8, 64, 7, 7)

# Global Average Pool: any size → (B, C, 1, 1)
gap = nn.AdaptiveAvgPool2d(1)
print(f"After GAP:     {gap(x).shape}")        # (8, 64, 1, 1)

# Complete classifier
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (B, 512, 1, 1)
    nn.Flatten(),               # (B, 512)
    nn.Dropout(0.5),
    nn.Linear(512, 2),          # (B, 2) — binary or 2-class
)
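
The two pooling operations can also be worked through by hand. The sketch below uses plain Python lists for a single channel; the function names and the 4×4 feature map are made up for illustration.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a single-channel map (list of lists)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def global_avg_pool(fmap):
    """Average every spatial position into one value for the channel."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool_2x2(fmap)
print(pooled)                   # [[4, 2], [2, 8]]
print(global_avg_pool(fmap))    # 2.6875
print(global_avg_pool(pooled))  # 4.0
```

Note that `global_avg_pool` returns a single number whether it receives the 4×4 map or the 2×2 pooled map, which is why a GAP-based head places no constraint on input resolution.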

Q3: Why did ResNet outperform much deeper plain networks?

Answer: Plain deep networks (VGG-style stacks without skip connections) suffer from the degradation problem: adding layers beyond roughly 20 makes training accuracy worse. The cause is not overfitting; the optimiser struggles to fit identity mappings through many stacked non-linear layers. ResNet introduced residual (skip) connections: output H(x) = F(x) + x. Learning F(x) = 0 is easy (drive the weights towards zero), so a block can fall back to the identity when extra depth doesn't help. The skip also provides a gradient highway: dL/dx = dL/dH × (1 + dF/dx). The '+1' term gives gradients a direct path even if F's gradient is small. ResNet-50 (about 25M parameters) outperforms VGG-19 (about 143M parameters) on ImageNet with roughly 5.7× fewer parameters.

Python

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.relu = nn.ReLU()
    
    def forward(self, x):
        return self.relu(x + self.net(x))   # H(x) = F(x) + x

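# If F learns to output 0: H(x) = relu(x), which equals x for non-negative x
# (e.g. when x comes from a previous ReLU), so the block is an identity: no degradation
# Gradient at x (before the final ReLU): dL/dx = dL/dH × (1 + dF/dx);
# the identity term keeps a direct gradient path even when dF/dx is small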
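
As a rough numerical illustration of the gradient-highway argument, the sketch below multiplies per-layer derivatives along a chain of scalar layers. It is a toy model under strong assumptions (scalar layers, the same small dF/dx = 0.1 at every layer, no normalisation), and the function name is made up for this example.

```python
def gradient_through_stack(local_grads, residual):
    """Multiply per-layer derivatives along a chain of scalar layers.

    Plain layer:    y = F(x)      -> dy/dx = dF/dx
    Residual layer: y = x + F(x)  -> dy/dx = 1 + dF/dx
    """
    grad = 1.0
    for g in local_grads:
        grad *= (1.0 + g) if residual else g
    return grad


depth = 20
local_grads = [0.1] * depth   # assumed small dF/dx at every layer

plain = gradient_through_stack(local_grads, residual=False)
resid = gradient_through_stack(local_grads, residual=True)
print(f"plain:    {plain:.3e}")
print(f"residual: {resid:.3e}")
```

With these assumed values the plain stack's gradient shrinks to about 1e-20, while the residual stack's stays of order one (about 6.7). Real networks involve Jacobian matrices rather than scalars, so this only conveys the direction of the effect.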

Q4: Explain transfer learning for a chest X-ray classification task with limited data.

Answer: With only 2,000 labelled chest X-rays, a CNN trained from scratch would likely overfit. Transfer learning starts from ImageNet-pretrained weights instead. Strategy for this dataset size: freeze the backbone (layers 1–4), replace only the final fully connected layer, and train with a small learning rate (1e-4). For grayscale X-rays, adapt the first conv by averaging the pretrained RGB weights across the channel dimension. Use medically appropriate augmentation: horizontal flip (the chest is roughly L/R symmetric, but skip it when laterality matters), small rotation (≤10°), slight brightness/contrast jitter. Avoid vertical flip (inverts anatomy) and aggressive intensity jitter (alters the apparent X-ray density). Even with the domain gap, ImageNet features (edges, textures) transfer well and usually outperform random initialisation by a wide margin.

Python

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

def build_xray_classifier(n_classes: int = 2) -> nn.Module:
    # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

    # Adapt for grayscale input: average the pretrained RGB kernels into one channel
    old_conv = model.conv1
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        model.conv1.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))

    # Freeze the whole backbone (including the adapted conv1) for a small dataset
    for param in model.parameters():
        param.requires_grad = False

    # Replace the head; newly created layers are trainable by default
    model.fc = nn.Sequential(nn.Dropout(0.5), nn.Linear(2048, n_classes))
    return model

train_transform = T.Compose([
    T.Grayscale(num_output_channels=1),   # ensure single-channel input for the 1-channel conv1
    T.Resize(256), T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomRotation(10),
    T.ToTensor(),
    T.Normalize([0.5], [0.5]),   # grayscale normalisation
])
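
One detail worth checking when averaging RGB kernels: if a grayscale image were instead replicated across three channels, the pretrained conv would respond with the sum of the per-channel weights, so an averaged kernel produces one third of that response. The sketch below shows this for a single kernel position with made-up weights. Whether the scaling matters in practice depends on the normalisation that follows, so treat it as a sanity check, not a recommendation.

```python
# Hypothetical pretrained weights for one output filter at one kernel position,
# one value per input channel (R, G, B). Values are made up for illustration.
w_rgb = [0.2, 0.5, 0.3]
pixel = 0.8   # a grayscale intensity

# Option 1: replicate the gray pixel to 3 channels and use the original weights
resp_replicated = sum(w * pixel for w in w_rgb)

# Option 2: single-channel conv initialised with the mean of the RGB weights
w_mean = sum(w_rgb) / len(w_rgb)
resp_mean = w_mean * pixel

# Option 3: single-channel conv initialised with the sum of the RGB weights
w_sum = sum(w_rgb)
resp_sum = w_sum * pixel

print(f"replicated: {resp_replicated:.4f}")
print(f"mean init:  {resp_mean:.4f}")
print(f"sum init:   {resp_sum:.4f}")
```

The sum-initialised kernel reproduces the replicated-channel response exactly, while the mean-initialised kernel is a scaled copy of it.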

Q5: How do you evaluate an object detector for medical imaging?

Answer: The standard metric is mAP (mean Average Precision). For each class, compute the precision-recall curve at an IoU threshold (commonly 0.5). AP is the area under the PR curve; mAP is the mean of AP across classes. For clinical deployment, additional metrics matter: (1) Sensitivity/recall: how many true lesions are detected (a missed cancer is a costly false negative, i.e. a Type II error); (2) False positives per scan: radiologists can tolerate roughly 1–2 FP/scan, and more than that causes alert fatigue; (3) Operating threshold: detection models output confidence scores, and the chosen threshold sets the sensitivity/FP trade-off. Present results at multiple thresholds (an FROC curve, free-response ROC) for clinical stakeholders rather than a single mAP number.

Python

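import torch
from torchvision.ops import box_iou

def compute_ap(
    pred_boxes: list, pred_scores: list, pred_labels: list,
    true_boxes: list, true_labels: list,
    iou_threshold: float = 0.5,
) -> float:
    """Simplified AP computation for single class.

    Every argument is a per-image list: pred_boxes[i] is an (N_i, 4) float tensor
    in (x1, y1, x2, y2) format, pred_scores[i] an (N_i,) tensor, true_boxes[i] an
    (M_i, 4) tensor. The label lists are ignored here (single class).
    """
    n_gt = sum(len(b) for b in true_boxes)
    if n_gt == 0:
        return 0.0

    # Flatten to (score, image index, box index), sorted by score descending
    preds = sorted(
        ((float(s), img, j) for img, scores in enumerate(pred_scores)
         for j, s in enumerate(scores)),
        key=lambda p: p[0], reverse=True,
    )
    if not preds:
        return 0.0

    matched = [set() for _ in true_boxes]   # ground truths already claimed, per image
    tp = []
    for _, img, j in preds:
        hit = 0.0
        if len(true_boxes[img]) > 0:
            # IoU of this prediction against every ground truth in the same image
            ious = box_iou(pred_boxes[img][j:j + 1], true_boxes[img])[0]
            for g in matched[img]:
                ious[g] = -1.0               # each ground truth matches at most once
            best = int(ious.argmax())
            if ious[best] >= iou_threshold:
                matched[img].add(best)
                hit = 1.0
        tp.append(hit)

    tp_cum = torch.tensor(tp).cumsum(0)
    ranks = torch.arange(1, len(tp) + 1)
    precision = tp_cum / ranks               # TP / (TP + FP) at each rank
    recall = tp_cum / n_gt
    # Area under the raw PR curve (no interpolation, unlike the VOC/COCO protocols)
    prev_recall = torch.cat([torch.zeros(1), recall[:-1]])
    return float(((recall - prev_recall) * precision).sum())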
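
The FROC idea from the answer (sensitivity against false positives per scan, at several thresholds) can be sketched in a few lines of plain Python. This assumes matching to ground truth has already been done, so each detection arrives as a `(score, is_true_positive)` pair; the function name and the toy data are made up for illustration.

```python
def froc_points(detections, n_lesions, n_scans, thresholds):
    """Sensitivity and false positives per scan at each confidence threshold.

    detections: list of (score, is_true_positive) pairs pooled over all scans.
                Matching to ground truth (e.g. by IoU) is assumed done, with
                each lesion matched by at most one detection.
    n_lesions and n_scans are assumed to be greater than zero.
    """
    points = []
    for t in thresholds:
        kept = [is_tp for score, is_tp in detections if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        points.append({
            "threshold": t,
            "sensitivity": tp / n_lesions,
            "fp_per_scan": fp / n_scans,
        })
    return points


# Toy data (made up): 4 lesions across 2 scans, 6 detections
detections = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.60, True), (0.40, False), (0.30, False),
]

points = froc_points(detections, n_lesions=4, n_scans=2, thresholds=[0.9, 0.5, 0.2])
for p in points:
    print(p)
```

Lowering the threshold from 0.9 to 0.5 raises sensitivity from 0.5 to 0.75 at a cost of 0.5 FP/scan; lowering it further to 0.2 only adds false positives. That trade-off is what the operating-point discussion with clinicians is about.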

Q6: What data augmentation strategies work for clinical imaging?

Answer: Augmentation must preserve clinical validity. Safe: horizontal flip (chest X-rays and pathology slides, which are usually treated as approximately L/R symmetric), small rotations (≤15°), mild brightness/contrast changes (±10–20%), random crops. Modality-dependent: for retinal images, vertical flip is generally considered valid; for ECG/time-series, time warping and noise injection are valid. Avoid: vertical flip for chest X-rays (lung orientation matters), extreme intensity jitter (changes HU values in CT), excessive cropping (may remove the lesion). Advanced: Mixup (blend two images and their labels), CutMix (paste a patch from one image into another). For highly imbalanced clinical datasets: oversample the minority class via augmentation, use a class-weighted loss (pos_weight in BCEWithLogitsLoss or the weight parameter of CrossEntropyLoss), or generate synthetic samples with GANs or diffusion models (with caution about hallucinated pathology).

Python

import torchvision.transforms as T

# Safe clinical augmentation for chest X-ray
clinical_train_transform = T.Compose([
    T.Resize(320),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # mild scale variation
    T.RandomHorizontalFlip(0.5),                   # OK: L/R symmetric
    T.RandomRotation(degrees=10),                  # small rotation
    T.ColorJitter(brightness=0.1, contrast=0.1),  # mild intensity shift
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
    T.RandomErasing(p=0.1, scale=(0.01, 0.05)),  # simulate small occlusions
])

# Aggressive augmentation for pathology slides (more robust to transforms)
path_train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),       # OK: pathology is rotation-invariant
    T.RandomRotation(90),         # random angle in [-90°, 90°], valid for slides
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.ToTensor(),
])
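
Two of the techniques named in the answer, Mixup and positive-class weighting, reduce to simple arithmetic. The sketch below uses plain Python lists instead of tensors; the function names are hypothetical, and drawing the mixing coefficient from Beta(alpha, alpha) with `alpha=0.2` is an assumed default, not a value taken from the text above.

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, lam=None):
    """Blend two images (flat lists of floats) and their one-hot labels.

    `lam` is drawn from Beta(alpha, alpha) unless given explicitly.
    """
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    mixed_image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


def pos_weight(n_negative, n_positive):
    """Weight for the positive class: number of negatives per positive."""
    return n_negative / n_positive


# Fixed lam so the result is reproducible
img, lab = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], lam=0.75)
print(img, lab)                                      # [0.25, 1.0] [0.75, 0.25]
print(pos_weight(n_negative=900, n_positive=100))    # 9.0
```

The blended label `[0.75, 0.25]` is a soft target, so the loss must accept class probabilities rather than integer class indices. As with synthetic data, blending pathology across patients warrants a validity check for the specific clinical task.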

Interview Answer

"CNNs are ideal for images because convolutions exploit spatial locality (nearby pixels are correlated) and gain translation equivariance from parameter sharing. Key architecture choices: 3×3 conv stacks with stride-2 downsampling, BatchNorm + ReLU after each conv, global average pooling before the classification head. ResNet's skip connections (output = F(x) + x) solved degradation by providing a gradient highway, enabling 100+ layer networks. For clinical AI: use a pretrained ResNet-50 as the backbone, adapt the first conv for grayscale, freeze layers based on dataset size, and apply only clinically valid augmentations. Object detection adds IoU and NMS on top of classification; for clinical stakeholders, report FROC curves rather than a single mAP number. The most important interview point: always check whether an ImageNet-pretrained model is appropriate or whether domain-specific pretraining (CheXpert, BioViL) would better reduce the domain gap."
