CNN: Interview Q&A
Six key interview questions on CNN architecture, filters, pooling, skip connections, transfer learning, and medical imaging applications.
Q1: How does a convolutional layer work and why is it better than a fully connected layer for images?
Answer: A convolutional layer slides a small weight matrix (kernel) over the input, computing a dot product at each position. This gives two key properties: (1) Local connectivity: each output depends on a small local region of the input, not on all pixels; (2) Parameter sharing: the same kernel is reused at every spatial position. For a 224×224 RGB image, a fully connected layer needs 224×224×3 × hidden_size ≈ 150K × hidden_size weights, which is about 9.6M for hidden_size = 64. A 3×3 conv with 64 filters needs only 3×3×3×64 = 1,728 weights (1,792 with biases), roughly 5,000× fewer. Parameter sharing also makes the layer translation-equivariant: a kernel that detects a feature in the top-left detects the same feature in the bottom-right, and pooling later turns this into approximate invariance.
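To make the sliding dot product and the parameter arithmetic concrete, here is a minimal pure-Python sketch. The helper names (`conv2d_valid`, `conv_params`, `fc_params`) are illustrative and do not come from any library, and the single-channel, stride-1, no-padding setup is a simplifying assumption. Like deep learning frameworks, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), one dot product per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights are shared across positions, so image size does not appear here."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def fc_params(in_features, out_features, bias=True):
    """Every input connects to every output."""
    return in_features * out_features + (out_features if bias else 0)


# A vertical-edge kernel applied to an image with a dark-to-bright step.
edge_kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
image = [[0, 0, 0, 1, 1, 1] for _ in range(3)]
response = conv2d_valid(image, edge_kernel)  # non-zero only where the window spans the step

n_conv = conv_params(3, 64, 3)        # 3x3 conv, RGB in, 64 filters
n_fc = fc_params(224 * 224 * 3, 64)   # flattened 224x224 RGB image to 64 units
print(response, n_conv, n_fc, n_fc // n_conv)
```

The same nine kernel weights produce every entry of `response`, which is the parameter-sharing property in miniature; `conv_params` never needs to know the image size, while `fc_params` grows with it.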
import torch
import torch.nn as nn

# Parameter comparison for a 224×224 RGB input and 64 output features/channels
fc_layer = nn.Linear(224 * 224 * 3, 64)                  # 9,633,856 params (~9.6M)
conv_layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 1,792 params

print(f"FC layer: {sum(p.numel() for p in fc_layer.parameters()):,} parameters")
print(f"Conv layer: {sum(p.numel() for p in conv_layer.parameters()):,} parameters")

# The outputs differ in shape: the FC layer returns a flat 64-vector per image,
# while the conv layer returns a 64×224×224 feature map that keeps spatial layout.
# The conv layer does this with roughly 5,000× fewer parameters.

Q2: What is the purpose of MaxPooling and Global Average Pooling?
Answer: MaxPooling (typically 2×2, stride 2) halves each spatial dimension while keeping the strongest activation in each window. It provides approximate invariance to small shifts: a feature only needs to appear somewhere in the 2×2 region. It also reduces computational cost for subsequent layers. Global Average Pooling (GAP) replaces the final feature map (e.g., 7×7×512) with a single value per channel, a vector of shape (512,), by averaging all spatial positions. GAP has two advantages: (1) No fixed-size constraint: the model can process any input resolution; (2) No weights of its own, so the head has far fewer parameters than stacked FC layers, which acts as a natural regulariser. Modern architectures (ResNet, EfficientNet) use GAP; VGG ended with large FC layers, a design that is now largely abandoned.
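The two pooling operations are simple enough to write out by hand. The sketch below works on one channel stored as a nested list; the function names are illustrative, and dropping an odd trailing row or column is an assumption that mirrors the usual floor behaviour.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on one channel (odd trailing row/column is dropped)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def global_avg_pool(fmap):
    """Average every spatial position of one channel into a single value."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 9, 7, 8],
]
pooled = max_pool_2x2(fmap)  # 4x4 -> 2x2

# The same activation at two positions inside one window pools to the same output.
shifted_a = max_pool_2x2([[7, 0], [0, 0]])
shifted_b = max_pool_2x2([[0, 0], [0, 7]])

# GAP returns one number per channel whatever the spatial size.
gap_small = global_avg_pool(pooled)
gap_large = global_avg_pool(fmap)
print(pooled, shifted_a == shifted_b, gap_small, gap_large)
```

`shifted_a` and `shifted_b` being equal is the approximate shift invariance described above, and `global_avg_pool` accepting both a 2×2 and a 4×4 map is why a GAP head places no constraint on input resolution.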
import torch
import torch.nn as nn

# MaxPool: 2× spatial downsampling
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(8, 64, 14, 14)
print(f"After MaxPool: {maxpool(x).shape}")  # torch.Size([8, 64, 7, 7])

# Global Average Pool: any spatial size -> (B, C, 1, 1)
gap = nn.AdaptiveAvgPool2d(1)
print(f"After GAP: {gap(x).shape}")  # torch.Size([8, 64, 1, 1])

# Complete classifier head for a 512-channel backbone
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # (B, 512, 1, 1)
    nn.Flatten(),             # (B, 512)
    nn.Dropout(0.5),
    nn.Linear(512, 2),        # (B, 2): binary / 2-class logits
)

Q3: Why did ResNet outperform much deeper plain networks?
Answer: Plain deep networks (like VGG) suffer from the degradation problem: adding layers beyond roughly 20 makes training accuracy worse. The cause is not overfitting; the optimiser struggles to learn identity mappings through many stacked non-linear transformations. ResNet introduced residual (skip) connections: output H(x) = F(x) + x. Learning F(x) = 0 is easy (drive the weights toward zero), so a block can fall back to the identity when extra depth does not help. The skip also provides a gradient highway: dL/dx = dL/dH × (1 + dF/dx), where the '+1' term lets gradients flow even when F's gradient is small. ResNet-50 (about 25.6M parameters) outperforms VGG-19 (about 143.7M parameters) on ImageNet with roughly 5.6× fewer parameters.
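The gradient-highway argument can be checked numerically on a toy model. The sketch below assumes scalar layers with a linear branch F(x) = w·x, so dF/dx = w; it ignores ReLU and BatchNorm, and all function names are illustrative.

```python
def forward(x, weights, residual):
    """Pass a scalar through stacked toy layers with F(x) = w * x."""
    for w in weights:
        fx = w * x                      # toy residual branch, so dF/dx = w
        x = x + fx if residual else fx  # H(x) = F(x) + x  vs.  H(x) = F(x)
    return x


def analytic_grad(weights, residual):
    """Chain rule: multiply the per-layer factor (1 + dF/dx) or dF/dx."""
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad


def numeric_grad(weights, residual, x=1.0, eps=1e-6):
    """Central finite difference, used only to check the analytic formula."""
    hi = forward(x + eps, weights, residual)
    lo = forward(x - eps, weights, residual)
    return (hi - lo) / (2 * eps)


weights = [0.01] * 50  # 50 layers whose branch gradient is small
plain = analytic_grad(weights, residual=False)
skip = analytic_grad(weights, residual=True)
print(f"plain: {plain:.3e}  residual: {skip:.3f}")
```

With 50 layers the plain stack's gradient is vanishingly small, while the residual stack's stays of order one. The factor (1 + dF/dx) is not guaranteed to be at least 1, since a negative dF/dx shrinks it; the point is only that the identity path keeps the product away from zero when the branch gradients are small.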
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block on feature vectors; conv ResNets use Conv2d/BatchNorm2d in the same pattern."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.net(x))  # H(x) = F(x) + x, then ReLU

# If F learns to output 0: H(x) = x (identity, so no degradation)
# Gradient at x (before the final ReLU): dL/dx = dL/dH × (1 + dF/dx).
# The identity term keeps a direct gradient path even when dF/dx is close to zero.

Q4: Explain transfer learning for a chest X-ray classification task with limited data.
Answer: With 2,000 labelled chest X-rays, training a CNN from scratch would likely overfit. Transfer learning uses ImageNet-pretrained weights as a starting point. Strategy for this dataset size: freeze the backbone (ResNet stages layer1 to layer4), replace only the final fully connected layer, and train with a small learning rate (1e-4). For grayscale X-rays, adapt the first conv by averaging the pretrained RGB weights into a single input channel. Use medically appropriate augmentation: horizontal flip (the chest is roughly L/R symmetric, though flipping hides laterality), small rotation (≤10°), slight brightness/contrast jitter. Avoid vertical flip (inverts anatomy) and aggressive colour jitter (alters X-ray density readings). Even with the domain gap, ImageNet features (edges, textures) transfer well and usually outperform random initialisation by a wide margin at this data scale.
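Freezing a backbone has two practical details worth sketching: hand the optimiser only the trainable parameters, and keep BatchNorm layers in eval mode, because their running statistics update in training mode even when `requires_grad` is False. The example uses a hypothetical `TinyBackbone` as a stand-in for a pretrained network so that it runs without downloading weights; the helper names are illustrative, and freezing BatchNorm statistics is a common practice rather than something the answer above prescribes.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained CNN, small enough to run instantly."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(8)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, n_classes)

    def forward(self, x):
        x = torch.relu(self.bn1(self.conv1(x)))
        return self.fc(torch.flatten(self.pool(x), 1))


def freeze_all_but_head(model: nn.Module, head_prefix: str = "fc.") -> list:
    """Turn off gradients outside the head and return the trainable parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return [p for p in model.parameters() if p.requires_grad]


def keep_batchnorm_frozen(model: nn.Module) -> None:
    """Put BatchNorm layers in eval mode so running statistics stop updating."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()


model = TinyBackbone(n_classes=2)
trainable = freeze_all_but_head(model)
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # learning rate taken from the text

model.train()                 # switches every module, BatchNorm included, to training mode
keep_batchnorm_frozen(model)  # so call this again after every model.train()

mean_before = model.bn1.running_mean.clone()
loss = model(torch.randn(4, 1, 32, 32)).sum()
loss.backward()
optimizer.step()
n_trainable = sum(p.numel() for p in trainable)
print(n_trainable, torch.equal(mean_before, model.bn1.running_mean))
```

Only the head's 18 parameters are updated here. With a real ResNet-50 the same pattern applies with `head_prefix="fc."`, since torchvision names the classifier `fc`.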
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

def build_xray_classifier(n_classes: int = 2) -> nn.Module:
    # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Adapt for grayscale input: average the pretrained RGB kernels into one channel
    old_conv = model.conv1
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        model.conv1.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))

    # Freeze the whole backbone for a small dataset
    for param in model.parameters():
        param.requires_grad = False

    # Replace the head; freshly created layers are trainable by default
    model.fc = nn.Sequential(nn.Dropout(0.5), nn.Linear(2048, n_classes))
    return model

train_transform = T.Compose([
    T.Grayscale(num_output_channels=1),  # ensure one channel even if the file loads as RGB
    T.Resize(256), T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomRotation(10),
    T.ToTensor(),
    T.Normalize([0.5], [0.5]),  # grayscale normalisation
])

Q5: How do you evaluate an object detector for medical imaging?
Answer: Standard metric: mAP (mean Average Precision). For each class, compute the precision-recall curve at an IoU threshold (typically 0.5 for radiological objects). AP = area under the PR curve; mAP = mean of AP across classes. For clinical deployment, additional metrics matter: (1) Sensitivity/recall: how many true lesions are detected (missing a cancer is a Type II error with high cost); (2) False positives per scan: radiologists can typically tolerate about 1 to 2 FP/scan, and more causes alert fatigue; (3) Operating threshold: detection models output confidence scores, and the threshold sets the sensitivity/FP trade-off. Always present results at multiple thresholds (an FROC curve, i.e. free-response ROC) for clinical stakeholders rather than a single mAP number.
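An FROC curve can be sketched in a few lines once each detection has been marked as a true or false positive. The example below assumes that IoU matching has already been done upstream, that every detection has a distinct score, and that the data are made up for illustration; `froc_points` and `sensitivity_at` are hypothetical helper names.

```python
def froc_points(scores, is_tp, n_lesions, n_scans):
    """Return (false positives per scan, sensitivity) for each score threshold.

    Thresholds are swept from the highest score downwards, so each point adds
    one more accepted detection.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    tp = fp = 0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_scans, tp / n_lesions))
    return points


def sensitivity_at(points, max_fp_per_scan):
    """Highest sensitivity reachable without exceeding the FP/scan budget."""
    feasible = [sens for fps, sens in points if fps <= max_fp_per_scan]
    return max(feasible, default=0.0)


# Hypothetical detections pooled over 2 scans containing 4 lesions in total.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
is_tp = [True, True, False, True, False, False]
points = froc_points(scores, is_tp, n_lesions=4, n_scans=2)

sens_at_zero = sensitivity_at(points, 0.0)  # no false positives allowed
sens_at_half = sensitivity_at(points, 0.5)  # up to 0.5 FP per scan
print(points, sens_at_zero, sens_at_half)
```

Reporting sensitivity at a few FP/scan budgets gives clinical stakeholders the operating-point trade-off directly, which a single mAP value hides.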
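import torch
from torchvision.ops import box_iou

def compute_ap(
    pred_boxes: list, pred_scores: list, pred_labels: list,
    true_boxes: list, true_labels: list,
    iou_threshold: float = 0.5,
) -> float:
    """Simplified AP computation for a single class.

    Assumed layout: every argument has one entry per image; boxes are
    [x1, y1, x2, y2]. Labels are not used here, so filter the inputs to one
    class before calling.
    """
    # Rank all predictions by score (descending), remembering image and box index
    ranked = sorted(
        ((s, i, j) for i, scores in enumerate(pred_scores) for j, s in enumerate(scores)),
        reverse=True,
    )
    n_gt = sum(len(b) for b in true_boxes)
    matched = [set() for _ in true_boxes]  # ground-truth boxes already claimed
    tp = []
    for _, i, j in ranked:
        hit = 0.0
        if len(true_boxes[i]) > 0:
            ious = box_iou(
                torch.tensor([pred_boxes[i][j]], dtype=torch.float32),
                torch.tensor(true_boxes[i], dtype=torch.float32),
            )[0]
            # Greedy matching: take the best-overlapping unclaimed ground truth
            for g in torch.argsort(ious, descending=True).tolist():
                if ious[g] >= iou_threshold and g not in matched[i]:
                    matched[i].add(g)
                    hit = 1.0
                    break
        tp.append(hit)
    if n_gt == 0 or not tp:
        return 0.0
    tp_t = torch.tensor(tp)
    precision = torch.cumsum(tp_t, dim=0) / torch.arange(1, len(tp) + 1)
    # Non-interpolated AP: sum the precision at each true-positive rank, divide by n_gt.
    # For multi-class mAP in practice, prefer a tested library (torchmetrics, pycocotools).
    return float((precision * tp_t).sum() / n_gt)

Q6: What data augmentation strategies work for clinical imaging?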
Answer: Augmentation must preserve clinical validity. Safe: horizontal flip (chest, pathology: roughly L/R symmetric), small rotations (≤15°), mild brightness/contrast changes (±10 to 20%), random crops. Modality-dependent: for retinal images, vertical flip is valid (symmetric); for ECG/time-series, time warping and noise injection are valid. Avoid: vertical flip for chest X-rays (lung orientation matters), extreme colour jitter (changes HU values in CT), excessive cropping (may remove the lesion). Advanced: Mixup (blend two images and their labels), CutMix (paste a patch from one image into another). For highly imbalanced clinical datasets: oversample the minority class via augmentation, use a class-weighted loss (pos_weight in BCEWithLogitsLoss or the weight parameter of CrossEntropyLoss), or generate synthetic samples with a GAN or diffusion model (with caution about hallucinated pathology).
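For the class-weighted loss, the weights are usually derived from label counts. The sketch below shows two common heuristics in plain Python; the function names and the 9:1 label split are hypothetical, and the inverse-frequency formula is one convention among several.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_c), in class-id order."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]


def binary_pos_weight(labels, positive=1):
    """Ratio of negatives to positives, a usual choice for a pos_weight argument."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples; pos_weight is undefined")
    return n_neg / n_pos


labels = [0] * 90 + [1] * 10  # hypothetical 9:1 imbalance
class_weights = balanced_class_weights(labels)
pos_weight = binary_pos_weight(labels)
print(class_weights, pos_weight)
```

In PyTorch these values would typically be passed as `nn.CrossEntropyLoss(weight=torch.tensor(class_weights))` for multi-class logits, or `nn.BCEWithLogitsLoss(pos_weight=torch.tensor([pos_weight]))` for a single binary logit.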
import torchvision.transforms as T

# Safe clinical augmentation for chest X-ray
clinical_train_transform = T.Compose([
    T.Resize(320),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # mild scale variation
    T.RandomHorizontalFlip(0.5),                  # OK: roughly L/R symmetric
    T.RandomRotation(degrees=10),                 # small rotation
    T.ColorJitter(brightness=0.1, contrast=0.1),  # mild intensity shift
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),           # assumes single-channel input
    T.RandomErasing(p=0.1, scale=(0.01, 0.05)),   # simulate small occlusions (tensor-only op)
])

# Aggressive augmentation for pathology slides (more robust to transforms)
path_train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),   # OK: pathology has no canonical orientation
    T.RandomRotation(90),     # random angle in [-90°, 90°]; any orientation is valid
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.ToTensor(),
])

Interview Answer
"CNNs are ideal for images because convolutions exploit spatial locality (nearby pixels are correlated) and gain translation equivariance from parameter sharing. Key architecture choices: 3×3 conv stacks with stride-2 downsampling, BatchNorm + ReLU after each conv, global average pooling before the classification head. ResNet's skip connections (output = F(x) + x) solved degradation by providing a gradient highway, enabling 100+ layer networks. For clinical AI: use a pretrained ResNet-50 as backbone, adapt the first conv for grayscale, freeze layers based on dataset size, and apply only clinically valid augmentations. Object detection adds box localisation on top of classification, with IoU for matching and NMS for removing duplicates; report FROC curves rather than a single mAP number to clinical stakeholders. The most important interview point: always check whether an ImageNet-pretrained model is appropriate or whether domain-specific pretraining (for example on CheXpert, or a model such as BioViL) would better reduce the domain gap."