Regularisation — Interview Q&A

Q1: What is regularisation and why is it needed in deep learning?

Answer: Regularisation refers to techniques that reduce overfitting — when a model memorises training noise rather than learning generalisable patterns. In deep learning, models are often over-parameterised (more parameters than training examples), making overfitting likely without regularisation. Overfitting manifests as a large gap between training loss and validation loss. Regularisation methods constrain the model in different ways: L2 weight decay penalises large weights, Dropout randomly disables neurons, BatchNorm normalises activations, early stopping limits training time, and data augmentation increases effective dataset diversity. In clinical AI, overfitting is especially dangerous because models may perform well on the development hospital's data but fail when deployed to a different institution.

Python

import torch
import torch.nn as nn

# Regularised clinical model combining multiple techniques
class RegularisedClinicalMLP(nn.Module):
    def __init__(self, n_features: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),    # normalisation
            nn.ReLU(),
            nn.Dropout(0.4),        # dropout
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),
        )
    
    def forward(self, x):
        return self.net(x)

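# AdamW applies decoupled weight decay, which plays the role of an L2 penalty.
# Keep a reference to the model so the optimizer updates the weights you train.
model = RegularisedClinicalMLP()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,   # decoupled weight decay (L2-style shrinkage)
)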
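
Early stopping is listed in the answer but not shown in the model code. The following is a minimal sketch of the bookkeeping, assuming validation loss is the monitored metric. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are hypothetical illustrations, not part of PyTorch.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop (hypothetical helper)."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0       # in a real loop, save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement until epoch 3, then overfitting
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In a real training loop, the weights saved at `best_epoch` would be restored after stopping, so the deployed model is the one with the lowest validation loss rather than the last one trained.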

Q2: How does Dropout work, and why does it help generalisation?

Answer: During training, Dropout randomly sets a fraction p of neurons to 0 on each forward pass. The remaining activations are multiplied by 1/(1-p) so that the expected output magnitude is preserved (inverted dropout). This prevents any neuron from relying too heavily on specific upstream neurons: each neuron must learn features that are useful regardless of which other neurons are present. One interpretation is that Dropout trains an implicit ensemble of up to 2^N weight-sharing sub-networks (for N droppable neurons), which a single model approximates at inference. At inference, Dropout is disabled and all neurons are active. A common bug is forgetting to call model.eval() before validation, which leaves Dropout active and makes validation metrics noisy and unreliable.

Python

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.Dropout(0.5), nn.ReLU(), nn.Linear(64, 1))
X = torch.randn(32, 10)

# Training mode: Dropout active
model.train()
out_1 = model(X)
out_2 = model(X)
print(f"Training: outputs differ? {not torch.allclose(out_1, out_2)}")  # True — stochastic

# Eval mode: Dropout inactive, deterministic
model.eval()
out_3 = model(X)
out_4 = model(X)
print(f"Eval: outputs identical? {torch.allclose(out_3, out_4)}")       # True

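# MC Dropout: keep Dropout active at inference for uncertainty estimation.
# model.train() is fine here because this model has no BatchNorm; otherwise
# switch only the nn.Dropout modules to train mode.
model.train()
with torch.no_grad():
    mc_preds = torch.stack([torch.sigmoid(model(X).squeeze(-1)) for _ in range(30)])
uncertainty = mc_preds.std(dim=0)
print(f"MC Dropout uncertainty: mean={uncertainty.mean():.4f}")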
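
The 1/(1-p) rescaling described in the answer can be sketched by hand. This is an illustration only, assuming an all-ones input so the effect on the mean is easy to read; `inverted_dropout` is a hypothetical helper, and `nn.Dropout` should be used in practice.

```python
import torch


def inverted_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Hand-written inverted dropout for illustration; use nn.Dropout in practice."""
    if not training or p == 0.0:
        return x                                          # inference: identity
    keep_prob = 1.0 - p
    mask = (torch.rand_like(x) < keep_prob).to(x.dtype)   # 1 = keep, 0 = drop
    return x * mask / keep_prob                           # rescale survivors by 1/(1-p)


torch.manual_seed(0)
x = torch.ones(100_000)
p = 0.4

train_out = inverted_dropout(x, p, training=True)
eval_out = inverted_dropout(x, p, training=False)

dropped_fraction = (train_out == 0).float().mean().item()
print(f"Fraction dropped:  {dropped_fraction:.3f}")         # close to p
print(f"Mean in training:  {train_out.mean().item():.3f}")  # close to 1.0
print(f"Mean at inference: {eval_out.mean().item():.3f}")
```

Because the surviving activations are scaled up during training, no rescaling is needed at inference, which is why eval mode can simply pass the input through.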

Q3: What is the difference between Dropout and BatchNorm for regularisation?

Answer: Dropout and BatchNorm both regularise, but through different mechanisms. Dropout randomly silences neurons, forcing redundancy and preventing co-adaptation; it works best in MLPs and moderate-depth CNNs. BatchNorm normalises activations to zero mean and unit variance per mini-batch; because each mini-batch has slightly different statistics, this injects noise. It was designed primarily to stabilise training, and the regularisation is a side effect. Both layers behave differently in training and inference, so both require model.eval() at inference (BatchNorm then uses its running statistics instead of batch statistics). They interact poorly: in CNNs, placing Dropout before BatchNorm tends to degrade performance because Dropout changes the activation variance between training and inference, which disturbs BatchNorm's variance estimates. The standard modern practice is BatchNorm in the main body (Conv + BN + ReLU) and Dropout only in the final classification head (after global average pooling, GAP).

Python

import torch.nn as nn

# Good practice: BN in body, Dropout only in head
class WellDesignedCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1, bias=False),
            nn.BatchNorm2d(64),   # BN in backbone (no Dropout here)
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),       # Dropout only in classification head
            nn.Linear(128, n_classes),
        )
    
    def forward(self, x):
        return self.head(self.backbone(x))
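
One way to see why Dropout placed before BatchNorm causes trouble: inverted dropout preserves the mean of the activations but not their variance. For zero-mean, unit-variance inputs and p = 0.5, the training-mode variance is about 1/(1-p) = 2, while in eval mode it stays at 1, so the running statistics a downstream BatchNorm accumulates during training no longer match what it sees at inference. A small sketch, assuming synthetic Gaussian activations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.randn(1_000_000)           # synthetic unit-variance activations

drop.train()
var_train = drop(x).var().item()     # what a following BatchNorm sees in training

drop.eval()
var_eval = drop(x).var().item()      # what it sees at inference

print(f"Variance after Dropout, train mode: {var_train:.2f}")
print(f"Variance after Dropout, eval mode:  {var_eval:.2f}")
```

Keeping Dropout in the classification head, after the last BatchNorm layer, avoids feeding this train/inference mismatch into any normalisation statistics.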

Q4: When should you use each regularisation technique?

Answer: No single technique dominates — combine based on dataset and architecture:

Always: Early stopping (near-zero cost given a validation set, universally applicable)
Always: weight decay via AdamW (weight_decay=1e-4 is a reasonable starting point for most models; PyTorch's own AdamW default is 1e-2)
CNNs: BatchNorm in every conv block; Dropout 0.5 in final head only
MLPs: BatchNorm + Dropout 0.3–0.5 in hidden layers
Transformers: LayerNorm + Dropout 0.1 in attention and FFN; weight decay 0.01–0.1
Small datasets (< 5K): aggressive Dropout (0.5), strong augmentation, smaller architecture
Large datasets (> 100K): mild Dropout (0.1–0.2), L2 less critical but still useful
Class imbalance: weighted loss or Focal loss (not regularisation per se, but prevents degenerate solutions)

Python

def get_regularisation_config(
    dataset_size: int,
    architecture: str,
) -> dict:
    """Return starting-point hyperparameters; tune them on validation data."""
    if dataset_size < 5_000:
        config = {"dropout": 0.5, "weight_decay": 1e-3, "label_smoothing": 0.1}
    elif dataset_size < 50_000:
        config = {"dropout": 0.3, "weight_decay": 1e-4, "label_smoothing": 0.05}
    else:
        config = {"dropout": 0.1, "weight_decay": 1e-5, "label_smoothing": 0.0}

    # Transformers follow the guideline above: Dropout 0.1, weight decay 0.01-0.1
    if architecture == "transformer":
        config.update({"dropout": 0.1, "weight_decay": 0.01})
    return config

config = get_regularisation_config(dataset_size=10_000, architecture="mlp")
print(config)
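
The class-imbalance guideline mentions a weighted loss. A minimal sketch using the `pos_weight` argument of `nn.BCEWithLogitsLoss` follows, assuming a binary task with a 9:1 negative-to-positive ratio. The labels and the constant logits are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = torch.cat([torch.zeros(90), torch.ones(10)])

n_pos = y.sum()
n_neg = y.numel() - n_pos
pos_weight = (n_neg / n_pos).reshape(1)    # 9.0: each positive counts nine times

weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
unweighted = nn.BCEWithLogitsLoss()

# A degenerate model that confidently predicts "negative" for everyone
logits = torch.full_like(y, -3.0)

loss_unweighted = unweighted(logits, y).item()
loss_weighted = weighted(logits, y).item()

print(f"pos_weight:      {pos_weight.item():.1f}")
print(f"Unweighted loss: {loss_unweighted:.4f}")
print(f"Weighted loss:   {loss_weighted:.4f}")
```

The weighted loss penalises the all-negative solution far more heavily than the unweighted one, which is what discourages the degenerate solution the guideline refers to.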

Q5: A clinical model shows AUC=0.92 on development data but AUC=0.71 on an external validation set. What do you do?

Answer: A 0.21 AUC drop between development and external validation is a major red flag: the model is not generalising beyond the development site, whether through overfitting, domain shift, or data leakage. Investigation and remediation steps:

Diagnose the source: compare feature distributions between sites (age, comorbidities, scanner type). If distributions differ significantly, it's domain shift, not just overfitting.
Data leakage audit: check for temporal leakage (future data in training), site leakage (same patient in train and test), or preprocessing leakage (scaler fit on all data).
Add regularisation: once leakage is ruled out, increase Dropout, add weight decay, reduce architecture complexity.
Site-aware training: if multi-site data is available, use GroupKFold by site for validation to detect overfitting to any single site during development.
Domain adaptation: if external data is available (even unlabelled), use domain adaptation techniques (feature normalisation, adversarial training).
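
The first step, comparing feature distributions between sites, can be sketched with a standardised mean difference (SMD) per feature. This is one possible diagnostic rather than a prescribed method: the data below is synthetic, the feature names are placeholders, and the 0.1 flagging threshold is a commonly used rule of thumb that is treated as an assumption here.

```python
import numpy as np


def standardised_mean_difference(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float(abs(a.mean() - b.mean()) / pooled_sd)


rng = np.random.default_rng(0)

# Synthetic stand-ins for two sites (not real patient data):
# the external site is older on average, BMI is drawn from the same distribution
dev_site = {"age": rng.normal(55, 12, 2000), "bmi": rng.normal(27, 4, 2000)}
ext_site = {"age": rng.normal(68, 12, 2000), "bmi": rng.normal(27, 4, 2000)}

smd = {
    name: standardised_mean_difference(dev_site[name], ext_site[name])
    for name in dev_site
}
for name, value in smd.items():
    flag = "SHIFTED" if value > 0.1 else "ok"
    print(f"{name:5s} SMD={value:.2f} [{flag}]")
```

Features with a large SMD point towards domain shift rather than overfitting alone, which changes the remedy: more regularisation will not fix a population the model has never seen.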

Python

import torch
import torch.nn as nn
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_cross_site(
    model: nn.Module,
    site_data: dict[str, tuple],
    device: torch.device,
) -> dict[str, float]:
    """Evaluate model on each site separately to identify performance gaps."""
    model.eval()
    aucs = {}

    for site, (X, y) in site_data.items():
        with torch.no_grad():
            # squeeze(-1) keeps the batch dimension even for a single sample
            probs = torch.sigmoid(model(X.to(device)).squeeze(-1)).cpu().numpy()
        y_np = y.cpu().numpy()

        # AUC is undefined unless both classes are present
        if 0 < y_np.sum() < len(y_np):
            aucs[site] = roc_auc_score(y_np, probs)
            print(f"Site {site:20s}: AUC={aucs[site]:.4f}, n={len(y_np)}")
        else:
            print(f"Site {site:20s}: skipped (only one class present), n={len(y_np)}")

    if not aucs:
        print("No site had both classes; AUC could not be computed.")
        return aucs

    mean_auc = float(np.mean(list(aucs.values())))
    print(f"\nMean AUC: {mean_auc:.4f}")
    for site, auc in aucs.items():
        if mean_auc - auc > 0.05:
            print(f"WARNING: {site} underperforms by {mean_auc - auc:.3f}")
    return aucs
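
The site-aware validation step can be sketched with scikit-learn's `GroupKFold`, using the site as the group so that no site appears in both the training and test folds. Fitting the scaler inside a pipeline also guards against the preprocessing leakage mentioned in the audit step. The data is synthetic, and a logistic regression stands in for the neural network to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 4 sites, 150 patients each, 5 features
n_sites, n_per_site, n_features = 4, 150, 5
X = rng.normal(size=(n_sites * n_per_site, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0).astype(int)
sites = np.repeat(np.arange(n_sites), n_per_site)

cv = GroupKFold(n_splits=n_sites)
fold_aucs = []
for train_idx, test_idx in cv.split(X, y, groups=sites):
    # No site appears on both sides of the split
    assert set(sites[train_idx]).isdisjoint(sites[test_idx])

    # The scaler is fit on the training fold only, avoiding preprocessing leakage
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(X[train_idx], y[train_idx])

    probs = clf.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], probs))
    held_out = sites[test_idx][0]
    print(f"Held-out site {held_out}: AUC={fold_aucs[-1]:.3f}")
```

With real multi-site data, a held-out site whose AUC falls well below the others during development is an early warning of the kind of gap described in the question.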

Interview Answer

"Regularisation prevents overfitting by constraining model complexity. Key techniques: (1) L2 weight decay — penalises large weights, equivalent to Gaussian prior; use AdamW with weight_decay=1e-4; (2) Dropout — randomly silences neurons during training, forcing redundancy; always disable with model.eval() for validation and inference; (3) BatchNorm — normalises activations per mini-batch, stabilises training with a regularising side effect; use LayerNorm for transformers; (4) Early stopping — stop at best validation epoch, restore saved weights; (5) Data augmentation — preserves labels while creating diverse training examples. For clinical AI: an AUC drop from development to external sites signals overfitting to development data — audit for leakage first (temporal, site-level, preprocessing), then increase regularisation. Always validate on an external holdout set that reflects the target deployment population."
