Deep Learning for AI Interviews · Lesson 32 of 56
Interview: Design a Neural Network for a Given Task
Q1: How does a neural network learn?
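Answer: A neural network learns by iteratively reducing a loss function with gradient-based optimisation. The forward pass propagates the input through the layers (Z = X·W + b, then an activation) to produce predictions. The loss function measures the prediction error against the ground truth. The backward pass applies the chain rule (automated by autograd in PyTorch) to compute dL/dW for every parameter. The optimiser then updates the weights: plain gradient descent uses W ← W - α·dL/dW, while AdamW, the optimiser in the PyTorch snippet for this question, rescales each step with running estimates of the gradient's first and second moments and applies decoupled weight decay. Repeated over many batches, the weights usually settle at values that give a low training loss, although a global minimum is not guaranteed.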
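To show what the four steps do without a framework, here is a minimal sketch that trains a one-feature logistic regression by hand. The toy data, the learning rate of 0.5, the step count, and the function names are all illustrative assumptions. The gradient expressions come from applying the chain rule manually, which is the work autograd would otherwise do.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one example; eps guards against log(0)."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def train(xs, ys, lr=0.5, steps=200):
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        # 1-2. forward pass and loss
        preds = [sigmoid(w * x + b) for x in xs]
        history.append(sum(bce(p, y) for p, y in zip(preds, ys)) / n)
        # 3. backward pass: for sigmoid + BCE the chain rule gives dL/dz = p - y
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        # 4. plain gradient descent update: W <- W - alpha * grad
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Hypothetical toy data: negative x -> class 0, positive x -> class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
w, b, history = train(xs, ys)
print(f"first loss={history[0]:.4f}, last loss={history[-1]:.4f}, w={w:.3f}")
```

With zero-initialised weights every prediction starts at 0.5, so the first recorded loss is ln 2 (about 0.693). The later losses should be lower as w grows positive.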
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.BCEWithLogitsLoss()
X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,)).float()
# The four steps of learning
optimizer.zero_grad() # 1. clear previous gradients
loss = criterion(model(X).squeeze(), y) # 2. forward + loss
loss.backward() # 3. backward (compute dL/dW)
optimizer.step() # 4. update weights
Q2: Why do we need activation functions?
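Answer: Without activation functions, any number of stacked linear layers collapses into a single linear (affine) transformation: Layer2(Layer1(x)) = W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), which is still linear in x. Non-linear activations break this collapse, allowing the network to approximate non-linear functions. This is required for any task beyond linearly separable data (for example, XOR is not linearly separable). The Universal Approximation Theorem states that a network with a single hidden layer and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy, provided the hidden layer is wide enough; it says nothing about how easily training finds such weights. ReLU is the standard hidden-layer activation because its gradient is exactly 1 for positive inputs, so it does not saturate there (which mitigates, but does not eliminate, vanishing gradients), and because it is cheap to compute. Its main weakness is that units stuck at negative pre-activations receive zero gradient ("dead ReLUs").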
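The collapse identity and the XOR claim above can both be checked numerically in plain Python. This is a sketch with illustrative assumptions: the weights are arbitrary values, and xor_net is a hypothetical hand-built network (not the result of training) that shows two ReLU units are enough to represent XOR.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Two stacked affine layers with arbitrary illustrative weights
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, 1.0]], [0.3]
x = [0.7, -1.5]

# Layer2(Layer1(x)) computed layer by layer
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map as one affine layer: (W2 W1) x + (W2 b1 + b2)
W_eq = matmul(W2, W1)
b_eq = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_eq, x), b_eq)
print(stacked, collapsed)  # equal up to floating-point rounding

def xor_net(x1, x2):
    """Hidden layer of two ReLU units, then a linear read-out."""
    h = relu([x1 + x2, x1 + x2 - 1.0])
    return h[0] - 2.0 * h[1]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

Removing the relu call from xor_net turns it into a single affine function of x1 + x2, which cannot return 0 for both (0, 0) and (1, 1) while returning 1 for the mixed inputs.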
import torch
import torch.nn as nn
# Without activation: just a linear model regardless of depth
linear_stack = nn.Sequential(nn.Linear(10, 32), nn.Linear(32, 1))
# This is equivalent to nn.Linear(10, 1) — depth adds nothing
# With activation: can learn non-linear boundaries
nonlinear_stack = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# Verify: stack of linear layers is still linear
W_equiv = linear_stack[1].weight @ linear_stack[0].weight
print(f"Weight product shape: {W_equiv.shape}") # (1, 10): one linear layer
Q3: What is the difference between underfitting and overfitting?
Answer: Underfitting: the model is too simple to capture the true pattern — high training loss AND high validation loss. Causes: insufficient capacity, too few epochs, too much regularisation. Fix: increase capacity (more layers/neurons), train longer, reduce regularisation. Overfitting: the model has memorised training noise — low training loss but HIGH validation loss (large gap). Causes: too much capacity relative to data, insufficient regularisation. Fix: add Dropout, weight decay, data augmentation, or reduce model size. The ideal is a small train/val gap with both losses low.
import torch
import torch.nn as nn
def check_train_val_gap(
    model: nn.Module,
    X_train: torch.Tensor, y_train: torch.Tensor,
    X_val: torch.Tensor, y_val: torch.Tensor,
    criterion: nn.Module,
) -> None:
    """Print a rough fit diagnosis; the 0.5 and 0.1 cut-offs are illustrative, not universal."""
    model.eval()  # disables Dropout, uses BatchNorm running stats; call model.train() before resuming training
    with torch.no_grad():
        train_loss = criterion(model(X_train).squeeze(-1), y_train).item()
        val_loss = criterion(model(X_val).squeeze(-1), y_val).item()
    gap = val_loss - train_loss
    if train_loss > 0.5:
        print(f"UNDERFIT: train={train_loss:.4f}, val={val_loss:.4f}")
    elif gap > 0.1:
        print(f"OVERFIT: train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
    else:
        print(f"HEALTHY: train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
Q4: How do you decide the architecture for a new task?
Answer: A five-step process: (1) Match the inductive bias to the data structure: CNNs for images and signals, Transformers for sequences, MLPs for tabular data. (2) Start small: 2–3 layers, 64–128 neurons per layer, Dropout 0.3, AdamW. (3) Establish a baseline: train for about 20 epochs and check that the loss decreases. (4) Diagnose: if the model underfits, increase capacity; if it overfits, add regularisation. (5) Iterate: run controlled ablations over depth and width, changing one factor at a time, rather than random search. For clinical tabular data with 10–100 features and 10K–100K samples, a 3-layer MLP [128, 64, 32] with BatchNorm and Dropout 0.3 is a reasonable starting point.
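To judge whether a starting architecture is "small" relative to the dataset, it helps to count its trainable parameters by hand. The sketch below does this for the [128, 64, 32] MLP mentioned above. The helper name count_mlp_params and the choice of 20 input features are illustrative assumptions. Each BatchNorm layer is assumed to contribute one learnable scale and one learnable shift per unit, with running statistics not counted as parameters.

```python
def count_mlp_params(n_features, hidden, n_out=1, batchnorm=True):
    """Trainable parameters of Linear(+BatchNorm) hidden blocks plus a linear output layer."""
    dims = [n_features] + list(hidden)
    total = 0
    for in_d, out_d in zip(dims[:-1], dims[1:]):
        total += in_d * out_d + out_d   # Linear: weight matrix + bias vector
        if batchnorm:
            total += 2 * out_d          # BatchNorm: scale (gamma) + shift (beta)
    total += dims[-1] * n_out + n_out   # output layer
    return total

n_params = count_mlp_params(20, [128, 64, 32])
print(f"Params: {n_params:,}")                       # Params: 13,505
print(f"Samples per parameter at 15,000 rows: {15_000 / n_params:.2f}")
```

For 20 features the count is 13,505, so a dataset of about 15,000 rows gives roughly one sample per parameter. That ratio is one reason the answer pairs this architecture with Dropout and suggests checking the train/validation gap before growing the model.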
import torch.nn as nn
def build_clinical_mlp(
    n_features: int,
    n_samples: int,
    task: str = "binary",
) -> nn.Module:
    """Build an MLP whose size follows rough dataset-size heuristics."""
    if task not in ("binary", "regression"):
        raise ValueError(f"Unsupported task: {task!r}")
    if n_samples < 5_000:
        hidden = [64, 32]
        dropout = 0.4
    elif n_samples < 50_000:
        hidden = [128, 64, 32]
        dropout = 0.3
    else:
        hidden = [256, 128, 64, 32]
        dropout = 0.2
    n_out = 1  # one logit (binary) or one value (regression)
    dims = [n_features] + hidden + [n_out]
    layers = []
    # Hidden blocks: Linear -> BatchNorm -> ReLU -> Dropout
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([
            nn.Linear(in_d, out_d),
            nn.BatchNorm1d(out_d),
            nn.ReLU(),
            nn.Dropout(dropout),
        ])
    layers.append(nn.Linear(hidden[-1], n_out))  # output layer, no activation
    return nn.Sequential(*layers)
model = build_clinical_mlp(n_features=20, n_samples=15_000)
n_params = sum(p.numel() for p in model.parameters())
print(f"Params: {n_params:,}")
Q5: Why does training loss sometimes spike mid-training?
Answer: Four common causes: (1) Learning rate too high: the update overshoots the minimum; lower the learning rate or add warm-up and a scheduler, with gradient clipping as a safety net. (2) Bad batch: a batch with extreme outliers produces an unusually large gradient update; check preprocessing for unnormalised features. (3) Gradient explosion: gradients can grow exponentially with depth; torch.nn.utils.clip_grad_norm_ caps the global gradient norm and limits the damage. (4) BatchNorm in the wrong mode: if model.eval() is called for validation and the model is never switched back, BatchNorm uses its running statistics and Dropout stays disabled for the rest of training; call model.train() before every training step and model.eval() only for validation.
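The idea behind norm-based gradient clipping can be sketched in a few lines of plain Python. This is an illustration of the technique, not PyTorch's implementation: the function name clip_by_global_norm, the nested-list gradient layout, and the eps value are assumptions made for the example. All gradients are rescaled by one common factor whenever their combined L2 norm exceeds max_norm, so the update keeps its direction but not its size.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm.

    grads: list of per-parameter gradient lists (hypothetical layout).
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, total_norm

# A "spiky" batch: global norm = sqrt(3^2 + 4^2 + 12^2) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for layer in clipped for g in layer))
print(f"before={norm:.2f}, after={clipped_norm:.4f}")

# A well-behaved batch (norm 0.5) passes through unchanged
small, small_norm = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
```

Logging the pre-clipping norm each step, as the returned value allows, makes it easier to tell an occasional bad batch from a learning rate that is too high throughout.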
import torch
import torch.nn as nn
def safe_training_step(
    model: nn.Module,
    X: torch.Tensor, y: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
) -> dict:
    model.train()  # CRITICAL: ensure train mode
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(-1), y)
    # Check the loss before backpropagating so NaN/Inf never reaches the gradients
    if not torch.isfinite(loss):
        print(f"WARNING: loss={loss.item()}, skipping step")
        return {"loss": float("nan"), "grad_norm": float("nan")}
    loss.backward()
    # clip_grad_norm_ returns the total gradient norm measured before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm):
        print("WARNING: non-finite gradient norm, skipping step")
        optimizer.zero_grad()
        return {"loss": loss.item(), "grad_norm": float("nan")}
    optimizer.step()
    return {"loss": loss.item(), "grad_norm": grad_norm.item()}
Q6: How do you interpret a model's predictions for clinical use?
Answer: Three layers of interpretation are needed: (1) Calibration: do predicted probabilities match observed frequencies? Among patients to whom a readmission model assigns 80% risk, roughly 80% should actually be readmitted. Use Expected Calibration Error (ECE) and reliability diagrams; recalibrate with Platt scaling or temperature scaling if needed. (2) Threshold selection: the default 0.5 threshold is rarely optimal for clinical use; choose the threshold from the desired sensitivity/specificity trade-off, informed by the clinical cost of false negatives and false positives. (3) Feature attribution: use SHAP or integrated gradients to explain individual predictions for clinical review.
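Points (1) and (2) can be made concrete with two small helpers in plain Python. This is a sketch under stated assumptions: the function names, the ten equal-width bins, the 0.75 sensitivity target, and the toy predictions are all illustrative. The ECE shown is the common binary variant that compares the mean predicted probability in each bin with the observed fraction of positives.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean of |mean predicted prob - observed positive rate| over bins."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - frac_pos)
    return ece

def threshold_for_sensitivity(probs, labels, target_sensitivity=0.9):
    """Highest threshold whose sensitivity meets the target; returns (threshold, sens, spec)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present")
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        if tp / n_pos >= target_sensitivity:
            return t, tp / n_pos, tn / n_neg
    raise ValueError("No threshold reaches the target sensitivity")

# Overconfident toy model: predicts 0.9 but only half the cases are positive
ece_bad = expected_calibration_error([0.9] * 4, [1, 0, 1, 0])
# Calibrated toy model: predicts 0.8 and 4 of 5 cases are positive
ece_good = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])

# Hypothetical validation predictions for threshold selection
probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
t, sens, spec = threshold_for_sensitivity(probs, labels, target_sensitivity=0.75)
print(f"ECE bad={ece_bad:.2f}, good={ece_good:.2f}; threshold={t}, sens={sens}, spec={spec}")
```

On this toy data the selected threshold is 0.40 rather than 0.5, which illustrates why the cut-off is best chosen from the required sensitivity instead of being left at the default. In practice the threshold should be picked on calibrated probabilities from a held-out set.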
import torch
import torch.nn as nn
def calibrate_temperature(
    model: nn.Module,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    criterion: nn.Module,
) -> float:
    """Find temperature T that minimises NLL on validation set (temperature scaling).

    criterion is assumed to take logits, e.g. nn.BCEWithLogitsLoss.
    """
    model.eval()
    with torch.no_grad():
        logits = model(X_val).squeeze(-1)  # fixed logits; only T is optimised
    temperature = torch.nn.Parameter(torch.ones(1))
    opt = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)
    def closure():
        opt.zero_grad()
        scaled_logits = logits / temperature
        loss = criterion(scaled_logits, y_val)
        loss.backward()
        return loss
    opt.step(closure)
    t = temperature.item()
    print(f"Optimal temperature: {t:.4f}")
    return t
# At inference: divide logits by temperature before sigmoid
# If T > 1: model was overconfident (scaling pulls probabilities toward 0.5)
# If T < 1: model was underconfident (scaling pushes probabilities toward 0 and 1)
Interview Answer
"Neural networks learn by iterating: forward pass (compute predictions), loss (measure error), backward pass (compute gradients via autograd), optimiser step (update weights). Activation functions are mandatory — without them, depth adds nothing (linear composites are linear). Underfitting = both losses high (too simple); overfitting = val loss >> train loss (too complex). Architecture selection: start small, diagnose the gap, iterate. Common training failures: gradient explosion (fix with clip_grad_norm), bad batches from unnormalised inputs (fix with feature standardisation), and BatchNorm in wrong mode (always model.train() during training). For clinical deployment: validate calibration (ECE), set clinical thresholds based on consequence analysis, and provide feature attribution (SHAP) to support clinician review."