Deep Learning for AI Interviews · Lesson 32 of 56

Interview: Design a Neural Network for a Given Task

Q1: How does a neural network learn?

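Answer: A neural network learns by iteratively reducing a loss function with gradient-based optimisation. The forward pass propagates the input through the layers (Z = X·W + b, then an activation), producing predictions. The loss function measures prediction error against ground truth. The backward pass applies the chain rule (automated by autograd in PyTorch) to compute dL/dW for every parameter. The optimiser then updates the weights. Plain gradient descent uses W ← W - α·grad; AdamW, used below, additionally rescales each step using running averages of the gradient and its square and applies decoupled weight decay. Repeated over many batches, the weights move toward values that give a low training loss, although convergence to a global minimum is not guaranteed.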

Python

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.BCEWithLogitsLoss()

X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,)).float()

# The four steps of learning
optimizer.zero_grad()                        # 1. clear previous gradients
loss = criterion(model(X).squeeze(-1), y)    # 2. forward + loss (squeeze(-1) keeps the batch dim even for batch size 1)
loss.backward()                              # 3. backward (compute dL/dW)
optimizer.step()                             # 4. update weights
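
To make the update rule W ← W - α·grad concrete, here is a minimal sketch of plain gradient descent written by hand, without an optimiser object. The toy data (y = 3x + 1), the step size alpha, and the number of steps are assumptions chosen for illustration; they are not part of the lesson's example.

```python
import torch

# Toy regression data (hypothetical): y = 3x + 1
X = torch.linspace(-1, 1, 20).unsqueeze(1)   # shape (20, 1)
y = 3 * X + 1

W = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1                                  # learning rate (assumed value)

losses = []
for step in range(200):
    pred = X @ W + b                         # forward pass
    loss = ((pred - y) ** 2).mean()          # mean squared error
    loss.backward()                          # autograd fills W.grad and b.grad
    with torch.no_grad():
        W -= alpha * W.grad                  # W <- W - alpha * dL/dW
        b -= alpha * b.grad
    W.grad.zero_()                           # same role as optimizer.zero_grad()
    b.grad.zero_()
    losses.append(loss.item())

print(f"W={W.item():.3f}, b={b.item():.3f}, final loss={losses[-1]:.6f}")
```

On this toy problem W and b should approach 3 and 1. torch.optim.SGD performs the same update; AdamW replaces the raw gradient with a rescaled one.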

Q2: Why do we need activation functions?

Answer: Without activation functions, any number of linear layers collapses into a single linear (affine) transformation: Layer2(Layer1(x)) = W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), which is still linear. Non-linear activations break this collapse, allowing the network to approximate non-linear functions. This is required for any task beyond linearly separable data (e.g., XOR is not linearly separable). The Universal Approximation Theorem states that a network with a single hidden layer and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units; it does not say how many units are needed or that training will find them. ReLU is the standard hidden-layer activation because its gradient is exactly 1 for positive inputs, so it does not saturate there (which mitigates vanishing gradients), and it is cheap to compute. Its gradient is 0 for negative inputs, so a unit can still "die" if it stops activating.

Python

import torch
import torch.nn as nn

# Without activation: just a linear model regardless of depth
linear_stack = nn.Sequential(nn.Linear(10, 32), nn.Linear(32, 1))
# This is equivalent to nn.Linear(10, 1) — depth adds nothing

# With activation: can learn non-linear boundaries
nonlinear_stack = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

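# Verify: collapse the two layers into one weight matrix and one bias vector
W_equiv = linear_stack[1].weight @ linear_stack[0].weight                        # shape (1, 10)
b_equiv = linear_stack[1].weight @ linear_stack[0].bias + linear_stack[1].bias   # shape (1,)
x = torch.randn(4, 10)
print(torch.allclose(linear_stack(x), x @ W_equiv.T + b_equiv, atol=1e-5))  # expected True, up to float rounding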
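
The XOR claim can be checked directly. The sketch below sets the weights of a tiny 2-2-1 network by hand (they are chosen for illustration, not learned) so that it computes XOR exactly when a ReLU sits between the layers. Removing the ReLU while keeping the same weights leaves a linear function of x1 + x2, which cannot reproduce XOR.

```python
import torch
import torch.nn as nn

# Hand-set weights (illustrative assumption, not trained):
#   h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2 * h2
hidden = nn.Linear(2, 2)
output = nn.Linear(2, 1)
with torch.no_grad():
    hidden.weight.copy_(torch.tensor([[1.0, 1.0], [1.0, 1.0]]))
    hidden.bias.copy_(torch.tensor([0.0, -1.0]))
    output.weight.copy_(torch.tensor([[1.0, -2.0]]))
    output.bias.zero_()

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

with torch.no_grad():
    with_relu = output(torch.relu(hidden(X))).squeeze(-1)   # XOR truth table
    without_relu = output(hidden(X)).squeeze(-1)            # equals 2 - (x1 + x2)

print("with ReLU:   ", with_relu.tolist())
print("without ReLU:", without_relu.tolist())
```

With the ReLU the four outputs are 0, 1, 1, 0; without it they are 2, 1, 1, 0, a straight line in x1 + x2.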

Q3: What is the difference between underfitting and overfitting?

Answer: Underfitting: the model is too simple to capture the true pattern — high training loss AND high validation loss. Causes: insufficient capacity, too few epochs, too much regularisation. Fix: increase capacity (more layers/neurons), train longer, reduce regularisation. Overfitting: the model has memorised training noise — low training loss but HIGH validation loss (large gap). Causes: too much capacity relative to data, insufficient regularisation. Fix: add Dropout, weight decay, data augmentation, or reduce model size. The ideal is a small train/val gap with both losses low.

Python

import torch
import torch.nn as nn

def check_train_val_gap(
    model: nn.Module,
    X_train: torch.Tensor, y_train: torch.Tensor,
    X_val:   torch.Tensor, y_val:   torch.Tensor,
    criterion: nn.Module,
) -> None:
    model.eval()
    with torch.no_grad():
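        train_loss = criterion(model(X_train).squeeze(-1), y_train).item()
        val_loss   = criterion(model(X_val).squeeze(-1), y_val).item()

    gap = val_loss - train_loss
    # NOTE: 0.5 and 0.1 are illustrative cut-offs for a BCE loss; tune them per task and loss scale
    if train_loss > 0.5: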
        print(f"UNDERFIT: train={train_loss:.4f}, val={val_loss:.4f}")
    elif gap > 0.1:
        print(f"OVERFIT:  train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
    else:
        print(f"HEALTHY:  train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
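
Two of the overfitting fixes named in the answer, Dropout and weight decay, take one line each in PyTorch. The following is a minimal sketch; the layer sizes, p=0.3, and weight_decay=1e-2 are assumed values for illustration rather than recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout between layers; weight decay is set on the optimiser
regularised = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(32, 1),
)
optimizer = torch.optim.AdamW(regularised.parameters(), lr=3e-4, weight_decay=1e-2)

X = torch.randn(8, 10)

regularised.train()
train_a, train_b = regularised(X), regularised(X)    # differ: random units are dropped each call

regularised.eval()
with torch.no_grad():
    eval_a, eval_b = regularised(X), regularised(X)  # identical: Dropout is disabled in eval mode

print("train outputs equal:", torch.equal(train_a, train_b))
print("eval outputs equal: ", torch.equal(eval_a, eval_b))
```

Because Dropout is only active in train mode, the train/val gap should be measured with model.eval(), as check_train_val_gap does.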

Q4: How do you decide the architecture for a new task?

Answer: A five-step process: (1) Match inductive bias to data structure: CNNs for images/signals, Transformers for sequences, MLPs for tabular data; (2) Start small: 2–3 layers, 64–128 neurons, Dropout 0.3, AdamW; (3) Establish a baseline: train for about 20 epochs and check that the loss decreases; (4) Diagnose: if underfitting, increase capacity; if overfitting, add regularisation; (5) Iterate: run controlled ablations over depth and width, changing one factor at a time, rather than random search. For clinical tabular data with 10–100 features and 10K–100K samples, a 3-layer MLP [128, 64, 32] with BatchNorm and Dropout 0.3 is a reasonable starting point.

Python

import torch.nn as nn

def build_clinical_mlp(
    n_features: int,
    n_samples: int,
    task: str = "binary",
) -> nn.Module:
    """Architecture based on dataset size heuristics."""
    if n_samples < 5_000:
        hidden = [64, 32]
        dropout = 0.4
    elif n_samples < 50_000:
        hidden = [128, 64, 32]
        dropout = 0.3
    else:
        hidden = [256, 128, 64, 32]
        dropout = 0.2
    
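    if task not in ("binary", "regression"):
        raise ValueError(f"Unsupported task: {task!r}")
    n_out = 1  # one logit (binary) or one value (regression); no output activation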
    dims = [n_features] + hidden + [n_out]
    layers = []
    
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([
            nn.Linear(in_d, out_d),
            nn.BatchNorm1d(out_d),
            nn.ReLU(),
            nn.Dropout(dropout),
        ])
    layers.append(nn.Linear(hidden[-1], n_out))
    return nn.Sequential(*layers)

model = build_clinical_mlp(n_features=20, n_samples=15_000)
n_params = sum(p.numel() for p in model.parameters())
print(f"Params: {n_params:,}")
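
Step (3), establishing a baseline, can be sketched as a short full-batch training loop that records the loss per epoch. The helper name train_baseline, the synthetic data, and the learning rate below are assumptions for illustration; a real baseline would use mini-batches and a held-out validation set.

```python
import torch
import torch.nn as nn

def train_baseline(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
                   epochs: int = 20, lr: float = 1e-2) -> list:
    """Hypothetical helper: full-batch training, returns the loss at each epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    history = []
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X).squeeze(-1), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history

torch.manual_seed(0)
X = torch.randn(256, 20)
y = (X[:, 0] + X[:, 1] > 0).float()      # synthetic labels that are learnable by construction
baseline = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

history = train_baseline(baseline, X, y)
print(f"first epoch loss={history[0]:.4f}, last epoch loss={history[-1]:.4f}")
```

If the last loss is not clearly below the first, suspect the data pipeline or the learning rate before changing the architecture.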

Q5: Why does training loss sometimes spike mid-training?

Answer: Four common causes: (1) Learning rate too high: the update overshoots the minimum; fix by lowering the learning rate or adding warm-up and a decay schedule, with gradient clipping as a safeguard. (2) Bad batch: a batch with extreme outliers produces an unusually large gradient; check preprocessing for unnormalised features. (3) Gradient explosion: gradients can grow exponentially with depth as they are multiplied through the layers; torch.nn.utils.clip_grad_norm_ caps the gradient norm so that a single step cannot blow up the weights. (4) BatchNorm in the wrong mode: calling model.eval() for validation and not switching back to model.train() leaves BatchNorm using its running statistics (and Dropout disabled) for the rest of training; call model.train() at the start of every training step or epoch and model.eval() only for validation.

Python

import torch
import torch.nn as nn

def safe_training_step(
    model: nn.Module,
    X: torch.Tensor, y: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
) -> dict:
    model.train()        # CRITICAL: ensure train mode
    optimizer.zero_grad()
    
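    loss = criterion(model(X).squeeze(-1), y)
    # Check the loss before backward so NaN/inf never reaches the gradients
    if not torch.isfinite(loss):
        print(f"WARNING: loss={loss.item()}, skipping step")
        return {"loss": float("nan"), "grad_norm": float("nan")}

    loss.backward()

    # clip_grad_norm_ returns the total gradient norm measured before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm):
        print(f"WARNING: grad_norm={grad_norm.item()}, skipping step")
        optimizer.zero_grad()
        return {"loss": loss.item(), "grad_norm": float("nan")}

    optimizer.step()
    return {"loss": loss.item(), "grad_norm": grad_norm.item()}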
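
Cause (4) is easy to see on a single BatchNorm layer. The sketch below, which uses assumed toy data with mean about 5 and standard deviation about 3, shows that eval mode normalises with the stored running statistics and leaves them untouched, while train mode normalises with the current batch and updates them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
X = torch.randn(16, 4) * 3 + 5            # toy batch: mean ~5, std ~3

# Eval mode: uses running stats (initially mean 0, var 1) and does not update them
bn.eval()
before = bn.running_mean.clone()
with torch.no_grad():
    eval_out = bn(X)
unchanged_in_eval = torch.equal(bn.running_mean, before)

# Train mode: normalises with the batch statistics and updates the running stats
bn.train()
train_out = bn(X)
updated_in_train = not torch.equal(bn.running_mean, before)

print(f"eval output mean:  {eval_out.mean().item():.3f}")    # far from 0: stale statistics
print(f"train output mean: {train_out.mean().item():.3f}")   # close to 0: batch statistics
```

The two modes give very different activations for the same input, which is why a forgotten model.train() after validation can show up as a sudden jump in the loss.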

Q6: How do you interpret a model's predictions for clinical use?

Answer: Three layers of interpretation are needed: (1) Calibration: do predicted probabilities match observed frequencies? Among patients for whom a readmission model predicts 80% risk, about 80% should actually be readmitted. Use Expected Calibration Error (ECE) and reliability diagrams; recalibrate with Platt scaling or temperature scaling if needed. (2) Threshold selection: the default 0.5 threshold is rarely optimal for clinical use; choose the threshold from the desired sensitivity/specificity trade-off, informed by the clinical consequences of false negatives and false positives. (3) Feature attribution: use SHAP or integrated gradients to explain individual predictions for clinical review.

Python

import torch
import torch.nn as nn

def calibrate_temperature(
    model: nn.Module,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    criterion: nn.Module,
) -> float:
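    """Find temperature T that minimises NLL on validation set (temperature scaling).

    Assumes a binary model that outputs logits and a logit-based criterion
    such as nn.BCEWithLogitsLoss; y_val must be a float tensor of 0s and 1s.
    """
    model.eval()

    with torch.no_grad():
        logits = model(X_val).squeeze(-1)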
    
    temperature = torch.nn.Parameter(torch.ones(1))
    opt = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)
    
    def closure():
        opt.zero_grad()
        scaled_logits = logits / temperature
        loss = criterion(scaled_logits, y_val)
        loss.backward()
        return loss
    
    opt.step(closure)
    t = temperature.item()
    print(f"Optimal temperature: {t:.4f}")
    return t

# At inference: divide logits by temperature before sigmoid
# If T > 1: the model was overconfident (dividing by T pulls probabilities toward 0.5)
# If T < 1: the model was underconfident (dividing by T pushes probabilities toward 0 or 1)
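
The calibration check and the threshold choice from points (1) and (2) can be sketched as two small functions. Both names (expected_calibration_error, threshold_for_sensitivity) are hypothetical helpers written for this example, the ECE shown is the equal-width-bin variant for binary probabilities, and the six validation predictions are made-up data.

```python
import math
import torch

def expected_calibration_error(probs: torch.Tensor, labels: torch.Tensor,
                               n_bins: int = 10) -> float:
    """Binary ECE: weighted mean of |observed positive rate - mean predicted prob| per bin."""
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = torch.bucketize(probs, edges[1:-1])   # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = (labels[mask].float().mean() - probs[mask].mean()).abs().item()
            ece += mask.float().mean().item() * gap  # weight by fraction of samples in the bin
    return ece

def threshold_for_sensitivity(probs: torch.Tensor, labels: torch.Tensor,
                              target_sensitivity: float = 0.9) -> float:
    """Highest threshold t such that predicting positive when prob >= t meets the target sensitivity."""
    pos_probs = probs[labels == 1]
    if pos_probs.numel() == 0:
        raise ValueError("No positive examples; sensitivity is undefined")
    sorted_desc = torch.sort(pos_probs, descending=True).values
    k = max(1, math.ceil(target_sensitivity * pos_probs.numel()))  # positives that must be caught
    return sorted_desc[k - 1].item()

# Made-up validation predictions for illustration
probs = torch.tensor([0.9, 0.7, 0.4, 0.2, 0.6, 0.1])
labels = torch.tensor([1, 1, 1, 1, 0, 0])

ece = expected_calibration_error(probs, labels)
thr = threshold_for_sensitivity(probs, labels, target_sensitivity=0.75)
print(f"ECE={ece:.3f}, threshold for 75% sensitivity={thr:.2f}")
```

Run this on temperature-scaled probabilities from a held-out set, and pick the target sensitivity with clinicians rather than from the data alone.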

Interview Answer

"Neural networks learn by iterating: forward pass (compute predictions), loss (measure error), backward pass (compute gradients via autograd), optimiser step (update weights). Activation functions are mandatory — without them, depth adds nothing (linear composites are linear). Underfitting = both losses high (too simple); overfitting = val loss >> train loss (too complex). Architecture selection: start small, diagnose the gap, iterate. Common training failures: gradient explosion (fix with clip_grad_norm), bad batches from unnormalised inputs (fix with feature standardisation), and BatchNorm in wrong mode (always model.train() during training). For clinical deployment: validate calibration (ECE), set clinical thresholds based on consequence analysis, and provide feature attribution (SHAP) to support clinician review."
