Learnixo
AI Systems | intermediate

Neural Networks: Interview Q&A

Six key interview questions on MLP architecture, capacity, the forward pass, loss functions, optimisers, and debugging training failures.

Asma Hafeez Khan | May 22, 2026 | 6 min read
Deep Learning, Neural Networks, MLP, Architecture, Interview

Q1: How does a neural network learn?

Answer: A neural network learns by iteratively reducing a loss function with gradient-based optimisation. The forward pass propagates the input through the layers (Z = X·W + b, followed by an activation) to produce predictions. The loss function measures the prediction error against the ground truth. The backward pass uses autograd (the chain rule) to compute dL/dW for every parameter. The optimiser then updates the weights: in plain gradient descent, W ← W - α·grad, where α is the learning rate; AdamW, used below, applies the same idea with per-parameter adaptive step sizes and decoupled weight decay. Repeated over many batches, the weights move toward values that give a low training loss.

Python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.BCEWithLogitsLoss()

X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,)).float()

# The four steps of learning
optimizer.zero_grad()                        # 1. clear previous gradients
loss = criterion(model(X).squeeze(), y)      # 2. forward + loss
loss.backward()                              # 3. backward (compute dL/dW)
optimizer.step()                             # 4. update weights
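
To make the update rule W ← W - α·grad concrete, here is a minimal sketch of the same four steps with the optimiser step written out by hand. It assumes a toy linear-regression problem on synthetic data; the names `true_w`, `alpha`, and `losses` are illustrative and not part of the example above.

```python
import torch

torch.manual_seed(0)

# Synthetic data (assumption): y is an exact linear function of X
X = torch.randn(64, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 0.3

# Parameters tracked by autograd
W = torch.zeros(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1  # learning rate

losses = []
for step in range(200):
    pred = X @ W + b                      # forward pass: Z = X·W + b
    loss = ((pred - y) ** 2).mean()       # mean squared error
    loss.backward()                       # backward pass: fills W.grad and b.grad
    with torch.no_grad():                 # the update itself must not be tracked
        W -= alpha * W.grad               # W <- W - alpha * grad
        b -= alpha * b.grad
    W.grad.zero_()                        # clear gradients for the next step
    b.grad.zero_()
    losses.append(loss.item())

print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.6f}")
```

With a real model you would normally keep `torch.optim.AdamW` as in the example above; the manual loop only shows what `optimizer.step()` does in the simplest case.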

Q2: Why do we need activation functions?

Answer: Without activation functions, any number of linear layers collapses into a single linear (affine) transformation: Layer2(Layer1(x)) = W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), which is still linear. Non-linear activations break this collapse, allowing the network to approximate non-linear functions. This is required for any task beyond linearly separable data (for example, XOR is not linearly separable). The Universal Approximation Theorem states that a network with a single hidden layer and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. ReLU is the standard hidden-layer activation because it does not saturate for positive inputs (its gradient there is exactly 1, which mitigates vanishing gradients) and it is cheap to compute.

Python
import torch
import torch.nn as nn

# Without activation: just a linear model regardless of depth
linear_stack = nn.Sequential(nn.Linear(10, 32), nn.Linear(32, 1))
# This is equivalent to a single nn.Linear(10, 1): depth adds nothing

# With activation: can learn non-linear boundaries
nonlinear_stack = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

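# Verify: the two linear layers collapse into one affine map W_equiv x + b_equiv
W_equiv = linear_stack[1].weight @ linear_stack[0].weight                       # shape (1, 10)
b_equiv = linear_stack[1].weight @ linear_stack[0].bias + linear_stack[1].bias  # shape (1,)
x = torch.randn(4, 10)
print(f"Weight product shape: {tuple(W_equiv.shape)}")  # (1, 10), i.e. one linear layer
print(torch.allclose(linear_stack(x), x @ W_equiv.T + b_equiv, atol=1e-5))  # True: same outputs up to float error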
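
The XOR case mentioned in the answer can be checked directly. The sketch below is an illustration with hand-picked weights (an assumption, not a trained model): the best least-squares linear fit predicts 0.5 for every XOR input, while a two-unit ReLU network reproduces XOR exactly.

```python
import torch

# The four XOR inputs and their targets
X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Best linear model: least squares on [x1, x2, 1]
A = torch.cat([X, torch.ones(4, 1)], dim=1)
w = torch.linalg.lstsq(A, y).solution
linear_pred = A @ w                       # every prediction is about 0.5

# Hand-set ReLU network: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2*h2
W1 = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
b1 = torch.tensor([0.0, -1.0])
W2 = torch.tensor([[1.0], [-2.0]])
hidden = torch.relu(X @ W1 + b1)
relu_pred = hidden @ W2                   # equals the XOR targets

print("linear:", linear_pred.squeeze().tolist())
print("relu:  ", relu_pred.squeeze().tolist())
```

The single non-linearity is what lets the second hidden unit "switch on" only for the input (1, 1) and cancel the first unit's output there.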

Q3: What is the difference between underfitting and overfitting?

Answer: Underfitting: the model is too simple to capture the true pattern, so training loss AND validation loss are both high. Causes: insufficient capacity, too few epochs, too much regularisation. Fix: increase capacity (more layers/neurons), train longer, reduce regularisation. Overfitting: the model has memorised training noise, so training loss is low but validation loss is HIGH (a large gap). Causes: too much capacity relative to the data, insufficient regularisation. Fix: add Dropout, weight decay, or data augmentation, or reduce the model size. The ideal is a small train/val gap with both losses low.

Python
import torch
import torch.nn as nn

def check_train_val_gap(
    model: nn.Module,
    X_train: torch.Tensor, y_train: torch.Tensor,
    X_val:   torch.Tensor, y_val:   torch.Tensor,
    criterion: nn.Module,
) -> None:
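    was_training = model.training
    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(X_train).squeeze(), y_train).item()
        val_loss   = criterion(model(X_val).squeeze(), y_val).item()
    model.train(was_training)  # restore the caller's mode so later training steps are unaffected

    gap = val_loss - train_loss
    # The 0.5 and 0.1 cut-offs are illustrative; tune them to your loss scale
    if train_loss > 0.5: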
        print(f"UNDERFIT: train={train_loss:.4f}, val={val_loss:.4f}")
    elif gap > 0.1:
        print(f"OVERFIT:  train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
    else:
        print(f"HEALTHY:  train={train_loss:.4f}, val={val_loss:.4f}, gap={gap:.4f}")
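
Two of the overfitting fixes named in the answer, Dropout and weight decay, can be sketched as below. The layer sizes, `p=0.3`, and `weight_decay=1e-2` are assumed values for illustration, not recommendations from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout between layers randomly zeroes activations during training
regularised = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 1),
)

# weight_decay shrinks the weights a little on every optimiser step
optimizer = torch.optim.AdamW(regularised.parameters(), lr=3e-4, weight_decay=1e-2)

x = torch.randn(8, 10)

regularised.train()                       # Dropout active: each pass uses a new random mask
train_a, train_b = regularised(x), regularised(x)

regularised.eval()                        # Dropout disabled: outputs are deterministic
with torch.no_grad():
    eval_a, eval_b = regularised(x), regularised(x)

print("train passes identical:", torch.equal(train_a, train_b))
print("eval passes identical: ", torch.equal(eval_a, eval_b))
```

Because Dropout behaves differently in the two modes, train and validation losses should both be measured in eval mode when comparing the gap.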

Q4: How do you decide the architecture for a new task?

Answer: A five-step process: (1) Match inductive bias to data structure: CNNs for images/signals, Transformers for sequences, MLPs for tabular data; (2) Start small: 2–3 layers, 64–128 neurons, Dropout 0.3, AdamW; (3) Establish a baseline: train for 20 epochs and check that the loss decreases; (4) Diagnose: if underfitting, increase capacity; if overfitting, add regularisation; (5) Iterate: run ablations over depth/width rather than random search. For clinical tabular data with 10–100 features and 10K–100K samples, a 3-layer MLP [128, 64, 32] with BatchNorm and Dropout 0.3 is a reasonable starting point.

Python
import torch.nn as nn

def build_clinical_mlp(
    n_features: int,
    n_samples: int,
    task: str = "binary",
) -> nn.Module:
    """Architecture based on dataset size heuristics."""
    if n_samples < 5_000:
        hidden = [64, 32]
        dropout = 0.4
    elif n_samples < 50_000:
        hidden = [128, 64, 32]
        dropout = 0.3
    else:
        hidden = [256, 128, 64, 32]
        dropout = 0.2
    
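    if task not in ("binary", "regression"):
        raise ValueError(f"unsupported task: {task!r}")
    n_out = 1  # single output: a logit (binary) or a value (regression)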
    dims = [n_features] + hidden + [n_out]
    layers = []
    
    for in_d, out_d in zip(dims[:-2], dims[1:-1]):
        layers.extend([
            nn.Linear(in_d, out_d),
            nn.BatchNorm1d(out_d),
            nn.ReLU(),
            nn.Dropout(dropout),
        ])
    layers.append(nn.Linear(hidden[-1], n_out))
    return nn.Sequential(*layers)

model = build_clinical_mlp(n_features=20, n_samples=15_000)
n_params = sum(p.numel() for p in model.parameters())
print(f"Params: {n_params:,}")
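
The parameter count can also be worked out by hand, which is a useful sanity check on capacity. The helper below is a hypothetical sketch (`count_mlp_params` is not from the article): each Linear layer contributes `fan_in * fan_out + fan_out` parameters, and each BatchNorm layer on a hidden layer contributes a scale and a shift per feature.

```python
def count_mlp_params(dims: list[int], batchnorm: bool = True) -> int:
    """Count trainable parameters of an MLP.

    dims = [n_features, hidden_1, ..., hidden_k, n_out]
    """
    total = 0
    n_layers = len(dims) - 1
    for i, (fan_in, fan_out) in enumerate(zip(dims[:-1], dims[1:])):
        total += fan_in * fan_out + fan_out   # Linear: weights + biases
        is_hidden = i < n_layers - 1
        if batchnorm and is_hidden:
            total += 2 * fan_out              # BatchNorm: scale + shift per feature
    return total


# 20 features, hidden layers [128, 64, 32], one output
n_params = count_mlp_params([20, 128, 64, 32, 1])
print(f"Params: {n_params:,}")  # Params: 13,505
```

For 20 features and 15,000 samples, the builder above selects the [128, 64, 32] configuration, so this hand count should agree with the `Params` value it prints. BatchNorm running statistics are buffers rather than parameters, so they are not counted.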

Q5: Why does training loss sometimes spike mid-training?

Answer: Four common causes: (1) Learning rate too high: the update overshoots the minimum; fix by lowering the learning rate or adding a scheduler (for example, warmup followed by decay), with gradient clipping as a safety net. (2) Bad batch: a batch with extreme outliers causes a large gradient update; check preprocessing for unnormalised features. (3) Gradient explosion: gradients can grow exponentially with depth as they are multiplied through the layers; torch.nn.utils.clip_grad_norm_ caps the gradient norm before each update. (4) BatchNorm in the wrong mode: leaving the model in model.eval() after a validation pass makes BatchNorm use its stored running statistics (and disables Dropout) for the following training steps; call model.train() before training and model.eval() only for validation.

Python
import torch
import torch.nn as nn

def safe_training_step(
    model: nn.Module,
    X: torch.Tensor, y: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
) -> dict:
    model.train()        # CRITICAL: ensure train mode
    optimizer.zero_grad()
    
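    loss = criterion(model(X).squeeze(), y)
    if not torch.isfinite(loss):
        # Non-finite loss: skip before backward so no NaN gradients are produced
        print(f"WARNING: loss={loss.item()}, skipping step")
        return {"loss": float("nan"), "grad_norm": float("nan")}

    loss.backward()

    # clip_grad_norm_ returns the total gradient norm measured before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm):
        print("WARNING: non-finite gradient norm, skipping step")
        optimizer.zero_grad()
        return {"loss": loss.item(), "grad_norm": float("nan")}

    optimizer.step()
    return {"loss": loss.item(), "grad_norm": grad_norm.item()}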
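
The BatchNorm mode issue from cause (4) is easy to observe in isolation. This is a small sketch on an assumed toy input: in eval mode the layer's running statistics stay fixed, while in train mode each forward pass updates them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)
x = torch.randn(16, 3) * 5 + 10           # toy batch with mean near 10, far from the initial 0

# Eval mode: running statistics are used for normalisation but never updated
bn.eval()
before = bn.running_mean.clone()
bn(x)
unchanged_in_eval = torch.equal(bn.running_mean, before)

# Train mode: batch statistics are used and the running mean moves toward the batch mean
bn.train()
bn(x)
updated_in_train = not torch.equal(bn.running_mean, before)

print("running_mean unchanged in eval:", unchanged_in_eval)
print("running_mean updated in train: ", updated_in_train)
```

A model left in eval mode therefore normalises training batches with stale statistics, which can show up as a sudden jump in the training loss.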

Q6: How do you interpret a model's predictions for clinical use?

Answer: Three layers of interpretation are needed: (1) Calibration: do predicted probabilities match observed frequencies? Among patients for whom a readmission model predicts 80% risk, about 80% should actually be readmitted. Use Expected Calibration Error (ECE) and reliability diagrams; recalibrate with Platt scaling or temperature scaling if needed. (2) Threshold selection: the default 0.5 threshold is rarely optimal for clinical use; choose the threshold from the desired sensitivity/specificity trade-off, informed by the clinical consequences of false positives and false negatives. (3) Feature attribution: use SHAP or integrated gradients to explain individual predictions for clinical review.
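
Before recalibrating, it helps to measure calibration. The sketch below shows one common binned variant of ECE for a binary classifier, computed on the predicted positive-class probability; the function name, the 10 equal-width bins, and the toy inputs are assumptions for illustration.

```python
import torch


def expected_calibration_error(
    probs: torch.Tensor, labels: torch.Tensor, n_bins: int = 10
) -> float:
    """Weighted average of |mean predicted probability - observed positive rate| per bin."""
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    # Map each probability to a bin index in [0, n_bins - 1] using the interior edges
    bin_ids = torch.bucketize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            confidence = probs[mask].mean().item()         # mean predicted probability
            frequency = labels[mask].float().mean().item() # observed positive rate
            weight = mask.float().mean().item()            # fraction of samples in the bin
            ece += weight * abs(confidence - frequency)
    return ece


# Well calibrated: predicts 0.8 and 4 of 5 cases are positive
ece_good = expected_calibration_error(torch.full((5,), 0.8), torch.tensor([1, 1, 1, 1, 0]))
# Overconfident: predicts 0.9 and no case is positive
ece_bad = expected_calibration_error(torch.full((4,), 0.9), torch.tensor([0, 0, 0, 0]))
print(f"ECE (calibrated)={ece_good:.3f}, ECE (overconfident)={ece_bad:.3f}")
```

A large ECE on the validation set is the signal to apply a recalibration step such as temperature scaling.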

Python
import torch
import torch.nn as nn

def calibrate_temperature(
    model: nn.Module,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    criterion: nn.Module,
) -> float:
    """Find temperature T that minimises NLL on validation set (temperature scaling)."""
    model.eval()
    
    with torch.no_grad():
        logits = model(X_val).squeeze()
    
    temperature = torch.nn.Parameter(torch.ones(1))
    opt = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)
    
    def closure():
        opt.zero_grad()
        scaled_logits = logits / temperature
        loss = criterion(scaled_logits, y_val)
        loss.backward()
        return loss
    
    opt.step(closure)
    t = temperature.item()
    print(f"Optimal temperature: {t:.4f}")
    return t

# At inference: divide logits by the temperature before applying sigmoid
# If T > 1: the model was overconfident (scaling pulls probabilities toward 0.5)
# If T < 1: the model was underconfident (scaling pushes probabilities toward 0 and 1)
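
Threshold selection, point (2) in the answer above, can be sketched in the same spirit. The helper below is hypothetical: it picks the highest threshold that still reaches a target sensitivity on a validation set and reports the resulting specificity. The target of 0.75 and the toy scores are assumptions; in practice the target comes from the clinical consequence analysis.

```python
import math

import torch


def threshold_for_sensitivity(
    probs: torch.Tensor, labels: torch.Tensor, target_sensitivity: float = 0.9
) -> tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) for the rule `prob >= threshold`."""
    pos = labels == 1
    neg = labels == 0
    n_pos, n_neg = int(pos.sum()), int(neg.sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Number of positives that must be flagged to reach the target sensitivity
    k = max(1, math.ceil(target_sensitivity * n_pos - 1e-9))
    sorted_pos, _ = torch.sort(probs[pos], descending=True)
    threshold = sorted_pos[k - 1]          # score of the k-th highest-scoring positive

    preds = probs >= threshold
    sensitivity = int((preds & pos).sum()) / n_pos
    specificity = int((~preds & neg).sum()) / n_neg
    return threshold.item(), sensitivity, specificity


probs = torch.tensor([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])
labels = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])
thr, sens, spec = threshold_for_sensitivity(probs, labels, target_sensitivity=0.75)
print(f"threshold={thr:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Apply the threshold to calibrated probabilities (after temperature scaling) so that the chosen cut-off corresponds to a meaningful risk level.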

Interview Answer

"Neural networks learn by iterating: forward pass (compute predictions), loss (measure error), backward pass (compute gradients via autograd), optimiser step (update weights). Activation functions are mandatory: without them, depth adds nothing (a composition of linear layers is still linear). Underfitting = both losses high (too simple); overfitting = val loss >> train loss (too complex). Architecture selection: start small, diagnose the gap, iterate. Common training failures: gradient explosion (fix with clip_grad_norm_), bad batches from unnormalised inputs (fix with feature standardisation), and BatchNorm in the wrong mode (always model.train() during training). For clinical deployment: validate calibration (ECE), set clinical thresholds based on consequence analysis, and provide feature attribution (SHAP) to support clinician review."
