Deep Learning for AI Interviews · Lesson 11 of 56

Dropout: How It Prevents Overfitting

The Idea

During each forward pass in training, each neuron output is set to zero independently with probability p, so on average a fraction p of the layer is dropped:

Without dropout:
  All 256 neurons in layer 2 process every example
  Some neurons may co-adapt — neuron A only works when neuron B is active
  Co-adaptation → memorising training data

With dropout (p=0.3):
  Each forward pass: randomly zero ~77 of 256 neurons
  Different neurons are dropped each time
  Neurons must work independently — can't rely on other neurons being present
  Forces learning of redundant, robust representations
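
The counts above can be reproduced with a quick simulation. This is a minimal sketch in NumPy; the seed and the names `n_neurons` and `n_passes` are illustrative choices, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the run is repeatable
n_neurons, p, n_passes = 256, 0.3, 1000

# One row per forward pass; True marks a neuron that is dropped in that pass.
dropped = rng.random((n_passes, n_neurons)) < p

avg_dropped = dropped.sum(axis=1).mean()            # average number dropped per pass
same_mask = np.array_equal(dropped[0], dropped[1])  # do two passes share a mask?

print(f"Average dropped per pass: {avg_dropped:.1f} of {n_neurons}")
print(f"First two passes use the same mask: {same_mask}")
```

The average lands near 0.3 × 256 ≈ 77, and consecutive passes draw different masks, which is what prevents any neuron from relying on a fixed set of partners.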

The Math: Inverted Dropout

Python

import numpy as np
import torch
import torch.nn as nn

# Manual inverted dropout (PyTorch style)
def dropout(x: np.ndarray, p: float, training: bool) -> np.ndarray:
    """
    p = probability of dropping (zeroing) a neuron, with 0 <= p < 1
    Inverted dropout: scales remaining neurons by 1/(1-p) during training
    so that the expected activation in training matches the unscaled
    activation at test time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")   # p = 1 would divide by zero
    if not training:
        return x   # no dropout at inference

    mask = (np.random.rand(*x.shape) > p)   # True where we KEEP the neuron
    return x * mask / (1 - p)    # scale by 1/(1-p)

# Why scale by 1/(1-p)?
# If p=0.5, each neuron is zeroed with probability 0.5
# Expected output of a neuron after masking, before scaling: 0.5 × original
# To restore the expected value: multiply surviving neurons by 1/(1-0.5) = 2
# This means no scaling needed at test time → simple inference


# PyTorch built-in dropout
dropout_layer = nn.Dropout(p=0.3)   # drop 30% of neurons

x = torch.ones(4, 256)

dropout_layer.train()        # training mode (the default for a new module)
y_train = dropout_layer(x)   # zeros ~30% and scales remainder by 1/0.7

dropout_layer.eval()         # switch to inference mode
y_eval = dropout_layer(x)    # no-op at eval time (all ones)

print(f"Training output: {y_train[0, :5]}")   # mix of 0 and ~1.43
print(f"Eval output:     {y_eval[0, :5]}")    # all 1.0
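
The scaling argument in the comments above can be checked numerically. The sketch below compares plain masking with inverted dropout on constant activations; the seed and array size are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
p = 0.5
x = np.ones(100_000)            # constant activations make the means easy to read

keep = rng.random(x.shape) > p  # True where the neuron is kept
unscaled = x * keep             # plain masking, no rescaling
inverted = x * keep / (1 - p)   # inverted dropout

print(f"Mean without scaling: {unscaled.mean():.3f}")  # close to 1 - p
print(f"Mean with 1/(1-p):    {inverted.mean():.3f}")  # close to the original mean
```

Without rescaling, the average activation shrinks to roughly (1 - p) times the original, so the next layer would see a different input scale in training than at inference. Dividing by (1 - p) removes that gap.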

Where to Apply Dropout

Python

class MLPWithDropout(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 512),
            nn.ReLU(),
            nn.Dropout(0.3),            # after activation, before next layer
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, d_out),
            # NO dropout after the final layer
        )
    
    def forward(self, x):
        return self.net(x)


class TransformerWithDropout(nn.Module):
    """Dropout in attention layers follows different conventions."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
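        # Note: nn.MultiheadAttention expects (seq_len, batch, d_model) inputs
        # unless batch_first=True is passed
        self.attention = nn.MultiheadAttention(
            d_model, n_heads,
            dropout=0.1,    # attention weight dropout
        )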
        self.dropout = nn.Dropout(0.1)  # applied to attention output
        self.norm = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        return self.norm(x + self.dropout(attn_out))
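
Wherever the dropout layers sit, they are only active while the module is in training mode. The following sketch uses a small stand-in network (the layer sizes are arbitrary assumptions) with the same Linear, ReLU, Dropout pattern to show how `train()` and `eval()` change the behaviour.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

# Small stand-in network with the Linear -> ReLU -> Dropout pattern
net = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(32, 4),
)
x = torch.randn(8, 16)

net.train()                       # dropout active: each pass samples a new mask
train_a, train_b = net(x), net(x)

net.eval()                        # dropout disabled: output is deterministic
with torch.no_grad():
    eval_a, eval_b = net(x), net(x)

print("train passes identical:", torch.equal(train_a, train_b))
print("eval passes identical: ", torch.equal(eval_a, eval_b))
```

Forgetting to call `eval()` before validation or deployment is a common source of noisy, non-reproducible predictions.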

Dropout Rate Guidelines

p = 0.0:  no dropout (baseline)
p = 0.1:  light regularisation — Transformers, attention layers
p = 0.2:  mild regularisation — CNNs
p = 0.3–0.5: standard for MLPs
p = 0.5:  strong regularisation — original Hinton et al. recommendation for MLPs
p > 0.5:  rarely used — too much information destroyed

Where NOT to use dropout:
  BatchNorm layers — dropout + BN interact poorly
  LSTM/GRU (use recurrent dropout instead, different formulation)
  Small networks — not enough neurons to drop
  After the output layer (on the final logits or predictions)
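
One commonly given explanation for the poor interaction with BatchNorm is a variance mismatch: inverted dropout preserves the mean of the activations but inflates their variance during training, so a BatchNorm layer placed after it would collect running statistics that no longer fit the eval-mode activations. The sketch below only measures that variance gap; it assumes zero-mean, unit-variance inputs and an arbitrary seed, and it does not train anything.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)       # arbitrary seed
drop = nn.Dropout(p=0.5)
x = torch.randn(200_000)   # zero-mean, unit-variance activations (an assumption)

drop.train()
var_train = drop(x).var().item()  # variance seen while training

drop.eval()
var_eval = drop(x).var().item()   # variance seen at inference

print(f"Variance in train mode: {var_train:.2f}")
print(f"Variance in eval mode:  {var_eval:.2f}")
```

For zero-mean inputs the train-mode variance is about 1/(1 - p) times the eval-mode variance, so the gap grows with the dropout rate.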

MC Dropout: Bayesian Uncertainty

Monte Carlo dropout uses dropout at inference to estimate prediction uncertainty:

Python

class BayesianMLP(nn.Module):
    def __init__(self, d_in: int, d_out: int, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 256),
            nn.ReLU(),
            nn.Dropout(p),    # stays ACTIVE at inference
            nn.Linear(256, d_out),
        )
    
    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(
    model: BayesianMLP,
    x: torch.Tensor,
    n_samples: int = 50,
) -> tuple[torch.Tensor, torch.Tensor]:
    model.train()   # keep dropout active (not model.eval()!)

    with torch.no_grad():   # inference only, no gradients needed
        predictions = torch.stack([
            torch.sigmoid(model(x))
            for _ in range(n_samples)
        ])  # shape: (n_samples, batch, n_out)

    mean = predictions.mean(dim=0)   # expected prediction
    std  = predictions.std(dim=0)    # spread across passes, a proxy for epistemic uncertainty
    return mean, std

# Usage in clinical AI: flag uncertain predictions for review
model = BayesianMLP(d_in=50, d_out=1)   # untrained here; in practice, load trained weights
x_test = torch.randn(100, 50)
mean_pred, uncertainty = mc_dropout_predict(model, x_test)
uncertain_cases = (uncertainty > 0.15).squeeze()
print(f"Uncertain cases: {uncertain_cases.sum().item()} / 100")
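
Calling `model.train()` on the whole model also switches any other mode-dependent layers, such as BatchNorm, into training behaviour. When a model contains such layers, one option is to put everything in eval mode and re-enable only the dropout modules. The helper `enable_mc_dropout` below is a hypothetical name introduced for this sketch, and the network is an arbitrary example rather than the lesson's model.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> nn.Module:
    """Hypothetical helper: eval mode everywhere except nn.Dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()   # keep sampling masks at inference
    return model

torch.manual_seed(0)  # arbitrary seed
net = nn.Sequential(
    nn.Linear(50, 64),
    nn.BatchNorm1d(64),   # should keep using its running statistics
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, 1),
)
enable_mc_dropout(net)

x = torch.randn(10, 50)
with torch.no_grad():
    samples = torch.stack([torch.sigmoid(net(x)) for _ in range(20)])

print("BatchNorm in training mode:", net[1].training)
print("Dropout in training mode:  ", net[3].training)
print("Per-example std shape:     ", tuple(samples.std(dim=0).shape))
```

The check covers only `nn.Dropout`; variants such as `nn.Dropout2d` would need to be added to the `isinstance` test if the model uses them.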

Ensemble Interpretation of Dropout

Dropout at training time can be viewed as training an ensemble of up to 2^n
subnetworks that share weights, where n is the number of units that can be dropped
(each dropout mask = a different subnetwork)

At inference:
  Standard dropout (eval mode): use full network = approximate ensemble average
  MC dropout (train mode): sample from the ensemble = get mean and variance

The variance across samples = epistemic uncertainty
  High variance: the ensemble disagrees → model is uncertain
  Low variance:  ensemble agrees → model is confident

Clinical use case:
  P(readmission) = 0.82, uncertainty std = 0.03 → high confidence → act on it
  P(readmission) = 0.78, uncertainty std = 0.18 → low confidence → flag for clinician review
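
The claim that the full network approximates the ensemble average can be checked in the simplest case. For a single linear layer applied after dropout, the average over masks equals the eval-mode output in expectation; with nonlinearities after the dropout layer the match is only approximate. The sketch below uses arbitrary sizes, seed, and sample count.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed
drop = nn.Dropout(p=0.5)
linear = nn.Linear(64, 1)
x = torch.randn(8, 64)

with torch.no_grad():
    drop.train()   # sample subnetworks: one random mask per pass
    mc_mean = torch.stack([linear(drop(x)) for _ in range(4000)]).mean(dim=0)

    drop.eval()    # full network: no mask, no rescaling
    full_net = linear(drop(x))

max_gap = (mc_mean - full_net).abs().max().item()
print(f"Largest gap between MC mean and eval-mode output: {max_gap:.4f}")
```

The gap shrinks as the number of sampled masks grows, which is why a single eval-mode pass is the usual choice when only the mean prediction is needed.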

Interview Answer

"Dropout randomly zeroes a fraction p of neuron outputs each forward pass during training, forcing neurons to learn independent representations and preventing co-adaptation. Inverted dropout scales remaining activations by 1/(1-p) during training, so no adjustment is needed at inference. Standard practice: p=0.1–0.2 for attention layers in Transformers, p=0.3–0.5 for MLP hidden layers. MC Dropout extends this to uncertainty estimation: keep dropout active at inference and run N forward passes — the variance of predictions across passes measures epistemic uncertainty, useful for flagging cases for clinical review. Dropout interacts poorly with BatchNorm — use one or the other in the same block."
