Dropout Regularisation
How dropout works, the inverted dropout implementation, MC dropout for uncertainty estimation, and when to use it.
The Idea
During each forward pass in training, randomly set a fraction p of neuron outputs to zero:
Without dropout:
  All 256 neurons in layer 2 process every example
  Some neurons may co-adapt: neuron A only works when neuron B is active
  Co-adaptation → memorising training data
With dropout (p=0.3):
  Each forward pass: randomly zero ~77 of 256 neurons
  Different neurons are dropped each time
  Neurons must work independently and can't rely on other neurons being present
  Forces learning of redundant, robust representations

The Math: Inverted Dropout
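The figures above (about 77 of 256 units dropped at p = 0.3) and the 1/(1-p) rescaling that this section describes can be checked numerically. The following is a minimal sketch rather than part of the original code: the seed, the number of simulated passes, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
p = 0.3                          # drop probability, as in the example above
n_units = 256
n_passes = 10_000                # number of simulated forward passes (arbitrary)

x = np.ones(n_units)             # toy activations, all equal to 1.0

# One Bernoulli keep-mask per simulated forward pass: True = keep the unit.
masks = rng.random((n_passes, n_units)) > p

dropped_per_pass = (~masks).sum(axis=1)
mean_dropped = dropped_per_pass.mean()   # expected to be near p * n_units = 76.8

plain = x * masks                # zeroing only: mean activation shrinks towards 1 - p
inverted = x * masks / (1 - p)   # inverted dropout: mean activation stays near 1.0

print(f"mean units dropped per pass: {mean_dropped:.1f}")
print(f"mean activation, no rescale: {plain.mean():.3f}")
print(f"mean activation, inverted:   {inverted.mean():.3f}")
```

The number of dropped units varies from pass to pass; only its average sits near 77. The rescaled activations keep the same mean as the unmasked input, which is why no correction is needed at inference.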
import numpy as np
import torch
import torch.nn as nn

# Manual inverted dropout in NumPy (same scaling convention as PyTorch)
def dropout(x: np.ndarray, p: float, training: bool) -> np.ndarray:
    """
    p = probability of dropping (zeroing) a neuron, with 0 <= p < 1
    Inverted dropout: scales remaining neurons by 1/(1-p) during training
    so that the expected activation matches what the network sees at test time.
    """
    if not training:
        return x  # no dropout at inference
    mask = np.random.rand(*x.shape) > p  # True where we KEEP the neuron
    return x * mask / (1 - p)  # scale by 1/(1-p)

# Why scale by 1/(1-p)?
# If p=0.5, on average half the neurons are zeroed
# Expected value of a neuron's output after masking: 0.5 × original (prob 0.5 of being kept)
# To maintain expected value: multiply surviving neurons by 1/(1-0.5) = 2
# This means no scaling needed at test time → simple inference

# PyTorch built-in dropout
dropout_layer = nn.Dropout(p=0.3)  # drop 30% of neurons on average
x = torch.ones(4, 256)
y_train = dropout_layer(x)  # modules start in training mode: zeros ~30%, scales the rest by 1/0.7
dropout_layer.eval()  # switch the layer to eval mode
y_eval = dropout_layer(x)  # no-op at eval time (all ones)
print(f"Training output: {y_train[0, :5]}")  # mix of 0 and ~1.43
print(f"Eval output: {y_eval[0, :5]}")  # all 1.0

Where to Apply Dropout
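Wherever dropout layers are placed, they all follow the mode of the module that contains them: `model.train()` switches them on and `model.eval()` turns them into the identity. A small sketch, using a hypothetical toy model with arbitrary layer sizes, shows the difference:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 1),
)
x = torch.randn(4, 8)

model.train()                      # dropout active: a fresh mask per forward pass
train_a, train_b = model(x), model(x)

model.eval()                       # dropout disabled: the layer passes inputs through
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

# train()/eval() on the parent propagates to every submodule, including nn.Dropout.
dropout_flags = [m.training for m in model.modules() if isinstance(m, nn.Dropout)]

print("train passes identical:", torch.equal(train_a, train_b))
print("eval passes identical: ", torch.equal(eval_a, eval_b))
print("dropout layers in training mode:", dropout_flags)
```

Forgetting to call `model.eval()` before validation is a common source of noisy, pessimistic metrics, since every evaluation pass would then use a different random subnetwork.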
class MLPWithDropout(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 512),
            nn.ReLU(),
            nn.Dropout(0.3),  # after activation, before next layer
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, d_out),
            # NO dropout after the final layer
        )

    def forward(self, x):
        return self.net(x)


class TransformerWithDropout(nn.Module):
    """Dropout in attention layers follows different conventions."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, n_heads,
            dropout=0.1,  # attention weight dropout
        )
        self.dropout = nn.Dropout(0.1)  # applied to attention output
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x: (seq_len, batch, d_model), since batch_first defaults to False
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        return self.norm(x + self.dropout(attn_out))

Dropout Rate Guidelines
p = 0.0: no dropout (baseline)
p = 0.1: light regularisation — Transformers, attention layers
p = 0.2: mild regularisation — CNNs
p = 0.3–0.5: standard for MLPs
p = 0.5: strong regularisation — original Hinton et al. recommendation for MLPs
p > 0.5: rarely used — too much information destroyed
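These ranges are starting points, and in practice p is usually tuned like any other hyperparameter. The sketch below is one way to compare a few rates on held-out data. It rests on assumptions made purely for illustration: the synthetic regression task, the model size, the optimiser settings, and the helper name `val_loss_for` are all hypothetical, and the numbers it prints say nothing about real datasets.

```python
import torch
import torch.nn as nn

def val_loss_for(p: float, steps: int = 200, seed: int = 0) -> float:
    """Train a small MLP with dropout rate p on toy data; return validation MSE."""
    torch.manual_seed(seed)  # same data and same initial weights for every p
    # Synthetic regression task (an assumption for illustration only).
    x_train, x_val = torch.randn(64, 20), torch.randn(256, 20)
    w_true = torch.randn(20, 1)
    y_train = x_train @ w_true + 0.5 * torch.randn(64, 1)
    y_val = x_val @ w_true + 0.5 * torch.randn(256, 1)

    model = nn.Sequential(
        nn.Linear(20, 128), nn.ReLU(), nn.Dropout(p), nn.Linear(128, 1)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    model.train()                      # dropout on while fitting
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()

    model.eval()                       # dropout off for evaluation
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

# Compare a few of the rates listed above on the same toy problem.
results = {p: val_loss_for(p) for p in (0.0, 0.1, 0.3, 0.5)}
for p, loss in results.items():
    print(f"p={p:.1f}  val MSE={loss:.3f}")
```

On a real task the same loop would wrap the actual training and validation code, ideally averaged over several seeds, since a single run can rank the rates differently by chance.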
Where NOT to use dropout:
  Blocks that use BatchNorm: dropout shifts activation variance between training and inference, which conflicts with BN's running statistics
  LSTM/GRU (use recurrent dropout instead, different formulation)
  Small networks: not enough neurons to drop
  After the output layer (on the final logits or predictions)

MC Dropout: Bayesian Uncertainty
Monte Carlo dropout uses dropout at inference to estimate prediction uncertainty:
class BayesianMLP(nn.Module):
    def __init__(self, d_in: int, d_out: int, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 256),
            nn.ReLU(),
            nn.Dropout(p),  # stays ACTIVE at inference
            nn.Linear(256, d_out),
        )

    def forward(self, x):
        return self.net(x)


def mc_dropout_predict(
    model: BayesianMLP,
    x: torch.Tensor,
    n_samples: int = 50,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Keep dropout active (not model.eval()!). Note that train() would also
    # switch BatchNorm layers to batch statistics; this model has none.
    model.train()
    with torch.no_grad():  # inference only, so skip gradient tracking
        predictions = torch.stack([
            torch.sigmoid(model(x))
            for _ in range(n_samples)
        ])  # shape: (n_samples, batch, n_out)
    mean = predictions.mean(dim=0)  # expected prediction
    std = predictions.std(dim=0)  # spread across passes, a proxy for epistemic uncertainty
    return mean, std


# Usage in clinical AI: flag uncertain predictions for review
model = BayesianMLP(d_in=50, d_out=1)  # untrained here; in practice, load a trained model
x_test = torch.randn(100, 50)
mean_pred, uncertainty = mc_dropout_predict(model, x_test)
uncertain_cases = (uncertainty > 0.15).squeeze()
print(f"Uncertain cases: {uncertain_cases.sum()} / 100")

Ensemble Interpretation of Dropout
Dropout at training time implicitly trains an ensemble of up to 2^n subnetworks that share weights
(n = number of units that can be dropped; each dropout mask = a different subnetwork)
At inference:
  Standard dropout (eval mode): use full network = approximate ensemble average
  MC dropout (train mode): sample from the ensemble = get mean and variance
The variance across samples approximates epistemic uncertainty:
  High variance: the ensemble disagrees → model is uncertain
  Low variance: the ensemble agrees → model is confident
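For a network small enough to enumerate, the ensemble view can be checked by brute force. The sketch below is illustrative only: the tiny 3-4-1 network, its random weights, and the variable names are assumptions, not part of the original material. It averages over all 2^4 dropout masks, weighting each by its probability, and compares the result with the single full-network pass used in eval mode.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # drop probability
n_hidden = 4                             # small enough to enumerate all 2**4 masks

# Hypothetical tiny network: 3 inputs -> 4 hidden units (ReLU) -> 1 output.
W1, b1 = rng.normal(size=(3, n_hidden)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), rng.normal(size=1)
x = rng.normal(size=3)

h = np.maximum(x @ W1 + b1, 0.0)         # hidden activations before dropout

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

full_logit = h @ W2 + b2                 # eval-mode network: no mask, no rescaling

ens_logit = np.zeros(1)
ens_prob = np.zeros(1)
for bits in itertools.product([0, 1], repeat=n_hidden):
    mask = np.array(bits, dtype=float)
    kept = int(mask.sum())
    weight = (1 - p) ** kept * p ** (n_hidden - kept)   # probability of this mask
    logit = (h * mask / (1 - p)) @ W2 + b2              # one subnetwork, inverted scaling
    ens_logit += weight * logit
    ens_prob += weight * sigmoid(logit)

print("full-network logit:     ", full_logit)
print("ensemble-average logit: ", ens_logit)    # equal: the layer after dropout is linear
print("sigmoid(full logit):    ", sigmoid(full_logit))
print("ensemble-average prob:  ", ens_prob)     # generally differs: sigmoid is nonlinear
```

The logits agree because the layer after dropout is linear, so averaging over masks commutes with it. Once a nonlinearity follows, the full network is only an approximation of the ensemble average, which is the sense of "approximate" above.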
Clinical use case:
  P(readmission) = 0.82, uncertainty std = 0.03 → high confidence → act on it
  P(readmission) = 0.78, uncertainty std = 0.18 → low confidence → flag for clinician review

Interview Answer
"Dropout randomly zeroes a fraction p of neuron outputs each forward pass during training, forcing neurons to learn independent representations and preventing co-adaptation. Inverted dropout scales remaining activations by 1/(1-p) during training, so no adjustment is needed at inference. Standard practice: p=0.1–0.2 for attention layers in Transformers, p=0.3–0.5 for MLP hidden layers. MC Dropout extends this to uncertainty estimation: keep dropout active at inference and run N forward passes — the variance of predictions across passes measures epistemic uncertainty, useful for flagging cases for clinical review. Dropout interacts poorly with BatchNorm — use one or the other in the same block."