
Dropout Regularisation

How dropout works, the inverted dropout implementation, MC dropout for uncertainty estimation, and when to use it.

Asma Hafeez Khan · May 21, 2026 · 5 min read
Deep Learning, Dropout, Regularisation, Uncertainty, Interview

The Idea

During each forward pass in training, randomly set a fraction p of neuron outputs to zero:

Without dropout:
  All 256 neurons in layer 2 process every example
  Some neurons may co-adapt — neuron A only works when neuron B is active
  Co-adaptation → memorising training data

With dropout (p=0.3):
  Each forward pass: randomly zero ~77 of 256 neurons
  Different neurons are dropped each time
  Neurons must work independently — can't rely on other neurons being present
  Forces learning of redundant, robust representations
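
The behaviour described above can be checked with a short sketch. It assumes PyTorch's `nn.Dropout` with p=0.3 applied to a stand-in vector of 256 activations; the seed and variable names are illustrative and not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed so the run is repeatable

layer = nn.Dropout(p=0.3)
layer.train()  # dropout is only active in training mode

activations = torch.ones(256)  # stand-in for one layer's outputs

# Two forward passes over the same input draw two independent masks.
pass_a = layer(activations)
pass_b = layer(activations)

dropped_a = int((pass_a == 0).sum())
dropped_b = int((pass_b == 0).sum())
masks_differ = not torch.equal(pass_a == 0, pass_b == 0)

print(f"Dropped in pass A: {dropped_a} of 256")
print(f"Dropped in pass B: {dropped_b} of 256")
print(f"Masks differ between passes: {masks_differ}")
```

Each unit is dropped independently, so the count per pass fluctuates around 0.3 × 256 ≈ 77 rather than hitting it exactly.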

The Math: Inverted Dropout

Python
import numpy as np
import torch
import torch.nn as nn

# Manual inverted dropout in NumPy (mirrors the behaviour of nn.Dropout)
def dropout(x: np.ndarray, p: float, training: bool) -> np.ndarray:
    """
    p = probability of dropping (zeroing) a neuron, with 0 <= p < 1.
    Inverted dropout scales the surviving neurons by 1/(1-p) during training
    so that the expected activation matches the unscaled activation used at test time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")   # p = 1 would divide by zero
    if not training or p == 0.0:
        return x   # no dropout at inference

    mask = np.random.rand(*x.shape) >= p   # True where we KEEP the neuron (prob 1-p)
    return x * mask / (1 - p)    # scale by 1/(1-p)

# Why scale by 1/(1-p)?
# If p=0.5, each neuron is zeroed with probability 0.5
# Expected output of a neuron after masking: 0.5 × original (it is kept with prob 0.5)
# To restore the expected value: multiply surviving neurons by 1/(1-0.5) = 2
# This means no scaling is needed at test time → simple inference


# PyTorch built-in dropout
dropout_layer = nn.Dropout(p=0.3)   # drop 30% of neurons

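x = torch.ones(4, 256)
dropout_layer.train()        # training mode (the default for a new module)
y_train = dropout_layer(x)   # zeros ~30% and scales remainder by 1/0.7
dropout_layer.eval()         # switch to eval mode; the layer stays in eval mode afterwards
y_eval  = dropout_layer(x)   # no-op at eval time (all ones)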

print(f"Training output: {y_train[0, :5]}")   # mix of 0 and ~1.43
print(f"Eval output:     {y_eval[0, :5]}")    # all 1.0
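
The scaling argument can also be verified numerically. For a keep-mask m with P(m = 1) = 1 - p, the expectation is E[m · x / (1 - p)] = (1 - p) · x / (1 - p) = x. The sketch below is a rough check of that identity, assuming NumPy; the helper name `inverted_dropout`, the constant input, and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def inverted_dropout(x, p, rng):
    """Zero each element with probability p, scale survivors by 1/(1-p)."""
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

x = np.full(1_000_000, 3.0)   # constant activations make the means easy to read
p = 0.5

unscaled = x * (rng.random(x.shape) >= p)   # plain masking, no rescale
scaled = inverted_dropout(x, p, rng)

print(f"Input mean:            {x.mean():.3f}")
print(f"Masked, unscaled mean: {unscaled.mean():.3f}")  # close to (1-p) * 3
print(f"Inverted dropout mean: {scaled.mean():.3f}")    # close to 3
```

Without the rescale the mean activation shrinks by a factor of (1 - p), which is the mismatch that inverted dropout removes.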

Where to Apply Dropout

Python
class MLPWithDropout(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 512),
            nn.ReLU(),
            nn.Dropout(0.3),            # after activation, before next layer
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, d_out),
            # NO dropout after the final layer
        )
    
    def forward(self, x):
        return self.net(x)


class TransformerWithDropout(nn.Module):
    """Dropout in attention layers follows different conventions."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
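        # Note: nn.MultiheadAttention expects (seq, batch, d_model) inputs
        # unless batch_first=True is passed.
        self.attention = nn.MultiheadAttention(
            d_model, n_heads,
            dropout=0.1,    # attention weight dropout
        )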
        self.dropout = nn.Dropout(0.1)  # applied to attention output
        self.norm = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        return self.norm(x + self.dropout(attn_out))
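
Because dropout layers sit inside the module, the mode switch matters at call time. The following sketch uses a small stand-in network with the same layout as the MLP above (the layer sizes are illustrative) to show that `train()` gives stochastic outputs while `eval()` gives deterministic ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in with the same layout as the MLP above (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 3),
)
x = torch.randn(8, 20)

model.train()                      # dropout active: outputs vary per pass
train_a, train_b = model(x), model(x)

model.eval()                       # dropout disabled: outputs are deterministic
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print("train passes identical:", torch.equal(train_a, train_b))
print("eval passes identical: ", torch.equal(eval_a, eval_b))
```

Forgetting `model.eval()` before validation or serving is a common source of noisy, slightly worse predictions.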

Dropout Rate Guidelines

p = 0.0:  no dropout (baseline)
p = 0.1:  light regularisation — Transformers, attention layers
p = 0.2:  mild regularisation — CNNs
p = 0.3–0.5: standard for MLPs
p = 0.5:  strong regularisation — original Hinton et al. recommendation for MLPs
p > 0.5:  rarely used — too much information destroyed

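Where NOT to use dropout:
  Directly before BatchNorm layers: dropout shifts activation variance between training and inference, which can conflict with BN's running statistics
  Between LSTM/GRU time steps (use recurrent dropout instead, different formulation)
  Small networks that already underfit: too few neurons to spare
  On the outputs of the final layer (the logits or predictions themselves)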
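
The rates above are starting points, so in practice p is usually treated as a hyperparameter and tuned on validation data. One possible way to sweep it is sketched below. The helper `set_dropout_rate` is hypothetical (not a PyTorch API); it relies only on `nn.Module.modules()` and the `p` attribute of `nn.Dropout`.

```python
import torch.nn as nn

def set_dropout_rate(model: nn.Module, p: float) -> int:
    """Set p on every nn.Dropout in model; return how many layers were changed."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p
            changed += 1
    return changed

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 1),
)

for candidate_p in (0.0, 0.1, 0.3, 0.5):
    n_changed = set_dropout_rate(model, candidate_p)
    # A real sweep would re-initialise, train, and score on a validation set here.
    print(f"p={candidate_p}: updated {n_changed} dropout layers")
```

This only touches `nn.Dropout` modules; dropout configured through other arguments, such as the `dropout=` parameter of `nn.MultiheadAttention`, would need to be set separately.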

MC Dropout: Bayesian Uncertainty

Monte Carlo dropout uses dropout at inference to estimate prediction uncertainty:

Python
class BayesianMLP(nn.Module):
    def __init__(self, d_in: int, d_out: int, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 256),
            nn.ReLU(),
            nn.Dropout(p),    # stays ACTIVE at inference
            nn.Linear(256, d_out),
        )
    
    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(
    model: BayesianMLP,
    x: torch.Tensor,
    n_samples: int = 50,
) -> tuple[torch.Tensor, torch.Tensor]:
    model.train()   # keep dropout active (not model.eval()!); also affects BatchNorm if the model has any
    
    with torch.no_grad():   # inference only, no gradients needed
        predictions = torch.stack([
            torch.sigmoid(model(x))
            for _ in range(n_samples)
        ])  # shape: (n_samples, batch, n_out)
    
    mean = predictions.mean(dim=0)   # expected prediction
    std  = predictions.std(dim=0)    # spread across passes, a proxy for epistemic uncertainty
    return mean, std

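# Usage in clinical AI: flag uncertain predictions for review
model = BayesianMLP(d_in=50, d_out=1)   # untrained here; in practice, load trained weights
x_test = torch.randn(100, 50)
mean_pred, uncertainty = mc_dropout_predict(model, x_test)
uncertain_cases = (uncertainty > 0.15).squeeze()   # 0.15 is an illustrative threshold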
print(f"Uncertain cases: {uncertain_cases.sum()} / 100")
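
Calling `model.train()` switches every submodule to training mode, which matters if the network also contains layers such as BatchNorm. A common alternative, sketched here under that assumption, is to put the model in eval mode and re-activate only the dropout layers. The helper name `enable_dropout_only` and the small network are illustrative, not from the original example.

```python
import torch
import torch.nn as nn

def enable_dropout_only(model: nn.Module) -> None:
    """Put the model in eval mode, then re-activate only the dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(50, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, 1),
)

enable_dropout_only(model)
x = torch.randn(100, 50)
with torch.no_grad():
    samples = torch.stack([torch.sigmoid(model(x)) for _ in range(50)])

bn_training = model[1].training
dropout_training = model[3].training
std = samples.std(dim=0)
print(f"BatchNorm in training mode: {bn_training}")
print(f"Dropout in training mode:   {dropout_training}")
print(f"Mean predictive std:        {std.mean():.4f}")
```

With this setup BatchNorm keeps using its stored running statistics, so the only source of randomness across passes is the dropout mask.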

Ensemble Interpretation of Dropout

Dropout at training time implicitly trains an ensemble of up to 2^n subnetworks that share weights, where n is the number of units that can be dropped
(each dropout mask = a different subnetwork)

At inference:
  Standard dropout (eval mode): use full network = approximate ensemble average
  MC dropout (train mode): sample from the ensemble = get mean and variance

The variance across samples = epistemic uncertainty
  High variance: the ensemble disagrees → model is uncertain
  Low variance:  ensemble agrees → model is confident

Clinical use case:
  P(readmission) = 0.82, uncertainty std = 0.03 → high confidence → act on it
  P(readmission) = 0.78, uncertainty std = 0.18 → low confidence → flag for clinician review
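
The link between the two inference modes can be illustrated with a small experiment. The sketch below uses an untrained, illustrative network and compares the single eval-mode output with the average of many train-mode samples. The two agree closely here because dropout is followed only by a linear layer, so the expectation over masks passes straight through; with further nonlinearities after dropout, the eval-mode output is only an approximation of the ensemble average.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(50, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 1),
)
x = torch.randn(16, 50)

with torch.no_grad():
    model.eval()
    full_network = model(x)            # single pass, dropout off

    model.train()
    sampled = torch.stack([model(x) for _ in range(2000)])  # dropout on
    mc_mean = sampled.mean(dim=0)
    mc_std = sampled.std(dim=0)

max_gap = (full_network - mc_mean).abs().max().item()
print(f"Largest |eval - MC mean| over the batch: {max_gap:.4f}")
print(f"Average spread across subnetworks:       {mc_std.mean():.4f}")
```

The comparison is made on raw logits; applying a sigmoid before averaging, as in the MC dropout example, introduces a small additional gap between the two quantities.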

Interview Answer

"Dropout randomly zeroes a fraction p of neuron outputs on each forward pass during training, which prevents co-adaptation and forces neurons to learn representations that do not depend on specific other neurons. Inverted dropout scales the remaining activations by 1/(1-p) during training, so no adjustment is needed at inference. Typical settings: p≈0.1 for Transformers and attention layers, p=0.3–0.5 for MLP hidden layers. MC Dropout extends this to uncertainty estimation: keep dropout active at inference and run N forward passes. The variance of the predictions across passes approximates epistemic uncertainty, which is useful for flagging cases for clinical review. Dropout interacts poorly with BatchNorm, so use one or the other in the same block."
