Probability Distributions in Machine Learning
How specific probability distributions appear inside ML models — loss functions, outputs, regularisation, and generative models.
The Connection: Loss Functions as Log-Likelihoods
Most standard loss functions are the negative log-likelihood of an assumed output distribution, up to constants that do not change the minimiser:
MSE loss:
Assumes: y ~ Normal(ŷ, σ²) (Gaussian errors)
Loss = -(1/n) log P(y | ŷ) = (1/(2σ²n)) Σ (yᵢ - ŷᵢ)² + const (for fixed σ)
Minimising MSE = Maximum Likelihood Estimation under Gaussian noise
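The MSE entry can be checked numerically. The sketch below uses plain Python with made-up targets and predictions and an assumed fixed σ. It suggests that the mean Gaussian negative log-likelihood equals MSE / (2σ²) plus a constant that does not depend on the predictions, so both objectives rank predictions identically.

```python
import math

def mse(y, y_hat):
    """Mean squared error."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Mean negative log-likelihood of y under Normal(y_hat, sigma^2)."""
    log_norm = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(log_norm + (yi - pi) ** 2 / (2 * sigma ** 2)
               for yi, pi in zip(y, y_hat)) / len(y)

# Illustrative data (hypothetical values, not from the article)
y = [1.0, 2.0, 3.5]
pred_a = [0.8, 2.1, 3.0]
pred_b = [1.5, 1.0, 4.0]
sigma = 1.0  # assumed fixed noise scale

const = 0.5 * math.log(2 * math.pi * sigma ** 2)
for pred in (pred_a, pred_b):
    # The gap should equal the same constant for every set of predictions
    gap = gaussian_nll(y, pred, sigma) - mse(y, pred) / (2 * sigma ** 2)
    print(f"NLL - MSE/(2*sigma^2) = {gap:.6f}  (constant = {const:.6f})")
```

Because the gap is the same for both sets of predictions, whichever prediction has the lower MSE also has the higher Gaussian likelihood.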
Binary cross-entropy:
Assumes: y ~ Bernoulli(p̂) (binary outcome)
Loss = -[y log(p̂) + (1-y) log(1-p̂)]
Minimising BCE = MLE under Bernoulli distribution
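A small illustration of the BCE entry, assuming a model that outputs one constant probability p for every example (a deliberately simplified setting). The Bernoulli maximum-likelihood estimate is the fraction of positive labels, and a grid search over the BCE should land on the same value. The labels are made up.

```python
import math

def bce(y, p):
    """Mean binary cross-entropy of labels y under a single probability p."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y) / len(y)

# Illustrative labels (hypothetical): 3 positives out of 5
y = [1, 0, 1, 1, 0]

# Grid search over p in {0.01, 0.02, ..., 0.99}
grid = [i / 100 for i in range(1, 100)]
best_p = min(grid, key=lambda p: bce(y, p))

print(best_p)           # grid point with the lowest BCE
print(sum(y) / len(y))  # Bernoulli MLE: the fraction of positives
```

For these labels both values should agree at 0.6, which is the sense in which minimising BCE is maximum likelihood under a Bernoulli model.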
Categorical cross-entropy:
Assumes: y ~ Categorical(p̂₁, ..., p̂ₖ) (multi-class)
Loss = -Σₖ yₖ log(p̂ₖ) (one-hot y)
Minimising CE = MLE under categorical distribution
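With a one-hot target, the categorical cross-entropy sum keeps only the term for the true class, so it reduces to the negative log-probability the model assigns to that class. A minimal sketch in plain Python, with illustrative logits that are not from the article:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """-sum_k y_k * log(p_k) for a single example with a one-hot target."""
    return -sum(yk * math.log(pk) for yk, pk in zip(one_hot, probs) if yk > 0)

logits = [2.0, 1.0, 0.1]  # illustrative scores for 3 classes
probs = softmax(logits)
one_hot = [1, 0, 0]       # true class is index 0

ce = cross_entropy(one_hot, probs)
nll = -math.log(probs[0])  # negative log-probability of the true class
print(ce, nll)             # the two values should match
```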
MAE (L1) loss:
Assumes: y ~ Laplace(ŷ, b) (Laplace errors — heavier tails than Gaussian)
Loss = (1/n) Σ |yᵢ - ŷᵢ|
More robust to outliers than MSE: the Laplace density has heavier tails, so large residuals are penalised linearly rather than quadratically
Output Distributions
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bernoulli: binary classification
class BinaryClassifier(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.fc = nn.Linear(d_in, 1)

    def forward(self, x):
        logit = self.fc(x)
        p = torch.sigmoid(logit)  # p = P(y=1 | x), i.e. y | x ~ Bernoulli(p)
        return p

# Categorical: multi-class classification
class MultiClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.fc = nn.Linear(d_in, n_classes)

    def forward(self, x):
        logits = self.fc(x)
        probs = F.softmax(logits, dim=-1)  # probs[k] = P(class=k | x), i.e. y | x ~ Categorical(probs)
        return probs

# Gaussian: regression with uncertainty
class GaussianRegressor(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.mean_head = nn.Linear(d_in, 1)     # predict mean
        self.log_std_head = nn.Linear(d_in, 1)  # predict log std

    def forward(self, x):
        mu = self.mean_head(x)
        log_sigma = self.log_std_head(x)
        sigma = torch.exp(log_sigma.clamp(-5, 5))  # clamp log std to prevent extreme values
        return mu, sigma  # parameters of N(mu, sigma²)

    def nll_loss(self, x, y):
        # y must have shape (batch, 1) to match mu; a (batch,) target would broadcast to (batch, batch)
        mu, sigma = self.forward(x)
        dist = torch.distributions.Normal(mu, sigma)
        return -dist.log_prob(y).mean()

Regularisation as Priors
L2 regularisation = Gaussian prior on weights
loss = MSE + λ Σ wᵢ²
Bayesian view: MAP with prior w ~ N(0, 1/(2λ)) (variance 1/(2λ), taking the data term as the unscaled negative log-likelihood)
Effect: weights pulled toward 0, variance reduced
L1 regularisation = Laplace prior on weights
loss = MSE + λ Σ |wᵢ|
Bayesian view: MAP with prior w ~ Laplace(0, 1/λ)
Effect: weights pulled toward exactly 0 (sparsity)
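The correspondence between the L2 and L1 penalties and their priors can be checked directly. In the sketch below the negative log-density of each prior, minus the matching penalty, should be the same constant for every weight value. The value of λ and the weights are illustrative, and the data term is assumed to be an unscaled negative log-likelihood.

```python
import math

def neg_log_gaussian(w, var):
    """-log N(w; 0, var)."""
    return 0.5 * math.log(2 * math.pi * var) + w ** 2 / (2 * var)

def neg_log_laplace(w, b):
    """-log Laplace(w; 0, b)."""
    return math.log(2 * b) + abs(w) / b

lam = 0.1                             # illustrative regularisation strength
weights = [-2.0, -0.5, 0.0, 0.3, 1.7]  # illustrative weight values

# Subtract the penalty from the negative log-prior; what remains should not depend on w
gauss_gaps = [neg_log_gaussian(w, 1 / (2 * lam)) - lam * w ** 2 for w in weights]
laplace_gaps = [neg_log_laplace(w, 1 / lam) - lam * abs(w) for w in weights]

print(gauss_gaps)    # (near-)identical entries: L2 penalty = Gaussian prior up to a constant
print(laplace_gaps)  # (near-)identical entries: L1 penalty = Laplace prior up to a constant
```

Since the leftover terms do not depend on w, adding the penalty to the loss changes the objective only by a constant relative to adding the negative log-prior, which is why the minimiser is the MAP estimate.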
Dropout ≈ approximate posterior inference (one common interpretation)
Each forward pass samples a different "thinned" network
Binary keep-mask ~ Bernoulli(1-dropout_p) per unit (activation), giving an implicit ensemble of thinned networks
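At inference: use full network (approximate posterior mean)
import torch
import torch.nn as nn

# L2 regularisation via weight_decay in the optimiser
# (`model` is assumed to be an nn.Module defined elsewhere)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Adam's weight_decay adds weight_decay * w to each gradient, i.e. an implicit (λ/2)‖w‖² penalty with λ = 1e-4
# (a Gaussian prior with σ² = 1/λ when the loss is an unscaled NLL); AdamW instead decouples the decay from the gradient

# L1 regularisation (not built in; add manually)
def l1_loss(model, lambda_l1: float = 1e-4) -> torch.Tensor:
    return lambda_l1 * sum(p.abs().sum() for p in model.parameters())

# Training step with L1 (`criterion`, `outputs`, `targets` come from the surrounding training loop)
loss = criterion(outputs, targets) + l1_loss(model)

Generative Models and Distributions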
# Variational Autoencoder: latent space ~ Normal(μ, σ²)
class VAE(nn.Module):
    def __init__(self, d_input: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_input, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, d_latent)
        self.log_var_head = nn.Linear(256, d_latent)
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(),
            nn.Linear(256, d_input), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.mu_head(h), self.log_var_head(h)

    def reparameterise(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)  # sample ε ~ N(0, I)
        return mu + eps * std        # z = μ + σ * ε ~ N(μ, σ²)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterise(mu, log_var)
        x_recon = self.decoder(z)
        return x_recon, mu, log_var

    def loss(self, x, x_recon, mu, log_var):
        # Reconstruction loss: Bernoulli NLL (binary cross-entropy)
        recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
        # KL divergence: D_KL(N(μ,σ²) || N(0,1))
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon_loss + kl_loss

Mixture Distributions
# Gaussian Mixture Model: P(x) = Σₖ πₖ × N(x; μₖ, σₖ²)
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a 2-component GMM (X_train and X_test are assumed to be arrays of shape (n_samples, n_features))
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=42)
gmm.fit(X_train)

# Per-sample log-likelihood, log P(x), for new samples
log_probs = gmm.score_samples(X_test)

# Use for anomaly detection: low P(x) → anomalous
threshold = np.percentile(gmm.score_samples(X_train), 5)  # 5th percentile of training log-likelihoods
anomalies = X_test[log_probs < threshold]

Interview Answer
"Most standard loss functions correspond to a distributional assumption about the output. MSE assumes Gaussian errors (minimising MSE = MLE under N(ŷ, σ²)); binary cross-entropy assumes Bernoulli outputs (MLE under Bernoulli(p̂)); categorical cross-entropy assumes categorical outputs (MLE under a Categorical distribution). Regularisation acts as a prior on the weights: L2 weight decay corresponds to a Gaussian prior (MAP estimation), L1 to a Laplace prior. Generative models like VAEs explicitly model distributions: the latent prior is Gaussian by design, and the KL divergence loss term pushes the encoder's output distribution toward that prior. Understanding this connection makes model design principled: choose the distribution whose assumptions match your data."
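The KL term mentioned in the answer has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. As a sanity check, the sketch below compares that closed form for a single latent dimension with a plain trapezoidal integration of the KL definition. The values of mu and log_var are illustrative, and the integration range and step count are arbitrary choices.

```python
import math

def kl_closed_form(mu, log_var):
    """Per-dimension KL(N(mu, sigma^2) || N(0, 1)) with log_var = log(sigma^2)."""
    return -0.5 * (1 + log_var - mu ** 2 - math.exp(log_var))

def log_normal_pdf(z, mu, var):
    """Log-density of N(mu, var) at z."""
    return -0.5 * math.log(2 * math.pi * var) - (z - mu) ** 2 / (2 * var)

def kl_numeric(mu, log_var, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal estimate of the integral of q(z) * [log q(z) - log p(z)]."""
    var = math.exp(log_var)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        log_q = log_normal_pdf(z, mu, var)    # encoder distribution q
        log_p = log_normal_pdf(z, 0.0, 1.0)   # standard normal prior p
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(log_q) * (log_q - log_p)
    return total * h

mu, log_var = 0.5, -1.0  # illustrative encoder outputs for one latent dimension
print(kl_closed_form(mu, log_var), kl_numeric(mu, log_var))  # should agree closely
```

The closed form is zero only when mu = 0 and log_var = 0, which is the sense in which the term pulls the encoder's distribution toward the prior.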