Probability Distributions in Machine Learning

The Connection: Loss Functions as Log-Likelihoods

Many standard loss functions are the negative log-likelihood of an assumed output distribution, up to constants and scaling:

MSE loss:
  Assumes: y ~ Normal(ŷ, σ²)  (Gaussian errors, fixed σ)
  Loss = -(1/n) log P(y | ŷ) = (1/(2σ²n)) Σ (yᵢ - ŷᵢ)² + const
  
  Minimising MSE = Maximum Likelihood Estimation under Gaussian noise
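As a quick numerical check of the relationship above, here is a minimal sketch in plain Python. The data, the fixed noise scale `sigma`, and the helper names `gaussian_nll` and `sse` are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_nll(y, y_hat, sigma):
    """Total negative log-likelihood of y under Normal(y_hat, sigma^2)."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma**2) + (yi - mi) ** 2 / (2 * sigma**2)
        for yi, mi in zip(y, y_hat)
    )

def sse(y, y_hat):
    """Sum of squared errors."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, y_hat))

y = [1.0, 2.0, 3.5]      # illustrative targets
y_hat = [0.8, 2.4, 3.0]  # illustrative predictions
sigma = 1.5              # assumed fixed noise scale

# Additive term that does not depend on the predictions
const = len(y) * 0.5 * math.log(2 * math.pi * sigma**2)

nll = gaussian_nll(y, y_hat, sigma)
# NLL = const + SSE / (2 sigma^2), so it has the same minimiser in y_hat as MSE
rebuilt = const + sse(y, y_hat) / (2 * sigma**2)
```

Because `const` and the factor `1 / (2 * sigma**2)` do not depend on `y_hat`, minimising the NLL and minimising the squared error pick the same predictions.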

Binary cross-entropy:
  Assumes: y ~ Bernoulli(p̂)  (binary outcome)
  Loss = -[y log(p̂) + (1-y) log(1-p̂)]
  
  Minimising BCE = MLE under Bernoulli distribution
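The identity can be checked for a single label with a small sketch. The probability `p_hat` and the helper names below are hypothetical, chosen only for illustration.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one label y in {0, 1} and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bernoulli_pmf(y, p):
    """P(Y = y) for Y ~ Bernoulli(p)."""
    return p if y == 1 else 1 - p

p_hat = 0.7  # illustrative predicted probability of the positive class

loss_pos = bce(1, p_hat)
loss_neg = bce(0, p_hat)
nll_pos = -math.log(bernoulli_pmf(1, p_hat))
nll_neg = -math.log(bernoulli_pmf(0, p_hat))
```

For either label, the BCE value and the negative log of the Bernoulli probability mass agree, which is all the "BCE = MLE" statement says for one sample.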

Categorical cross-entropy:
  Assumes: y ~ Categorical(p̂₁, ..., p̂ₖ)  (multi-class)
  Loss = -Σₖ yₖ log(p̂ₖ)   (one-hot y)
  
  Minimising CE = MLE under categorical distribution
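With a one-hot target, the sum over classes collapses to the negative log-probability of the true class. A minimal sketch, with made-up logits and hypothetical helper names:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Categorical cross-entropy for a single one-hot target."""
    return -sum(yk * math.log(pk) for yk, pk in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]  # illustrative model outputs
probs = softmax(logits)
one_hot = [0, 1, 0]        # true class is index 1

ce = cross_entropy(one_hot, probs)
nll_true_class = -math.log(probs[1])
```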

MAE (L1) loss:
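  Assumes: y ~ Laplace(ŷ, b)  (Laplace errors, heavier tails than Gaussian)
  Loss = -log P(y | ŷ) = (1/b) Σ |yᵢ - ŷᵢ| + const
  
  More robust to outliers than MSE: the Laplace density has heavier tails, so large residuals are penalised linearly rather than quadratically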
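One way to see the robustness is to fit a single constant prediction: squared error is minimised by the mean, absolute error by the median. The sketch below uses made-up data and hypothetical helper names.

```python
import statistics

def best_constant_mse(ys):
    """Constant prediction minimising the sum of squared errors: the mean."""
    return statistics.fmean(ys)

def best_constant_mae(ys):
    """Constant prediction minimising the sum of absolute errors: the median."""
    return statistics.median(ys)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 500.0]  # last point replaced by an outlier

# How far each fit moves when the outlier is introduced
mse_shift = best_constant_mse(with_outlier) - best_constant_mse(clean)
mae_shift = best_constant_mae(with_outlier) - best_constant_mae(clean)
```

Here the MSE fit moves by 99 while the MAE fit does not move at all.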

Output Distributions

Python

import torch
import torch.nn as nn
import torch.nn.functional as F

# Bernoulli: binary classification
class BinaryClassifier(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.fc = nn.Linear(d_in, 1)
    
    def forward(self, x):
        logit = self.fc(x)
        p = torch.sigmoid(logit)   # p = P(y=1 | x); y | x ~ Bernoulli(p)
        return p

# Categorical: multi-class classification
class MultiClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.fc = nn.Linear(d_in, n_classes)
    
    def forward(self, x):
        logits = self.fc(x)
        probs = F.softmax(logits, dim=-1)  # probs[..., k] = P(class=k | x); y | x ~ Categorical(probs)
        return probs

# Gaussian: regression with uncertainty
class GaussianRegressor(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.mean_head = nn.Linear(d_in, 1)   # predict mean
        self.log_std_head = nn.Linear(d_in, 1)  # predict log std
    
    def forward(self, x):
        mu = self.mean_head(x)
        log_sigma = self.log_std_head(x)
        sigma = torch.exp(log_sigma.clamp(-5, 5))  # prevent extreme values
        return mu, sigma   # N(mu, sigma²)
    
    def nll_loss(self, x, y):
        mu, sigma = self.forward(x)
        dist = torch.distributions.Normal(mu, sigma)
        return -dist.log_prob(y).mean()
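The classifiers above return probabilities. A common alternative, sketched here as an assumption rather than as the article's method, is to return raw logits and use PyTorch's logits-based losses, which are more numerically stable. The check below uses random illustrative tensors to show that these losses equal the mean negative log-likelihood under the matching `torch.distributions` object.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli, Categorical

torch.manual_seed(0)

# Binary case: raw logits and float targets in {0., 1.}
logits_bin = torch.randn(8)
targets_bin = torch.randint(0, 2, (8,)).float()
bce = F.binary_cross_entropy_with_logits(logits_bin, targets_bin)
nll_bin = -Bernoulli(logits=logits_bin).log_prob(targets_bin).mean()

# Multi-class case: raw logits and integer class indices
logits_multi = torch.randn(8, 5)
targets_multi = torch.randint(0, 5, (8,))
ce = F.cross_entropy(logits_multi, targets_multi)
nll_multi = -Categorical(logits=logits_multi).log_prob(targets_multi).mean()
```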

Regularisation as Priors

L2 regularisation = Gaussian prior on weights
  loss = MSE + λ Σ wᵢ²
  Bayesian view: MAP with prior w ~ N(0, 1/(2λ)), taking the data term to be the exact negative log-likelihood
  Effect: weights pulled toward 0, variance reduced
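The link between the penalty and the prior can be verified numerically. This is a minimal sketch with made-up weights; `lam`, `tau2`, and the helper name are illustrative assumptions.

```python
import math

def neg_log_gaussian_prior(w, tau2):
    """-log N(w; 0, tau2) for a scalar weight."""
    return 0.5 * math.log(2 * math.pi * tau2) + w**2 / (2 * tau2)

lam = 0.01
tau2 = 1 / (2 * lam)        # prior variance implied by the L2 strength
weights = [0.3, -1.2, 2.0]  # illustrative weights

neg_log_prior = sum(neg_log_gaussian_prior(w, tau2) for w in weights)
l2_penalty = lam * sum(w**2 for w in weights)
# Normalising term of the Gaussian, independent of the weights
const = len(weights) * 0.5 * math.log(2 * math.pi * tau2)
```

The negative log-prior equals `l2_penalty + const`, so adding the L2 penalty to a negative log-likelihood gives the negative log-posterior up to a constant.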

L1 regularisation = Laplace prior on weights
  loss = MSE + λ Σ |wᵢ|
  Bayesian view: MAP with prior w ~ Laplace(0, 1/λ), taking the data term to be the exact negative log-likelihood
  Effect: weights pulled toward exactly 0 (sparsity)
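The difference between "toward 0" and "exactly 0" shows up in a one-dimensional toy problem, where both penalised minimisers have closed forms. The sketch assumes a quadratic data term `0.5 * (w - a)**2`, with `a` the unregularised solution; the function names and values are illustrative.

```python
import math

def l1_solution(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*|w|, i.e. soft-thresholding."""
    magnitude = max(abs(a) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return math.copysign(magnitude, a)

def l2_solution(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*w**2, i.e. proportional shrinkage."""
    return a / (1 + 2 * lam)

lam = 0.5
unregularised = [2.0, 0.3, -0.1, -1.5]  # illustrative unpenalised weights

l1_weights = [l1_solution(a, lam) for a in unregularised]
l2_weights = [l2_solution(a, lam) for a in unregularised]
```

Under L1, every weight with `|a| <= lam` becomes exactly zero; under L2, every weight is scaled down but none reaches zero.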

Dropout ≈ approximate posterior inference (one Bayesian reading)
  Each forward pass samples a different "thinned" network
  Binary masks ~ Bernoulli(1-dropout_p), sampled per unit (activation), not per weight
  At inference: use the full network, which approximates averaging over the sampled masks
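A small Monte Carlo sketch makes the averaging argument concrete. It uses inverted dropout, where surviving activations are rescaled by 1/(1 - dropout_p) during training (the convention `torch.nn.Dropout` follows); the activations and sample count are arbitrary illustrative choices.

```python
import random

def dropout(activations, drop_p, rng):
    """Inverted dropout: zero each unit with prob drop_p, rescale survivors."""
    keep_p = 1.0 - drop_p
    return [a / keep_p if rng.random() < keep_p else 0.0 for a in activations]

rng = random.Random(0)
activations = [2.0, -1.0, 0.5]  # illustrative layer outputs
drop_p = 0.2
n_samples = 20000

# Average the masked activations over many sampled masks
sums = [0.0] * len(activations)
for _ in range(n_samples):
    for i, v in enumerate(dropout(activations, drop_p, rng)):
        sums[i] += v
mc_mean = [s / n_samples for s in sums]
```

The averages in `mc_mean` come out close to the unmasked activations, which is why a single pass through the full network stands in for the ensemble at this layer. For a deep non-linear network the match is only approximate.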

Python

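import torch
import torch.nn as nn

# L2 regularisation as weight_decay in the optimiser
# (model is assumed to be an nn.Module defined elsewhere)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay adds weight_decay * w to each gradient, i.e. a penalty of
# (weight_decay / 2) * ‖w‖² on the loss, so λ = weight_decay / 2.
# With Adam this term passes through the adaptive scaling;
# torch.optim.AdamW applies decoupled weight decay instead.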

# L1 regularisation (not built-in, add manually)
def l1_loss(model, lambda_l1: float = 1e-4) -> torch.Tensor:
    return lambda_l1 * sum(p.abs().sum() for p in model.parameters())

# Training step with L1 (criterion, outputs, targets come from the training loop)
loss = criterion(outputs, targets) + l1_loss(model)
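To make the meaning of `weight_decay` concrete, the following sketch compares one step of plain SGD with `weight_decay=wd` against one step with an explicit penalty of `(wd / 2) * ‖w‖²` added to the loss. Plain SGD is used on purpose, since adaptive optimisers rescale the gradient and the comparison is less direct there. The model size, data, and learning rate are arbitrary assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
wd = 1e-2
x, y = torch.randn(16, 4), torch.randn(16, 1)  # illustrative data

model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)  # identical starting weights

# (a) decay handled by the optimiser
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1, weight_decay=wd)
opt_a.zero_grad()
F.mse_loss(model_a(x), y).backward()
opt_a.step()

# (b) explicit penalty over all parameters, no optimiser decay
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
opt_b.zero_grad()
penalty = 0.5 * wd * sum(p.pow(2).sum() for p in model_b.parameters())
(F.mse_loss(model_b(x), y) + penalty).backward()
opt_b.step()

# Largest parameter difference between the two models after one step
max_diff = max(
    (pa - pb).abs().max().item()
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
```

The two updates agree up to floating-point error. Note that `weight_decay` applies to every parameter passed to the optimiser, biases included, which is why the explicit penalty sums over all of them.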

Generative Models and Distributions

Python

# Variational Autoencoder: latent space ~ Normal(μ, σ²)
class VAE(nn.Module):
    def __init__(self, d_input: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_input, 256), nn.ReLU())
        self.mu_head     = nn.Linear(256, d_latent)
        self.log_var_head = nn.Linear(256, d_latent)
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(),
            nn.Linear(256, d_input), nn.Sigmoid(),
        )
    
    def encode(self, x):
        h = self.encoder(x)
        return self.mu_head(h), self.log_var_head(h)
    
    def reparameterise(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)   # sample ε ~ N(0, I)
        return mu + eps * std          # z = μ + σ * ε ~ N(μ, σ²)
    
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterise(mu, log_var)
        x_recon = self.decoder(z)
        return x_recon, mu, log_var
    
    def loss(self, x, x_recon, mu, log_var):
        # Reconstruction loss: Bernoulli NLL (binary cross-entropy); assumes x in [0, 1]
        recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
        
        # KL divergence: D_KL(N(μ,σ²) || N(0,1))
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        
        return recon_loss + kl_loss
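The closed-form KL term can be cross-checked against `torch.distributions.kl_divergence`. The tensors below are random stand-ins for encoder outputs; the shapes are arbitrary assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4, 3)       # illustrative batch of 4, latent size 3
log_var = torch.randn(4, 3)

# Closed form used in the VAE loss, summed over batch and latent dimensions
closed_form = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Same quantity from the library: KL(q || p) per dimension, then summed
q = Normal(mu, torch.exp(0.5 * log_var))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
library_kl = kl_divergence(q, p).sum()
```

Both values agree, and both are non-negative, reaching zero only when `mu` is 0 and `log_var` is 0 everywhere.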

Mixture Distributions

Python

# Gaussian Mixture Model: P(x) = Σₖ πₖ × N(x; μₖ, σₖ²)
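import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a 2-component GMM
# (X_train, X_test: arrays of shape (n_samples, n_features), assumed to be defined)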
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=42)
gmm.fit(X_train)

# Score new samples (per-sample log-density, log P(x))
log_probs = gmm.score_samples(X_test)

# Use for anomaly detection: low P(x) → anomalous
threshold = np.percentile(gmm.score_samples(X_train), 5)  # 5th percentile of training
anomalies = X_test[log_probs < threshold]
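To show what a mixture log-density looks like without a library, here is a minimal one-dimensional sketch of log P(x) = log Σₖ πₖ N(x; μₖ, σₖ²), computed with log-sum-exp for numerical stability. It illustrates the formula only and is not a description of how scikit-learn implements `score_samples`; the component parameters are made up.

```python
import math

def normal_logpdf(x, mean, std):
    """log N(x; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def mixture_logpdf(x, weights, means, stds):
    """log sum_k pi_k N(x; mu_k, sigma_k^2), via log-sum-exp."""
    terms = [
        math.log(w) + normal_logpdf(x, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Illustrative two-component mixture
weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]

typical = mixture_logpdf(3.0, weights, means, stds)   # at a component mean
unusual = mixture_logpdf(10.0, weights, means, stds)  # far from both components
```

A point far from every component gets a much lower log-density, which is the signal the percentile threshold above relies on.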

Interview Answer

"Many standard loss functions correspond to a distributional assumption about the output. MSE assumes Gaussian errors (minimising MSE = MLE under N(ŷ, σ²) with fixed σ); binary cross-entropy assumes Bernoulli outputs (MLE under Bernoulli(p̂)); categorical cross-entropy assumes categorical outputs (MLE under a Categorical distribution). Regularisation acts as a prior on the weights: L2 weight decay corresponds to a Gaussian prior (MAP estimation), L1 to a Laplace prior. Generative models like VAEs model distributions explicitly: the latent prior is Gaussian by design, and the KL divergence term in the loss pushes the encoder's output distribution toward that prior. Understanding this connection makes model design principled: choose the distribution whose assumptions match your data."
