Learnixo

Statistics & Math for AI/ML Interviews · Lesson 18 of 30

Prior, Likelihood, Posterior

The Three Terms

P(H | E) = P(E | H) × P(H) / P(E)
            ──────────  ──────  ─────
            Likelihood   Prior  Evidence

Prior P(H):
  Your belief about H BEFORE seeing the evidence
  "Based on population data, 5% of patients have this condition"
  
Likelihood P(E | H):
  How probable is the observed evidence IF the hypothesis were true?
  "If the patient has the condition, how likely is this test result?"
  
Posterior P(H | E):
  Your UPDATED belief about H AFTER seeing the evidence
  "Given this test result, what's the probability the patient has the condition?"
  
Evidence P(E) = Σ P(E | Hᵢ) × P(Hᵢ):
  Normalisation constant — ensures posterior sums to 1
  Computed via law of total probability
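
To make the four terms concrete, here is a minimal sketch of the diagnostic-test example. The 5% prior comes from the text above; the sensitivity (0.90) and false-positive rate (0.10) are assumed values chosen only for illustration, and `posterior_probability` is a hypothetical helper name.

```python
def posterior_probability(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) for a binary hypothesis via Bayes' rule."""
    # Evidence P(E) by the law of total probability over H and not-H
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / evidence


prior = 0.05            # P(H): 5% of patients have the condition (from the text)
sensitivity = 0.90      # P(E | H): assumed for illustration
false_positive = 0.10   # P(E | not H): assumed for illustration

posterior = posterior_probability(prior, sensitivity, false_positive)
print(f"P(condition | positive test) = {posterior:.3f}")
```

Under these assumed numbers a positive test raises the probability from 5% to roughly 32%. The low prior keeps the posterior well below the test's sensitivity.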

Choosing Priors

Types of priors:

Informative prior:
  Strong belief based on prior knowledge
  Example: P(warfarin dose > 20mg/day) ≈ 0, because such doses are clinically very unusual
  Use when: domain knowledge is strong and should constrain the model

Weakly informative prior:
  Some structure but broad uncertainty
  Example: weight ~ Normal(0, 1) for regression coefficients (regularisation!)
  Use when: you have general domain knowledge but not specific values

Non-informative (flat) prior:
  P(H) = constant for all values of H
  "All hypotheses are equally likely before seeing data"
  Problem: "equally likely" depends on parameterisation
  Better: Jeffreys prior (invariant to reparameterisation)
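
The parameterisation problem can be checked with a change of variables. The sketch below assumes theta ~ Uniform(0, 1) and computes the implied density of the log-odds phi = log(theta / (1 - theta)); the function name is hypothetical.

```python
import math


def logit_density_under_flat_theta(phi):
    """Density of phi = log(theta / (1 - theta)) when theta ~ Uniform(0, 1).

    Change of variables: p(phi) = p(theta) * |d theta / d phi|
                                = 1 * theta * (1 - theta).
    """
    theta = 1.0 / (1.0 + math.exp(-phi))
    return theta * (1.0 - theta)


for phi in (-4.0, 0.0, 4.0):
    print(f"phi = {phi:+.1f}  density = {logit_density_under_flat_theta(phi):.4f}")
```

The density peaks at 0.25 for phi = 0 and falls off in both tails, so a prior that is flat in theta is not flat in log-odds.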

Empirical Bayes:
  Estimate the prior from the data itself
  Semi-Bayesian: the data is used twice (to fit the prior and to update it), which strict Bayesians object to, but it works well in practice
  Example: estimate mean and variance of prior from the training set
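
One common way to do this is the method of moments: match the mean and variance of a Beta prior to the observed rates. The ward rates below are made-up numbers, and the variable names are illustrative only.

```python
from statistics import mean, variance

# Hypothetical observed sepsis rates from six wards (made-up numbers)
ward_rates = [0.10, 0.15, 0.08, 0.20, 0.12, 0.18]

m = mean(ward_rates)
v = variance(ward_rates)  # sample variance

# Method of moments for Beta(a, b): mean = a / (a + b),
# variance = mean * (1 - mean) / (a + b + 1)
total = m * (1.0 - m) / v - 1.0   # estimate of a + b
prior_a = m * total
prior_b = (1.0 - m) * total

print(f"Fitted prior: Beta({prior_a:.2f}, {prior_b:.2f}), mean = {m:.3f}")
```

The fitted Beta can then serve as the prior for a new ward with little data of its own. The estimate is only valid when the sample variance is smaller than m * (1 - m), which keeps `total` positive.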

Python: Visualising Prior and Posterior

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Example: estimating probability of sepsis from clinical signs
# Using Beta distribution (conjugate prior for binomial likelihood)

# Prior: Beta(α=2, β=10), i.e. we expect sepsis to be uncommon
# Roughly encodes: "seen 2 sepsis cases and 10 non-sepsis in prior experience"
prior_a, prior_b = 2, 10
theta = np.linspace(0, 1, 1000)

prior_dist = beta(prior_a, prior_b)

# Evidence: 8 of 20 new patients with these signs had sepsis
n_patients = 20
n_sepsis = 8
n_no_sepsis = n_patients - n_sepsis

# Posterior: Beta(α + successes, β + failures)
# (Beta-Binomial conjugacy: posterior is also Beta)
posterior_a = prior_a + n_sepsis
posterior_b = prior_b + n_no_sepsis
posterior_dist = beta(posterior_a, posterior_b)

print(f"Prior mean: {prior_a / (prior_a + prior_b):.3f}")     # 0.167
print(f"Posterior mean: {posterior_a / (posterior_a + posterior_b):.3f}")  # 10/32 = 0.312
print(f"MLE (from data alone): {n_sepsis / n_patients:.3f}")  # 0.400
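# The posterior mean lies between the prior mean and the MLE

# 95% credible interval: the Bayesian counterpart of a confidence interval
lower, upper = posterior_dist.ppf([0.025, 0.975])
print(f"95% Credible interval: ({lower:.3f}, {upper:.3f})")

# Plot both densities so the shift from prior to posterior is visible
plt.plot(theta, prior_dist.pdf(theta), label=f"Prior Beta({prior_a}, {prior_b})")
plt.plot(theta, posterior_dist.pdf(theta), label=f"Posterior Beta({posterior_a}, {posterior_b})")
plt.axvline(n_sepsis / n_patients, linestyle="--", color="grey", label="MLE")
plt.xlabel("theta (probability of sepsis)")
plt.ylabel("density")
plt.legend()
plt.show()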
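
How far the posterior sits from the prior depends on sample size. The sketch below is an illustration with made-up numbers: it holds the observed rate at 40% and compares two deliberately opposed Beta priors as n grows. `posterior_mean` is a hypothetical helper.

```python
def posterior_mean(a, b, k, n):
    """Mean of the Beta(a + k, b + n - k) posterior."""
    return (a + k) / (a + b + n)


# Two deliberately different priors (both made up for illustration)
priors = {"sceptical Beta(2, 10)": (2.0, 10.0), "optimistic Beta(10, 2)": (10.0, 2.0)}

gaps = []
for n in (20, 200, 20000):
    k = (2 * n) // 5   # keep the observed rate fixed at 40%
    means = [posterior_mean(a, b, k, n) for a, b in priors.values()]
    gaps.append(abs(means[0] - means[1]))
    print(f"n = {n:>5}: " + ", ".join(f"{m:.3f}" for m in means))
```

The gap between the two posterior means shrinks as n grows: the prior contributes a fixed number of pseudo-counts, while the data contributes n.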

Conjugate Priors

A prior is conjugate to a likelihood when the posterior belongs to the same distribution family as the prior.
The Bayesian update then reduces to closed-form arithmetic on the parameters.

Common conjugate pairs:
  Likelihood                 →  Prior                 →  Posterior
  ─────────────────────────────────────────────────────────────────────────
  Binomial                   →  Beta(α, β)            →  Beta(α+k, β+n-k)
  Gaussian (known variance)  →  Gaussian on the mean  →  Gaussian
  Poisson                    →  Gamma(α, rate β)      →  Gamma(α+Σxᵢ, β+n)
  Multinomial                →  Dirichlet(α)          →  Dirichlet(α+counts)
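
As a sketch of how a conjugate update works beyond the Beta-Binomial case, here is the Poisson-Gamma pair with a Gamma prior in shape/rate form. The prior parameters and the daily counts are made-up values.

```python
# Gamma(shape, rate) prior on a Poisson rate; all numbers are made up
prior_shape, prior_rate = 2.0, 1.0      # prior mean = shape / rate = 2.0

# Hypothetical daily admission counts
counts = [3, 5, 4, 6, 2]

# Conjugate update: shape += sum of counts, rate += number of observations
post_shape = prior_shape + sum(counts)
post_rate = prior_rate + len(counts)

prior_mean = prior_shape / prior_rate
post_mean = post_shape / post_rate
mle = sum(counts) / len(counts)

print(f"Prior mean: {prior_mean:.3f}")
print(f"Posterior mean: {post_mean:.3f}")
print(f"MLE: {mle:.3f}")
```

No integration is needed: the update is two additions, and the posterior mean again falls between the prior mean and the MLE.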

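Dirichlet is the conjugate prior for the categorical/multinomial likelihood, so it fits token probabilities in count-based language models:
  Prior: Dirichlet(α) for each n-gram distribution
  Data: observed token counts
  Posterior: Dirichlet(α + counts)

  The posterior mean is additive (Laplace/Lidstone) smoothing:
  Adding α to counts before normalising = using a Dirichlet prior
  (Kneser-Ney is a discounting method and does not arise from a Dirichlet prior)

Python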
import numpy as np

# Example: prior over next token in clinical text
# Vocabulary: {warfarin, aspirin, ibuprofen, other}
# Prior: Dirichlet(α = [2, 1, 1, 10])  expect "other" most common
prior_alpha = np.array([2.0, 1.0, 1.0, 10.0])

# Observed: in warfarin-related notes, counts = [50, 5, 2, 43]
observed_counts = np.array([50, 5, 2, 43])

# Posterior
posterior_alpha = prior_alpha + observed_counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(f"Posterior probabilities: {posterior_mean}")
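
The claim that adding α to counts is the same as using a Dirichlet prior can be verified directly. This is a minimal sketch with a symmetric prior and made-up counts; both function names are hypothetical.

```python
def additive_smoothing(counts, alpha):
    """Add-alpha (Lidstone) smoothing; alpha = 1 gives Laplace smoothing."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def dirichlet_posterior_mean(prior_alpha, counts):
    """Mean of Dirichlet(prior_alpha + counts)."""
    post = [a + c for a, c in zip(prior_alpha, counts)]
    total = sum(post)
    return [p / total for p in post]


counts = [50, 5, 2, 0]   # made-up counts; the last token was never observed
alpha = 1.0

smoothed = additive_smoothing(counts, alpha)
bayes = dirichlet_posterior_mean([alpha] * len(counts), counts)

print(smoothed)
print(bayes)
```

Both lists are identical, and the unseen token receives a small non-zero probability instead of zero.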

Maximum Likelihood vs Bayesian

MLE (Maximum Likelihood Estimation):
  θ_MLE = argmax_θ P(data | θ)
  Maximise the likelihood alone; no prior is used
  Coincides with the MAP estimate under a flat prior
  Problem: overfits with small data (e.g. 3 successes in 3 trials gives θ = 1)

MAP (Maximum A Posteriori):
  θ_MAP = argmax_θ P(θ | data) = argmax_θ P(data | θ) × P(θ)
  Maximise the posterior; the log-prior acts as a regularisation penalty
  Gaussian prior on weights ⇒ L2 (ridge) penalty
  Laplace prior on weights  ⇒ L1 (lasso) penalty
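
The link between a Gaussian prior and L2 regularisation can be checked numerically. The sketch below assumes a one-parameter model y = w * x + noise with known noise variance; the data, `sigma2`, and `tau2` are made-up values. It compares a brute-force MAP estimate with the closed-form ridge solution using penalty strength sigma2 / tau2.

```python
# Made-up 1-D data for y ≈ w * x
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]

sigma2 = 1.0   # assumed noise variance
tau2 = 0.5     # assumed prior variance: w ~ Normal(0, tau2)
lam = sigma2 / tau2   # the implied L2 penalty strength


def neg_log_posterior(w):
    """Negative log posterior up to an additive constant."""
    nll = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * sigma2)
    neg_log_prior = w ** 2 / (2.0 * tau2)
    return nll + neg_log_prior


sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_mle = sxy / sxx             # least squares, no prior
w_ridge = sxy / (sxx + lam)   # closed-form ridge solution

# Brute-force MAP: minimise the negative log posterior on a fine grid
grid = [i / 10000.0 for i in range(0, 40001)]
w_map = min(grid, key=neg_log_posterior)

print(f"MLE: {w_mle:.4f}  ridge: {w_ridge:.4f}  grid MAP: {w_map:.4f}")
```

The grid MAP matches the ridge estimate to within the grid spacing, and both are shrunk toward zero relative to the MLE. A tighter prior (smaller `tau2`) means a larger penalty.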

Full Bayesian inference:
  Don't pick a single θ — integrate over the full posterior distribution
  Gives uncertainty estimates that are well calibrated when the model and prior are reasonable
  Computationally expensive (MCMC, variational inference)
  Used in: Bayesian deep learning, Gaussian processes, probabilistic programming
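
The difference between the three approaches is easiest to see on a tiny sample. This sketch assumes a Beta(2, 2) prior and a made-up dataset of 3 positives out of 3 and reports the MLE, the MAP estimate, and the posterior mean (one summary of the full posterior).

```python
# Tiny sample: 3 patients, all with the outcome (made-up data)
k, n = 3, 3

# Assumed Beta(2, 2) prior, mildly favouring values near 0.5
a, b = 2.0, 2.0

mle = k / n
# Mode of the Beta(a + k, b + n - k) posterior (valid when both parameters > 1)
map_estimate = (a + k - 1) / (a + b + n - 2)
posterior_mean = (a + k) / (a + b + n)

print(f"MLE: {mle:.3f}")
print(f"MAP: {map_estimate:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")
```

The MLE claims certainty (1.000) from three observations, while the MAP estimate (0.800) and the posterior mean (about 0.714) are pulled back toward the prior.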

Interview Answer

"The three Bayes terms: the prior P(H) encodes your belief before seeing data; the likelihood P(E|H) is how probable the evidence is assuming H is true; the posterior P(H|E) is the updated belief combining both. The evidence P(E) is a normalisation constant that does not depend on H. Prior choice matters most with small datasets, where a strongly informative prior pulls the posterior toward domain knowledge. With large data the likelihood dominates, and posteriors from different priors converge as long as each prior gives the true value non-zero probability. In ML: L2 regularisation corresponds to MAP estimation with a Gaussian prior on the weights; L1 corresponds to a Laplace prior. Full Bayesian inference integrates over all parameter values instead of picking one, which yields uncertainty estimates alongside predictions. It is computationally expensive but valuable when uncertainty quantification matters (clinical AI, safety-critical systems)."