Prior, Likelihood, and Posterior
The three components of Bayesian inference: what each term means, how to choose priors, and how the posterior combines prior belief with evidence.
The Three Terms
P(H | E) = P(E | H) × P(H) / P(E)
Posterior = Likelihood × Prior / Evidence
Prior P(H):
Your belief about H BEFORE seeing the evidence
"Based on population data, 5% of patients have this condition"
Likelihood P(E | H):
How probable is the observed evidence IF the hypothesis were true?
"If the patient has the condition, how likely is this test result?"
Posterior P(H | E):
Your UPDATED belief about H AFTER seeing the evidence
"Given this test result, what's the probability the patient has the condition?"
Evidence P(E) = Σᵢ P(E | Hᵢ) × P(Hᵢ):
Normalisation constant: ensures the posterior sums to 1
Computed via the law of total probability
Choosing Priors
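Before turning to prior choice, the four quantities defined in The Three Terms can be combined in a small worked example. This is a sketch only: the 5% prior comes from the text, while the test's sensitivity and false-positive rate are made-up values chosen for illustration.

```python
# Prior P(H): prevalence quoted in the text
prior = 0.05

# Hypothetical test characteristics (assumptions, not from the source)
sensitivity = 0.90      # P(E | H): positive result given the condition
false_positive = 0.08   # P(E | not H): positive result without the condition

# Evidence P(E) via the law of total probability over H and not-H
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior P(H | E) via Bayes' rule
posterior = sensitivity * prior / evidence

print(f"P(E)     = {evidence:.4f}")
print(f"P(H | E) = {posterior:.4f}")
```

Under these assumed numbers the posterior is roughly 0.37: a positive result raises the probability well above the 5% prior, but most positives still come from the much larger group without the condition.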
Types of priors:
Informative prior:
Strong belief based on prior knowledge
Example: P(warfarin dose > 20 mg/day) ≈ 0, because such doses are clinically implausible for almost all patients
Use when: domain knowledge is strong and should constrain the model
Weakly informative prior:
Some structure but broad uncertainty
Example: weight ~ Normal(0, 1) for regression coefficients (regularisation!)
Use when: you have general domain knowledge but not specific values
Non-informative (flat) prior:
P(H) = constant for all values of H
"All hypotheses are equally likely before seeing data"
Problem: "equally likely" depends on parameterisation
Better: Jeffreys prior (invariant to reparameterisation)
Empirical Bayes:
Estimate the prior from the data itself
Semi-Bayesian: violates strict Bayesian principles (the data are used to set the prior) but is practical
Example: estimate the mean and variance of the prior from the training set
Python: Visualising Prior and Posterior
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Example: estimating probability of sepsis from clinical signs
# Using Beta distribution (conjugate prior for binomial likelihood)
# Prior: Beta(α=2, β=10), i.e. we expect sepsis to be uncommon
# Encodes: "seen 2 sepsis cases and 10 non-sepsis in prior experience"
prior_a, prior_b = 2, 10
theta = np.linspace(0, 1, 1000)
prior_dist = beta(prior_a, prior_b)
# Evidence: 8 of 20 new patients with these signs had sepsis
n_patients = 20
n_sepsis = 8
n_no_sepsis = n_patients - n_sepsis
# Posterior: Beta(α + successes, β + failures)
# (Beta-Binomial conjugacy: posterior is also Beta)
posterior_a = prior_a + n_sepsis
posterior_b = prior_b + n_no_sepsis
posterior_dist = beta(posterior_a, posterior_b)
print(f"Prior mean: {prior_a / (prior_a + prior_b):.3f}") # 0.167
print(f"Posterior mean: {posterior_a / (posterior_a + posterior_b):.3f}") # 0.312 (10/32 = 0.3125)
print(f"MLE (from data alone): {n_sepsis / n_patients:.3f}") # 0.400
# Posterior is pulled between prior and MLE
# Credible interval (Bayesian equivalent of confidence interval)
lower, upper = posterior_dist.ppf([0.025, 0.975])
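print(f"95% Credible interval: ({lower:.3f}, {upper:.3f})")
# Plot both densities so the shift from prior to posterior is visible
plt.plot(theta, prior_dist.pdf(theta), label=f"Prior Beta({prior_a}, {prior_b})")
plt.plot(theta, posterior_dist.pdf(theta), label=f"Posterior Beta({posterior_a}, {posterior_b})")
plt.axvline(n_sepsis / n_patients, linestyle="--", color="grey", label="MLE")
plt.xlabel("theta (probability of sepsis)")
plt.ylabel("Density")
plt.legend()
plt.show()
Conjugate Priors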
A conjugate prior is one where the posterior has the same distribution family as the prior.
This makes the update analytically tractable.
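Conjugacy can be checked numerically. The sketch below reuses the sepsis numbers from the previous example, multiplies likelihood by prior on a grid, normalises by brute force, and compares the result with the closed-form Beta posterior. The grid resolution is an arbitrary choice.

```python
import numpy as np
from scipy.stats import beta, binom

# Same numbers as the sepsis example: Beta(2, 10) prior, 8 positives out of 20
prior_a, prior_b = 2, 10
k, n = 8, 20

# Evenly spaced grid over theta for numerical integration
theta = np.linspace(0.0, 1.0, 2001)
d_theta = theta[1] - theta[0]

# Unnormalised posterior = likelihood x prior, evaluated pointwise
unnormalised = binom.pmf(k, n, theta) * beta.pdf(theta, prior_a, prior_b)

# Numerical normalisation (plays the role of dividing by the evidence)
grid_posterior = unnormalised / (unnormalised.sum() * d_theta)

# Closed-form conjugate result: Beta(a + k, b + n - k)
closed_form = beta.pdf(theta, prior_a + k, prior_b + n - k)

max_abs_diff = np.max(np.abs(grid_posterior - closed_form))
print(f"Largest pointwise difference: {max_abs_diff:.2e}")
```

The two curves should agree up to numerical integration error, which is the practical meaning of "analytically tractable": no grid or sampler is needed when the pair is conjugate.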
Common conjugate pairs:
Likelihood                 │ Prior      │ Posterior
───────────────────────────┼────────────┼──────────────────
Binomial                   │ Beta(α, β) │ Beta(α+k, β+n-k)
Gaussian (known variance)  │ Gaussian   │ Gaussian
Poisson                    │ Gamma      │ Gamma
Multinomial                │ Dirichlet  │ Dirichlet
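The Poisson row of the table can be sketched in the same way as the Beta example. The prior parameters and daily counts below are hypothetical, and the Gamma prior is written in shape/rate form, so the update is shape + total count and rate + number of observations.

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical prior on an event rate (events per day): Gamma(shape=3, rate=1)
prior_shape, prior_rate = 3.0, 1.0

# Hypothetical data: event counts observed on five days
counts = np.array([4, 6, 5, 7, 3])

# Conjugate update: add the total count to the shape, the sample size to the rate
post_shape = prior_shape + counts.sum()
post_rate = prior_rate + len(counts)

# scipy parameterises the Gamma by shape and scale, where scale = 1 / rate
posterior = gamma(a=post_shape, scale=1.0 / post_rate)

print(f"Prior mean rate:     {prior_shape / prior_rate:.3f}")
print(f"Sample mean (MLE):   {counts.mean():.3f}")
print(f"Posterior mean rate: {posterior.mean():.3f}")
lower, upper = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

As with the Beta case, the posterior mean lands between the prior mean and the sample mean.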
The Dirichlet is the conjugate prior for the categorical token probabilities in count-based language models:
Prior: Dirichlet(α) for each n-gram distribution
Data: observed token counts
Posterior: Dirichlet(α + counts)
This is essentially additive (Laplace/Lidstone) smoothing; Kneser-Ney is a discounting method and is not a Dirichlet update:
Adding α to counts before normalising = taking the posterior mean under a Dirichlet(α) prior
from scipy.stats import dirichlet
import numpy as np
# Example: prior over next token in clinical text
# Vocabulary: {warfarin, aspirin, ibuprofen, other}
# Prior: Dirichlet(α = [2, 1, 1, 10]), i.e. we expect "other" to be most common
prior_alpha = np.array([2.0, 1.0, 1.0, 10.0])
# Observed: in warfarin-related notes, counts = [50, 5, 2, 43]
observed_counts = np.array([50, 5, 2, 43])
# Posterior
posterior_alpha = prior_alpha + observed_counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
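print(f"Posterior probabilities: {posterior_mean}")
# Per-token posterior uncertainty from the Dirichlet variance
posterior_sd = np.sqrt(dirichlet.var(posterior_alpha))
print(f"Posterior std devs: {posterior_sd}")
Maximum Likelihood vs Bayesian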
MLE (Maximum Likelihood Estimation):
θ_MLE = argmax_θ P(data | θ)
Just maximise the likelihood and ignore the prior
Equivalent to MAP estimation with a flat prior
Problem: overfits with small data (extreme MLE estimates)
MAP (Maximum A Posteriori):
θ_MAP = argmax_θ P(θ | data) = argmax_θ P(data | θ) × P(θ)
Maximise the posterior; the prior acts as regularisation
Equivalent to: L2 regularisation with Gaussian prior
or L1 regularisation with Laplace prior
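A minimal sketch of the L2 case, assuming a linear model with Gaussian noise of known standard deviation sigma and an independent Normal(0, tau²) prior on each weight. The data, sigma and tau are all hypothetical. Under those assumptions the negative log posterior is a squared-error loss plus an L2 penalty of strength sigma²/tau², so maximising the posterior numerically should reproduce the ridge closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 30 samples, 3 features
n_samples, n_features = 30, 3
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([1.5, -2.0, 0.5])
sigma = 1.0   # noise standard deviation (assumed known)
tau = 0.5     # prior standard deviation of each weight (assumed)
y = X @ true_w + rng.normal(scale=sigma, size=n_samples)

def neg_log_posterior(w):
    # Gaussian likelihood term plus Gaussian prior term, constants dropped
    residual = y - X @ w
    return 0.5 * residual @ residual / sigma**2 + 0.5 * w @ w / tau**2

# MAP estimate by direct numerical optimisation of the posterior
w_map = minimize(neg_log_posterior, x0=np.zeros(n_features)).x

# Ridge closed form with penalty lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Unregularised least squares (the MLE) for comparison
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

print("MAP  :", np.round(w_map, 4))
print("Ridge:", np.round(w_ridge, 4))
print("MLE  :", np.round(w_mle, 4))
```

The MAP and ridge estimates should match to within optimiser tolerance, and both are shrunk toward zero relative to the MLE. A tighter prior (smaller tau) means a larger penalty.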
Full Bayesian inference:
Don't pick a single θ; integrate over the full posterior distribution
Gives uncertainty estimates alongside predictions (well calibrated when the model is well specified)
Computationally expensive (MCMC, variational inference)
Used in: Bayesian deep learning, Gaussian processes, probabilistic programming
Interview Answer
"The three Bayes terms: prior P(H) encodes your belief before seeing data; likelihood P(E|H) is how probable the evidence is assuming H is true; posterior P(H|E) is the updated belief combining both. The evidence P(E) is just a normalisation constant. Prior choice matters most with small datasets: a strongly informative prior pulls the posterior toward domain knowledge. With large data, the likelihood dominates and posteriors from different reasonable priors converge. In ML: L2 regularisation corresponds to a Gaussian prior on weights (MAP estimation); L1 corresponds to a Laplace prior. Full Bayesian inference integrates over all parameter values, yielding principled uncertainty estimates; it is computationally expensive but valuable when uncertainty quantification matters (clinical AI, safety-critical systems)."
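The claim in the interview answer that the prior matters most for small datasets can be checked with a short sketch. It assumes two deliberately different Beta priors and a fixed observed success rate of 0.4, both hypothetical, and tracks how far apart the two posterior means are as the sample size grows.

```python
from scipy.stats import beta

# Two deliberately different priors on a proportion (hypothetical choices)
priors = {"sceptical": (2, 10), "optimistic": (10, 2)}

observed_rate = 0.4  # fixed proportion of successes in the data (assumed)

gaps = []
for n in (10, 100, 1000, 10000):
    k = round(observed_rate * n)  # number of successes out of n
    means = {}
    for name, (a, b) in priors.items():
        # Conjugate Beta-Binomial update, then take the posterior mean
        means[name] = beta(a + k, b + n - k).mean()
    gap = abs(means["sceptical"] - means["optimistic"])
    gaps.append(gap)
    print(f"n={n:>5}  sceptical={means['sceptical']:.3f}  "
          f"optimistic={means['optimistic']:.3f}  gap={gap:.4f}")
```

For these two priors the gap between posterior means is 8 / (12 + n), so it shrinks steadily as n grows: with ten observations the priors disagree sharply, while with ten thousand they are practically indistinguishable.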