Statistics & Math for AI/ML Interviews · Lesson 17 of 30
Bayes' Theorem
The Formula
P(H | E) = P(E | H) × P(H) / P(E)
Where:
H = hypothesis (what we want to know)
E = evidence (what we observed)
P(H) = prior probability of H (before seeing E)
P(E | H) = likelihood of observing E if H is true
P(H | E) = posterior probability of H given E (after seeing E)
P(E) = marginal probability of E (normalisation constant)
In Words
Posterior = Likelihood × Prior / Evidence
Or: posterior ∝ likelihood × prior (proportional — before normalising)
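The proportional form is easiest to see with more than two hypotheses: multiply each prior by its likelihood, then divide by the sum so the results add to 1. A minimal Python sketch follows; the helper name `normalise_posterior`, the three hypothesis labels, and all numbers are made up purely for illustration.

```python
def normalise_posterior(
    priors: dict[str, float],
    likelihoods: dict[str, float],
) -> dict[str, float]:
    """Return P(H | E) for each hypothesis from P(H) and P(E | H)."""
    # Unnormalised posterior: likelihood x prior for each hypothesis
    unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
    # P(E) is the sum over all hypotheses (law of total probability)
    p_evidence = sum(unnormalised.values())
    if p_evidence == 0:
        raise ValueError("Evidence has zero probability under every hypothesis")
    return {h: value / p_evidence for h, value in unnormalised.items()}


# Hypothetical numbers, chosen only to illustrate the arithmetic
example_priors = {"A": 0.5, "B": 0.3, "C": 0.2}
example_likelihoods = {"A": 0.1, "B": 0.4, "C": 0.7}
example_posterior = normalise_posterior(example_priors, example_likelihoods)
for hypothesis, p in example_posterior.items():
    print(f"P({hypothesis} | E) = {p:.3f}")
```

Note that hypothesis C starts with the smallest prior but ends with the largest posterior, because the evidence is far more likely under C than under A or B.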
"Update your prior belief about H in light of the evidence E,
weighted by how likely E would be if H were true."
Medical Diagnosis Example
Disease: AF (atrial fibrillation)
Evidence: irregular pulse on clinical examination
Prior: P(AF) = 0.02 (2% prevalence in adults under 65)
Sensitivity: P(irregular pulse | AF) = 0.90
Specificity: P(regular pulse | no AF) = 0.90
False positive rate: P(irregular pulse | no AF) = 1 - specificity = 0.10
P(irregular) = P(irregular | AF) × P(AF) + P(irregular | no AF) × P(no AF)
= 0.90 × 0.02 + 0.10 × 0.98
= 0.018 + 0.098 = 0.116
P(AF | irregular pulse) = P(irregular | AF) × P(AF) / P(irregular)
= 0.90 × 0.02 / 0.116
= 0.018 / 0.116
≈ 0.155 (15.5%)
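The same result can be cross-checked with natural frequencies, which many people find easier to reason about than probabilities. The sketch below assumes a hypothetical cohort of 10,000 adults; the cohort size is arbitrary and is not part of the example above.

```python
cohort = 10_000              # hypothetical cohort size (an assumption)
prevalence = 0.02            # P(AF)
sensitivity = 0.90           # P(irregular | AF)
false_positive_rate = 0.10   # P(irregular | no AF)

with_af = cohort * prevalence                        # 200 people have AF
without_af = cohort - with_af                        # 9,800 do not
true_positives = with_af * sensitivity               # 180 have AF and an irregular pulse
false_positives = without_af * false_positive_rate   # 980 have an irregular pulse without AF

all_positives = true_positives + false_positives     # 1,160 irregular pulses in total
posterior = true_positives / all_positives
print(f"{true_positives:.0f} of {all_positives:.0f} irregular pulses are AF: {posterior:.3f}")
```

Counting people rather than multiplying probabilities makes the imbalance visible: false positives from the large healthy group outnumber true positives from the small AF group by more than five to one.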
Takeaway: despite a 90% sensitive test, only about 15.5% of patients with
an irregular pulse have AF, because AF is rare and false positives outnumber true positives.
Overlooking that low prior is the base rate fallacy.
Python Implementation
def bayes_theorem(
    p_hypothesis: float,                     # P(H): prior
    p_evidence_given_hypothesis: float,      # P(E | H): likelihood
    p_evidence_given_not_hypothesis: float,  # P(E | ¬H): false positive rate
) -> dict:
    """Compute posterior P(H | E) using Bayes' theorem."""
    p_not_hypothesis = 1 - p_hypothesis
    # P(E) via law of total probability
    p_evidence = (
        p_evidence_given_hypothesis * p_hypothesis
        + p_evidence_given_not_hypothesis * p_not_hypothesis
    )
    # Bayes' theorem
    p_hypothesis_given_evidence = (
        p_evidence_given_hypothesis * p_hypothesis / p_evidence
    )
    return {
        "prior": p_hypothesis,
        "likelihood": p_evidence_given_hypothesis,
        "p_evidence": p_evidence,
        "posterior": p_hypothesis_given_evidence,
        "posterior_update_factor": p_hypothesis_given_evidence / p_hypothesis,
    }
result = bayes_theorem(
    p_hypothesis=0.02,                     # P(AF) = 2%
    p_evidence_given_hypothesis=0.90,      # sensitivity
    p_evidence_given_not_hypothesis=0.10,  # 1 - specificity
)
print(f"Prior P(AF): {result['prior']:.3f}")
print(f"Posterior P(AF | irregular pulse): {result['posterior']:.3f}")
print(f"Update factor: {result['posterior_update_factor']:.1f}×")
# Prior 0.02 → Posterior 0.155 (≈7.8× update)
Sequential Updating
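One of Bayes' theorem's strengths is that beliefs can be updated as new evidence arrives: the posterior from one observation becomes the prior for the next. Chaining updates this way assumes each piece of evidence is conditionally independent of the others given H (and given not-H).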
def sequential_bayes_update(
    prior: float,
    evidence_sequence: list[dict],  # [{"likelihood": ..., "false_positive_rate": ...}]
) -> list[float]:
    """Update P(H) sequentially as each piece of evidence arrives."""
    posteriors = [prior]
    current = prior
    for ev in evidence_sequence:
        result = bayes_theorem(
            p_hypothesis=current,
            p_evidence_given_hypothesis=ev["likelihood"],
            p_evidence_given_not_hypothesis=ev["false_positive_rate"],
        )
        current = result["posterior"]
        posteriors.append(current)
    return posteriors
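An equivalent way to view this loop is in odds form: each piece of evidence multiplies the odds of H by its likelihood ratio, P(E | H) / P(E | ¬H). The sketch below is an illustration rather than part of the lesson's code. The helper name `posterior_from_likelihood_ratios` is made up here, and the calculation assumes the pieces of evidence are conditionally independent given H and given ¬H.

```python
import math


def posterior_from_likelihood_ratios(prior: float, likelihood_ratios: list[float]) -> float:
    """Posterior P(H | all evidence): posterior odds = prior odds x product of LRs."""
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * math.prod(likelihood_ratios)
    # Convert odds back to a probability
    return posterior_odds / (1 + posterior_odds)


# Likelihood ratios built from this lesson's pulse, ECG and echocardiogram figures
lrs = [0.90 / 0.10, 0.85 / 0.05, 0.95 / 0.02]
batch_posterior = posterior_from_likelihood_ratios(0.02, lrs)
print(f"P(AF | pulse, ECG, echo) = {batch_posterior:.4f}")
```

Because multiplication is commutative, the order in which the tests arrive does not change the final posterior under this independence assumption.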
# Clinical pathway: each test updates belief in diagnosis
priors_over_time = sequential_bayes_update(
    prior=0.02,  # initial P(AF) = 2%
    evidence_sequence=[
        {"likelihood": 0.90, "false_positive_rate": 0.10},  # irregular pulse
        {"likelihood": 0.85, "false_positive_rate": 0.05},  # ECG abnormal
        {"likelihood": 0.95, "false_positive_rate": 0.02},  # echocardiogram
    ],
)
labels = ["Initial", "After pulse", "After ECG", "After echo"]
for label, p in zip(labels, priors_over_time):
    print(f"{label}: P(AF) = {p:.4f} ({p*100:.1f}%)")
Bayes' Theorem and Machine Learning
Every Bayesian ML method uses this update:
Naive Bayes classifier:
P(class | features) ∝ P(features | class) × P(class)
Prior = class frequency in training data
Likelihood = product of feature probabilities given class
Posterior = class probability given this example
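To make the Naive Bayes mapping concrete, here is a minimal from-scratch sketch for binary (present/absent) features. The toy data set, the feature names, and the use of add-one (Laplace) smoothing are all assumptions made for illustration; in practice a classifier would normally come from a library.

```python
from collections import Counter, defaultdict

# Toy training data (made up): (set of features present, class label)
training = [
    ({"free", "winner"}, "spam"),
    ({"free", "meeting"}, "ham"),
    ({"winner", "prize"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"agenda"}, "ham"),
]
vocabulary = sorted({f for features, _ in training for f in features})

class_counts = Counter(label for _, label in training)
# feature_counts[label][feature] = number of examples of that class containing the feature
feature_counts = defaultdict(Counter)
for features, label in training:
    feature_counts[label].update(features)


def naive_bayes_posterior(features: set[str]) -> dict[str, float]:
    """P(class | features) for a Bernoulli Naive Bayes model with add-one smoothing."""
    unnormalised = {}
    for label, n_label in class_counts.items():
        score = n_label / len(training)  # prior: class frequency in training data
        for f in vocabulary:
            # Smoothed estimate of P(feature present | class)
            p_present = (feature_counts[label][f] + 1) / (n_label + 2)
            score *= p_present if f in features else (1 - p_present)
        unnormalised[label] = score  # likelihood x prior
    total = sum(unnormalised.values())  # P(features), the normalising constant
    return {label: s / total for label, s in unnormalised.items()}


posterior = naive_bayes_posterior({"free", "winner"})
print({label: round(p, 3) for label, p in posterior.items()})
```

The "naive" part is the inner loop: multiplying per-feature probabilities treats the features as conditionally independent given the class.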
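Bayesian hyperparameter optimisation:
Prior = surrogate model's belief about the objective (e.g. validation score) across hyperparameter values
Likelihood = how well that belief explains the scores observed at tested values
Posterior = updated belief about the objective → guides next evaluation point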
Bayesian neural networks:
Prior = distribution over network weights
Likelihood = training data fit given those weights
Posterior = weight distribution after training
Enables uncertainty quantification
Interview Answer
"Bayes' theorem states: P(H|E) = P(E|H) × P(H) / P(E). It updates a prior belief P(H) using observed evidence E: the posterior is proportional to the likelihood of observing E if H is true, times the prior. The classic trap is ignoring the prior — even with a 90% sensitive test, a positive result for a rare disease (2% prevalence) gives only ~15% posterior probability of disease, because most positives are false alarms. In ML: Naive Bayes directly applies this as P(class|features) ∝ P(features|class) × P(class); Bayesian optimisation uses it to update beliefs about which hyperparameters work best; and Bayesian neural networks place distributions over weights to quantify uncertainty."