Statistics & Math for AI/ML Interviews · Lesson 9 of 30
Probability Fundamentals
What Probability Measures
Probability quantifies uncertainty — the likelihood that an event occurs.
P(event) ∈ [0, 1]
P(E) = 0: event E is impossible
P(E) = 1: event E is certain
P(E) = 0.5: event E occurs in half of all cases (on average)
Key Concepts
Sample space (Ω):
The set of all possible outcomes of an experiment
Coin flip: Ω = {H, T}
Die roll: Ω = {1, 2, 3, 4, 5, 6}
Patient outcome: Ω = {recovered, died, transferred}
Event:
A subset of the sample space — a collection of outcomes
"Roll an even number": E = {2, 4, 6}
"Patient recovers": E = {recovered}
Complement:
Eᶜ = all outcomes NOT in E
P(Eᶜ) = 1 - P(E)
The Three Axioms of Probability
Axiom 1: Non-negativity
P(E) ≥ 0 for any event E
Axiom 2: Normalisation
P(Ω) = 1 (something must happen)
Axiom 3: Additivity (for mutually exclusive events)
If A ∩ B = ∅ (A and B can't both happen):
P(A ∪ B) = P(A) + P(B)
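As a brief illustration of how other rules follow from the axioms, the complement rule can be derived from Axioms 2 and 3 alone. The sketch below uses only the symbols already defined (E, Eᶜ, Ω):

```latex
\begin{align*}
E \cap E^{c} &= \emptyset, \qquad E \cup E^{c} = \Omega \\
P(E) + P(E^{c}) &= P(E \cup E^{c}) && \text{(Axiom 3)} \\
&= P(\Omega) = 1 && \text{(Axiom 2)} \\
\Rightarrow\; P(E^{c}) &= 1 - P(E)
\end{align*}
```

Combined with Axiom 1 applied to Eᶜ, this also gives P(E) ≤ 1, which is why probabilities lie in [0, 1].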
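Everything else in probability theory follows from these three axioms. (In the formal Kolmogorov statement, Axiom 3 extends to any countable collection of pairwise mutually exclusive events.)
Core Probability Rules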
Addition rule (general):
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Subtract P(A ∩ B) to avoid double-counting the overlap
For mutually exclusive events (P(A ∩ B) = 0):
P(A ∪ B) = P(A) + P(B)
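The general addition rule can be obtained from Axiom 3 by splitting A ∪ B into disjoint pieces. A sketch of the argument, where B \ A (outcomes in B but not in A) is notation introduced here for illustration:

```latex
\begin{align*}
A \cup B &= A \cup (B \setminus A), & A \cap (B \setminus A) &= \emptyset \\
B &= (A \cap B) \cup (B \setminus A), & (A \cap B) \cap (B \setminus A) &= \emptyset \\
P(B \setminus A) &= P(B) - P(A \cap B) \\
P(A \cup B) &= P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)
\end{align*}
```

The third line applies Axiom 3 to the second decomposition; the fourth applies it to the first.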
Complement rule:
P(Eᶜ) = 1 - P(E)
Often easier to compute probability of NOT-E
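A small sketch of the complement shortcut, assuming a fair die rolled four times with all ordered outcomes equally likely. It computes P(at least one six) both directly and as 1 - P(no six); the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of four rolls of a fair die (6**4 = 1296)
n_rolls = 4
omega = list(product(range(1, 7), repeat=n_rolls))

# Direct route: count outcomes containing at least one six
direct = Fraction(sum(1 for rolls in omega if 6 in rolls), len(omega))

# Complement route: count the simpler event "no six at all", then subtract from 1
p_no_six = Fraction(sum(1 for rolls in omega if 6 not in rolls), len(omega))
via_complement = 1 - p_no_six

print(direct, via_complement)  # both 671/1296
```

The "no six" event has a simple closed form (5⁴ of the 6⁴ outcomes), whereas "at least one six" mixes many cases, which is why the complement is usually the easier route by hand.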
Multiplication rule (for independent events):
P(A ∩ B) = P(A) × P(B) if A and B are independent
P(two coin flips both heads) = 0.5 × 0.5 = 0.25
Python: Computing Basic Probabilities
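The multiplication rule applies only when the events are independent, which for a small sample space can be checked by enumeration. A minimal sketch, assuming equally likely outcomes on a single die; the helper names `prob` and `is_independent` and the event `at_least_4` are illustrative, not part of the lesson's code.

```python
from fractions import Fraction

def prob(event: set, sample_space: set) -> Fraction:
    """Exact probability under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

def is_independent(a: set, b: set, sample_space: set) -> bool:
    """True when P(A ∩ B) equals P(A) × P(B)."""
    return prob(a & b, sample_space) == prob(a, sample_space) * prob(b, sample_space)

omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_4 = {5, 6}
at_least_4 = {4, 5, 6}

# P(even ∩ >4) = 1/6 and P(even) × P(>4) = 1/2 × 1/3 = 1/6
print(is_independent(even, greater_than_4, omega))  # True
# P(even ∩ ≥4) = 1/3 but P(even) × P(≥4) = 1/2 × 1/2 = 1/4
print(is_independent(even, at_least_4, omega))      # False
```

Using `Fraction` keeps the comparison exact; with floats, an equality test like this could fail from rounding alone.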
Python
import numpy as np

# Discrete uniform probability: every outcome in the sample space is equally likely
def probability(event: set, sample_space: set) -> float:
    return len(event) / len(sample_space)

# Die roll examples
Omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_4 = {5, 6}
p_even = probability(even, Omega)  # 0.5
p_gt4 = probability(greater_than_4, Omega)  # 0.333

# Union
even_or_gt4 = even | greater_than_4  # {2, 4, 5, 6}
p_union = probability(even_or_gt4, Omega)  # 0.667

# Intersection
even_and_gt4 = even & greater_than_4  # {6}
p_intersect = probability(even_and_gt4, Omega)  # 0.167
# Verify the addition rule: P(A) + P(B) - P(A∩B) = 0.5 + 0.333 - 0.167 = 0.667

# Complement
not_even = Omega - even  # {1, 3, 5}
p_not_even = probability(not_even, Omega)  # 0.5 = 1 - 0.5 ✓

# Simulation (when analytical computation is hard)
def simulate_probability(event_fn, n_trials: int = 100_000) -> float:
    """Estimate P(event) as the fraction of trials in which event_fn() is True."""
    outcomes = [event_fn() for _ in range(n_trials)]
    return sum(outcomes) / n_trials

# P(sum of two dice > 9)
def two_dice_sum_gt9():
    # randint's upper bound is exclusive, so (1, 7) draws from 1..6
    return (np.random.randint(1, 7) + np.random.randint(1, 7)) > 9

estimated = simulate_probability(two_dice_sum_gt9)
analytical = 6/36  # {(4,6),(5,5),(5,6),(6,4),(6,5),(6,6)} = 6 outcomes
print(f"Simulated: {estimated:.4f}, Analytical: {analytical:.4f}")
Interpretations of Probability
Frequentist:
P(E) = long-run frequency of E in infinitely many repetitions
"This drug works 73% of the time" = in 100 patients, ~73 would respond
Requires repeatable experiments
p-values, confidence intervals use this interpretation
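The long-run-frequency idea can be illustrated by simulation. The sketch below assumes a hypothetical treatment whose true response probability is 0.73 and tracks the observed response rate as the number of simulated patients grows; `running_frequency` is an illustrative helper.

```python
import random

def running_frequency(p: float, n_trials: int, seed: int = 0) -> list[float]:
    """Relative frequency of successes after each of n_trials Bernoulli(p) trials."""
    rng = random.Random(seed)
    successes = 0
    freqs = []
    for i in range(1, n_trials + 1):
        successes += rng.random() < p  # True counts as 1
        freqs.append(successes / i)
    return freqs

freqs = running_frequency(p=0.73, n_trials=100_000)
for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>7}: frequency = {freqs[n - 1]:.4f}")
```

Early frequencies can sit well away from 0.73; the estimate tightens only as the number of trials grows, which is why the frequentist definition refers to the long run.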
Bayesian:
P(E) = degree of belief that E is true
"I believe there's a 60% chance this patient has sepsis"
Can be assigned to one-off events
Updates with new evidence via Bayes' theorem
Posterior distributions, credible intervals use this interpretation
In Machine Learning
Python
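# Model output probabilities: P(class = 1 | features)
# This is a conditional probability (covered next)
# Calibration: are predicted probabilities meaningful?
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.calibration import calibration_curve

# Placeholder data so the snippet runs (an assumption, not real model output);
# in practice use held-out labels and the model's predicted P(class = 1 | features).
rng = np.random.default_rng(0)
y_pred_proba = rng.uniform(0, 1, size=1_000)
y_true = rng.binomial(1, y_pred_proba)  # calibrated by construction

# Compute and plot the calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, y_pred_proba, n_bins=10
)
plt.plot(mean_predicted_value, fraction_of_positives, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")  # diagonal = perfect calibration
# Perfect calibration: fraction_of_positives == mean_predicted_value
# If model predicts 0.7, 70% of those examples should be positive

# Density estimation with a Gaussian KDE (a smooth alternative to a histogram)
def estimate_probability_from_data(data: list, value: float, bandwidth: float) -> float:
    """Non-parametric probability density estimate at `value`.

    Returns a density, not a probability; it can exceed 1.
    """
    kde = gaussian_kde(data, bw_method=bandwidth)
    return float(kde(value)[0])
Interview Answer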
"Probability is built on three axioms: non-negativity (P ≥ 0), normalisation (P(Ω) = 1), and additivity for mutually exclusive events. From these, the core rules follow: addition rule P(A∪B) = P(A) + P(B) - P(A∩B), complement rule P(Eᶜ) = 1 - P(E), and multiplication rule for independent events. In ML, these rules appear everywhere: model output probabilities must sum to 1 across classes (softmax ensures this), probability calibration ensures predicted probabilities match empirical frequencies, and all Bayesian reasoning is built on these axioms via Bayes' theorem."
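The softmax claim in the answer can be checked directly. A minimal sketch in plain Python, with hypothetical logits for three classes, showing that the outputs satisfy non-negativity and normalisation:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Map real-valued scores to probabilities that are non-negative and sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -0.5]  # hypothetical raw scores for three classes
probs = softmax(logits)

print(all(p >= 0 for p in probs))     # True
print(math.isclose(sum(probs), 1.0))  # True
```

Softmax guarantees that the outputs form a valid probability distribution; it does not by itself guarantee that they are calibrated.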