Learnixo

Statistics & Math for AI/ML Interviews · Lesson 9 of 30

Probability Fundamentals

What Probability Measures

Probability quantifies uncertainty — the likelihood that an event occurs.

P(event) ∈ [0, 1]

P(E) = 0: event E is impossible (for discrete outcomes; a continuous variable can take a value whose probability is 0)
P(E) = 1: event E is certain
P(E) = 0.5: event E occurs in half of all cases (on average)

Key Concepts

Sample space (Ω):
  The set of all possible outcomes of an experiment
  
  Coin flip: Ω = {H, T}
  Die roll: Ω = {1, 2, 3, 4, 5, 6}
  Patient outcome: Ω = {recovered, died, transferred}

Event:
  A subset of the sample space — a collection of outcomes
  
  "Roll an even number": E = {2, 4, 6}
  "Patient recovers": E = {recovered}

Complement:
  Eᶜ = all outcomes NOT in E
  P(Eᶜ) = 1 - P(E)

The Three Axioms of Probability

Axiom 1: Non-negativity
  P(E) ≥ 0 for any event E

Axiom 2: Normalisation
  P(Ω) = 1 (something must happen)

Axiom 3: Additivity (for mutually exclusive events)
  If A ∩ B = ∅ (A and B can't both happen):
  P(A ∪ B) = P(A) + P(B)
  The same holds for any countable collection of pairwise disjoint events

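Everything else in probability theory follows from these three axioms. For example, E and Eᶜ are mutually exclusive and E ∪ Eᶜ = Ω, so P(E) + P(Eᶜ) = P(Ω) = 1, which gives the complement rule.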
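
To make the axioms concrete, here is a minimal sketch that checks them on a finite sample space. The distribution is stored as a dictionary from outcomes to probabilities; the names `pmf`, `prob`, and `satisfies_axioms` and the fair-die example are illustrative assumptions, not part of the lesson's own code.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die as an outcome -> probability map
pmf = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def prob(event: set, pmf: dict) -> Fraction:
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum((pmf[o] for o in event), Fraction(0))

def satisfies_axioms(pmf: dict) -> bool:
    """Check non-negativity (Axiom 1) and normalisation (Axiom 2)."""
    non_negative = all(p >= 0 for p in pmf.values())
    normalised = sum(pmf.values()) == 1
    return non_negative and normalised

axioms_ok = satisfies_axioms(pmf)

# Axiom 3: probabilities of disjoint events add
A, B = {1, 2}, {5}
additive = prob(A | B, pmf) == prob(A, pmf) + prob(B, pmf)

# A derived rule: the complement of A has probability 1 - P(A)
omega = set(pmf)
complement_ok = prob(omega - A, pmf) == 1 - prob(A, pmf)

print(axioms_ok, additive, complement_ok)
```

Using `Fraction` keeps the arithmetic exact, so the equality checks are not affected by floating-point rounding.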

Core Probability Rules

Addition rule (general):
  P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  
  Subtract P(A ∩ B) to avoid double-counting the overlap
  
  For mutually exclusive events (P(A ∩ B) = 0):
  P(A ∪ B) = P(A) + P(B)

Complement rule:
  P(Eᶜ) = 1 - P(E)
  
  Often it is easier to compute P(Eᶜ) and subtract it from 1, especially for "at least one" events

Multiplication rule (for independent events):
  P(A ∩ B) = P(A) × P(B)  if A and B are independent
  
  P(two coin flips both heads) = 0.5 × 0.5 = 0.25
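
The multiplication rule can be checked by enumerating a small sample space. The sketch below assumes two fair dice. The events `A`, `B`, and `C` are chosen here for illustration: two of them are independent and two are not. The last lines combine the complement and multiplication rules for an "at least one" question.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely ordered pairs
omega = set(product(range(1, 7), repeat=2))

def prob(event: set) -> Fraction:
    """Exact probability under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# A: first die is even; B: second die shows 5 or 6
A = {(d1, d2) for (d1, d2) in omega if d1 % 2 == 0}
B = {(d1, d2) for (d1, d2) in omega if d2 > 4}
p_a, p_b, p_ab = prob(A), prob(B), prob(A & B)
independent = p_ab == p_a * p_b  # the product rule holds for A and B

# C: the sum is 12. C depends on the first die, so the product rule fails
C = {(d1, d2) for (d1, d2) in omega if d1 + d2 == 12}
dependent = prob(A & C) != prob(A) * prob(C)

# Complement + multiplication: P(at least one six in four independent rolls)
p_at_least_one_six = 1 - Fraction(5, 6) ** 4

print(p_a, p_b, p_ab, independent, dependent, p_at_least_one_six)
```

The contrast with `C` is the point: multiplying probabilities is only valid when the events are independent.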

Python: Computing Basic Probabilities

Python
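# Discrete uniform probability: every outcome is assumed equally likely
def probability(event: set, sample_space: set) -> float:
    """P(event) = |event| / |sample space| for equally likely outcomes."""
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return len(event) / len(sample_space)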

# Die roll examples
Omega = {1, 2, 3, 4, 5, 6}
even  = {2, 4, 6}
greater_than_4 = {5, 6}

p_even = probability(even, Omega)          # 0.5
p_gt4  = probability(greater_than_4, Omega) # 0.333

# Union
even_or_gt4 = even | greater_than_4         # {2, 4, 5, 6}
p_union = probability(even_or_gt4, Omega)   # 0.667
# Verify: P(A) + P(B) - P(A∩B) = 0.5 + 0.333 - 0.167 = 0.667

# Intersection
even_and_gt4 = even & greater_than_4        # {6}
p_intersect = probability(even_and_gt4, Omega)  # 0.167

# Complement
not_even = Omega - even                      # {1, 3, 5}
p_not_even = probability(not_even, Omega)   # 0.5 = 1 - 0.5 


# Simulation (when analytical computation is hard)
import numpy as np

def simulate_probability(event_fn, n_trials: int = 100_000) -> float:
    """Estimate P(event) by simulation."""
    outcomes = [event_fn() for _ in range(n_trials)]
    return sum(outcomes) / n_trials

# P(sum of two dice > 9)
def two_dice_sum_gt9():
    return (np.random.randint(1, 7) + np.random.randint(1, 7)) > 9

estimated = simulate_probability(two_dice_sum_gt9)
analytical = 6/36  # {(4,6),(5,5),(5,6),(6,4),(6,5),(6,6)} = 6 outcomes
print(f"Simulated: {estimated:.4f}, Analytical: {analytical:.4f}")
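
For a sample space this small, the analytical value can also be confirmed by exact enumeration rather than by hand-counting. The sketch below lists all 36 ordered outcomes. The variable names are illustrative, and the standard-error line uses the usual binomial formula, assuming independent trials.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice
rolls = list(product(range(1, 7), repeat=2))
favourable = [(a, b) for (a, b) in rolls if a + b > 9]

exact = Fraction(len(favourable), len(rolls))
print(sorted(favourable))
print(f"Exact: {exact} = {float(exact):.4f}")

# Typical size of the simulation error for n independent trials
n_trials = 100_000
p = float(exact)
std_error = (p * (1 - p) / n_trials) ** 0.5
print(f"Standard error of the estimate at n={n_trials}: {std_error:.4f}")
```

A simulated estimate that lands within a few standard errors of the exact value is behaving as expected.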

Interpretations of Probability

Frequentist:
  P(E) = long-run relative frequency of E over many repetitions of the experiment
  "This drug works 73% of the time" = across many similar patients, about 73 in every 100 respond
  
  Requires repeatable experiments
  p-values, confidence intervals use this interpretation

Bayesian:
  P(E) = degree of belief that E is true
  "I believe there's a 60% chance this patient has sepsis"
  
  Can be assigned to one-off events
  Updates with new evidence via Bayes' theorem
  Posterior distributions, credible intervals use this interpretation
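
The two interpretations can be contrasted in a few lines. This is a rough sketch under stated assumptions: the 0.73 response rate is borrowed from the drug example, the random seed is arbitrary, and the uniform Beta(1, 1) prior with 73 responders out of 100 is a hypothetical choice made only to show how a belief is updated.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_rate = 0.73  # assumed response rate, taken from the drug example

# Frequentist view: the relative frequency settles down as trials accumulate
responses = rng.random(100_000) < true_rate
running_freq = np.cumsum(responses) / np.arange(1, responses.size + 1)
for n in (10, 1_000, 100_000):
    print(f"n={n:>7}: relative frequency = {running_freq[n - 1]:.3f}")

# Bayesian view: start from a uniform Beta(1, 1) prior belief about the rate,
# then update it after observing 73 responders out of 100 patients
alpha_prior, beta_prior = 1, 1
successes, failures = 73, 27
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"Posterior mean belief about the rate: {posterior_mean:.3f}")
```

The frequentist estimate only makes sense with repeated trials, while the Bayesian update yields a belief even after a single batch of evidence.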

In Machine Learning

Python
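# Model output probabilities: P(class = 1 | features)
# This is a conditional probability (covered next)

# Calibration: are predicted probabilities meaningful?
import numpy as np
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# Synthetic stand-ins (not real data): replace with your own labels and
# predicted P(class = 1), e.g. model.predict_proba(X)[:, 1]
rng = np.random.default_rng(0)
y_pred_proba = rng.uniform(0, 1, size=5_000)
y_true = (rng.uniform(0, 1, size=5_000) < y_pred_proba).astype(int)

# Compute and plot the calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, y_pred_proba, n_bins=10
)
plt.plot(mean_predicted_value, fraction_of_positives, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
# Perfect calibration: fraction_of_positives ≈ mean_predicted_value in every bin
# If model predicts 0.7, about 70% of those examples should be positive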

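# Density estimation from data (kernel density estimate, a smooth alternative to a histogram)
from scipy.stats import gaussian_kde

def estimate_probability_from_data(data: list, value: float, bandwidth: float) -> float:
    """Non-parametric probability density estimate at `value`.

    Returns a density, not a probability: it can exceed 1, and probabilities
    come from integrating it over an interval. A scalar `bandwidth` is passed
    to SciPy as the KDE factor, which scales the sample standard deviation.
    """
    kde = gaussian_kde(data, bw_method=bandwidth)
    return float(kde(value)[0])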
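
Model output probabilities for a multi-class classifier form a distribution over classes, so they should satisfy the non-negativity and normalisation axioms. The sketch below shows one common way to achieve that, a softmax over raw scores. The `softmax` helper and the example logits are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map real-valued scores to probabilities along the last axis."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical scores for two examples over three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs)
print("Row sums:", probs.sum(axis=-1))        # each row sums to 1 (Axiom 2)
print("All non-negative:", bool((probs >= 0).all()))  # Axiom 1
```

Equal scores, as in the second row, give a uniform distribution over the classes.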

Interview Answer

"Probability is built on three axioms: non-negativity (P ≥ 0), normalisation (P(Ω) = 1), and additivity for mutually exclusive events. From these, the core rules follow: addition rule P(A∪B) = P(A) + P(B) - P(A∩B), complement rule P(Eᶜ) = 1 - P(E), and multiplication rule for independent events. In ML, these rules appear everywhere: model output probabilities must sum to 1 across classes (softmax ensures this), probability calibration ensures predicted probabilities match empirical frequencies, and all Bayesian reasoning is built on these axioms via Bayes' theorem."