Statistics & Math for AI/ML Interviews · Lesson 13 of 30
Conditional Probability
The Definition
P(A | B) = P(A ∩ B) / P(B) where P(B) > 0
"The probability of A, given that B has occurred"
= probability of both A and B happening
divided by probability of B happening
Geometric Intuition
Original sample space Ω (all 1000 patients):
┌─────────────────────────────────┐
│ AF (300) No AF (700) │
└─────────────────────────────────┘
Condition on Warfarin: we zoom in on the 350 warfarin patients only
┌─────────────────────────────────┐
│ AF (200) No AF (150) │ [warfarin patients only]
└─────────────────────────────────┘
P(AF | Warfarin) = 200/350 = 0.571
We "restricted" the sample space to B (warfarin patients),
then computed the fraction of those patients who have A (AF).
Computing from a Contingency Table
Python
import pandas as pd

# Counts table: rows are diagnoses, columns are treatment groups
table = pd.DataFrame(
    [[200, 100], [150, 550]],
    index=["AF", "No AF"],
    columns=["Warfarin", "No Warfarin"],
)
n = table.values.sum()  # 1000

print("Joint probabilities:")
joint = table / n
print(joint)

# Conditional: P(A | B) for each cell in a column
def conditional_given_column(joint_df: pd.DataFrame, given_col: str) -> pd.Series:
    """P(row | given_col) for each row."""
    col_sum = joint_df[given_col].sum()  # P(B) = marginal for that column
    return joint_df[given_col] / col_sum

print("\nP(diagnosis | Warfarin):")
print(conditional_given_column(joint, "Warfarin"))
# AF: 0.571, No AF: 0.429

# All conditionals at once
def all_conditionals_given_column(joint_df: pd.DataFrame) -> pd.DataFrame:
    return joint_df.div(joint_df.sum(axis=0), axis=1)  # divide each column by its sum

print("\nAll P(row | column):")
print(all_conditionals_given_column(joint))
The Multiplication Rule
From the definition of conditional probability:
P(A | B) = P(A ∩ B) / P(B)
Rearranging:
P(A ∩ B) = P(A | B) × P(B) [multiplication rule]
Also:
P(A ∩ B) = P(B | A) × P(A)
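As a quick numeric check, the sketch below recomputes P(AF ∩ Warfarin) both ways, using the counts from the 1000-patient contingency table earlier in this lesson. The variable names are illustrative only.

```python
import math

# Counts from the 1000-patient example
n_total = 1000
n_af_and_warfarin = 200
n_warfarin = 350  # 200 AF + 150 No AF
n_af = 300        # 200 Warfarin + 100 No Warfarin

p_joint = n_af_and_warfarin / n_total                 # P(AF ∩ Warfarin)
p_warfarin = n_warfarin / n_total                     # P(Warfarin)
p_af = n_af / n_total                                 # P(AF)
p_af_given_warfarin = n_af_and_warfarin / n_warfarin  # P(AF | Warfarin)
p_warfarin_given_af = n_af_and_warfarin / n_af        # P(Warfarin | AF)

via_b = p_af_given_warfarin * p_warfarin  # P(A | B) × P(B)
via_a = p_warfarin_given_af * p_af        # P(B | A) × P(A)

assert math.isclose(via_b, p_joint)
assert math.isclose(via_a, p_joint)
print(f"{via_b:.3f} {via_a:.3f} {p_joint:.3f}")  # 0.200 0.200 0.200
```

Note that P(AF | Warfarin) ≈ 0.571 while P(Warfarin | AF) ≈ 0.667: the two conditionals differ, yet both products recover the same joint probability.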
These two forms are equal → setting them equal gives Bayes' theorem:
P(A | B) × P(B) = P(B | A) × P(A)
P(A | B) = P(B | A) × P(A) / P(B)
Chain Rule for Sequences
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
More generally:
P(X₁, X₂, ..., Xₙ) = Π P(Xᵢ | X₁, ..., Xᵢ₋₁), with the product over i = 1, ..., n (for i = 1 the factor is simply P(X₁))
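The factorization can be verified numerically on a small example. The sketch below builds an arbitrary joint distribution over three binary variables (randomly generated, purely hypothetical), derives each conditional from it, and checks that their product reproduces the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P(A, B, C), indexed as joint[a, b, c]
joint = rng.random((2, 2, 2))
joint /= joint.sum()

p_a = joint.sum(axis=(1, 2))             # P(A), shape (2,)
p_ab = joint.sum(axis=2)                 # P(A, B), shape (2, 2)
p_b_given_a = p_ab / p_a[:, None]        # P(B | A)
p_c_given_ab = joint / p_ab[:, :, None]  # P(C | A, B)

# Chain rule: P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
rebuilt = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab
assert np.allclose(rebuilt, joint)
```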
This is how autoregressive language models work:
P("The cat sat") = P("The") × P("cat" | "The") × P("sat" | "The cat")
Python
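# Language model probability under chain rule
def sequence_probability(
    tokens: list[str],
    model,  # language model with predict_next_token(context) -> {token: prob}
) -> float:
    prob = 1.0
    # Start at i = 0 so the first factor, P(tokens[0]) with an empty context, is included.
    # This assumes the model accepts an empty context; if tokens[0] is a
    # beginning-of-sequence marker, start the loop at 1 instead.
    for i in range(len(tokens)):
        context = tokens[:i]
        next_token = tokens[i]
        # Fall back to a tiny probability for tokens the model does not return
        p_next = model.predict_next_token(context).get(next_token, 1e-10)
        prob *= p_next
    # For long sequences, sum log-probabilities instead to avoid floating-point underflow
    return prob
Conditional Independence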
A and B are conditionally independent given C if:
P(A | B, C) = P(A | C)
Equivalently: P(A, B | C) = P(A | C) × P(B | C)
In words: once you know C, knowing B adds no information about A.
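To make the definition concrete, the sketch below constructs a joint distribution in which A and B each depend only on C, then checks both conditions. All numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical model: C is a binary cause; A and B each depend only on C
p_c = np.array([0.7, 0.3])           # P(C)
p_a_given_c = np.array([[0.9, 0.1],  # row c: [P(A=0 | C=c), P(A=1 | C=c)]
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.8, 0.2],  # row c: [P(B=0 | C=c), P(B=1 | C=c)]
                        [0.3, 0.7]])

# joint[a, b, c] = P(C=c) × P(A=a | C=c) × P(B=b | C=c)
joint = np.einsum("c,ca,cb->abc", p_c, p_a_given_c, p_b_given_c)

# Given C, the joint over (A, B) factorizes
p_ab_given_c = joint / joint.sum(axis=(0, 1))  # P(A, B | C)
p_a_c = p_ab_given_c.sum(axis=1)               # P(A | C), indexed [a, c]
p_b_c = p_ab_given_c.sum(axis=0)               # P(B | C), indexed [b, c]
cond_indep = np.allclose(p_ab_given_c, p_a_c[:, None, :] * p_b_c[None, :, :])

# With C unknown, A and B are not independent
p_ab = joint.sum(axis=2)                       # P(A, B)
marg_indep = np.allclose(p_ab, np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0)))

print(cond_indep, marg_indep)  # True False
```

In this example, A and B are correlated only through their shared cause C, so conditioning on C removes the dependence.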
Example (Naive Bayes assumption):
Symptoms are conditionally independent given disease diagnosis
P(fever, cough | flu) = P(fever | flu) × P(cough | flu)
This is usually not exactly true (fever and cough are correlated)
but it's a useful approximation that makes computation tractable.
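A minimal sketch of how this factorization is used for inference follows. The prior and the per-symptom likelihoods are made-up numbers, not clinical estimates.

```python
# Hypothetical probabilities, chosen only for illustration
prior = {"flu": 0.1, "no_flu": 0.9}      # P(class)
p_fever = {"flu": 0.8, "no_flu": 0.1}    # P(fever | class)
p_cough = {"flu": 0.7, "no_flu": 0.2}    # P(cough | class)

# Naive Bayes: P(class | fever, cough) ∝ P(class) × P(fever | class) × P(cough | class)
unnorm = {c: prior[c] * p_fever[c] * p_cough[c] for c in prior}
evidence = sum(unnorm.values())          # P(fever, cough) under the naive assumption
posterior = {c: v / evidence for c, v in unnorm.items()}

print({c: round(p, 3) for c, p in posterior.items()})
# {'flu': 0.757, 'no_flu': 0.243}
```

Each symptom needs only its own per-class probability; without the independence assumption, a joint table over every combination of symptoms would be required for each class.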
Example (Markov property):
Future state depends only on current state, not history
P(Xₜ₊₁ | X₁, X₂, ..., Xₜ) = P(Xₜ₊₁ | Xₜ)
Used in: HMMs, reinforcement learning, time series models
Classifier Output as Conditional Probability
Python
import torch
import torch.nn as nn

# A classifier outputs an estimate of P(class | input features)
class SimpleClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.net(x)
        return torch.softmax(logits, dim=-1)  # P(class | x) for each class

    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.forward(x)

# The output satisfies:
# P(class=1 | x) + P(class=2 | x) + ... + P(class=K | x) = 1
# (conditional probability sums to 1 over mutually exclusive classes)

# Binary classifier threshold
def predict_with_threshold(model, x, threshold=0.5):
    # Column index 1 is the positive class when the two classes are labeled 0 and 1
    prob_class1 = model(x)[:, 1]  # P(class=1 | x)
    return (prob_class1 >= threshold).int()
Interview Answer
"P(A|B) = P(A,B)/P(B): restrict the sample space to B, then compute what fraction of those outcomes also satisfy A. The multiplication rule follows directly: P(A,B) = P(A|B)×P(B). The chain rule extends this to sequences: P(X₁,...,Xₙ) = Π P(Xᵢ|X₁,...,Xᵢ₋₁), which is the mathematical foundation of autoregressive language models. A softmax classifier outputs an estimate of the conditional distribution P(y|x). Conditional independence means P(A|B,C) = P(A|C): once C is known, B adds no information about A. Naive Bayes assumes this for the features given the class label, a simplification that is rarely exactly true but makes inference tractable."