Conditional Probability
Conditional probability in depth: the definition, computing it from tables, the derivation of Bayes' theorem, and applications in ML classifiers.
The Definition
P(A | B) = P(A ∩ B) / P(B), where P(B) > 0
"The probability of A, given that B has occurred"
= probability of both A and B happening
divided by probability of B happening
Geometric Intuition
Original sample space Ω (all 1000 patients):
┌─────────────────────────────────┐
│  AF (300)          No AF (700)  │
└─────────────────────────────────┘
Condition on Warfarin: we zoom in on the 350 warfarin patients only
┌──────────┬─────────────┐
│ AF (200) │ No AF (150) │   [warfarin patients only]
└──────────┴─────────────┘
P(AF | Warfarin) = 200/350 = 0.571
We "restricted" the sample space to B (warfarin patients),
then computed what fraction of those have A (AF).
Computing from a Contingency Table
import pandas as pd

# Counts table
table = pd.DataFrame(
    [[200, 100], [150, 550]],
    index=["AF", "No AF"],
    columns=["Warfarin", "No Warfarin"],
)
n = table.values.sum()  # 1000

print("Joint probabilities:")
joint = table / n
print(joint)


# Conditional: P(A | B) for each cell in a column
def conditional_given_column(joint_df: pd.DataFrame, given_col: str) -> pd.Series:
    """P(row | given_col) for each row."""
    col_sum = joint_df[given_col].sum()  # P(B) = marginal for that column
    return joint_df[given_col] / col_sum


print("\nP(diagnosis | Warfarin):")
print(conditional_given_column(joint, "Warfarin"))
# AF: 0.571, No AF: 0.429


# All conditionals at once
def all_conditionals_given_column(joint_df: pd.DataFrame) -> pd.DataFrame:
    return joint_df.div(joint_df.sum(axis=0), axis=1)  # divide each column by its sum


print("\nAll P(row | column):")
print(all_conditionals_given_column(joint))
The Multiplication Rule
From the definition of conditional probability:
P(A | B) = P(A ∩ B) / P(B)
Rearranging:
P(A ∩ B) = P(A | B) × P(B) [multiplication rule]
Also:
P(A ∩ B) = P(B | A) × P(A)
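Both factorizations can be checked against the AF / warfarin counts used earlier in this article. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of any library.

```python
# Counts from the AF / warfarin example (1000 patients in total).
n_total = 1000
n_af_and_warfarin = 200   # A and B together
n_af = 300                # A: patients with AF
n_warfarin = 350          # B: patients on warfarin

p_joint = n_af_and_warfarin / n_total          # P(A and B), read directly from the counts
p_a = n_af / n_total                           # P(A)
p_b = n_warfarin / n_total                     # P(B)
p_a_given_b = n_af_and_warfarin / n_warfarin   # P(A | B)
p_b_given_a = n_af_and_warfarin / n_af         # P(B | A)

# Two routes to the same joint probability.
via_b = p_a_given_b * p_b   # P(A | B) x P(B)
via_a = p_b_given_a * p_a   # P(B | A) x P(A)

print(f"P(A and B) from counts : {p_joint:.3f}")
print(f"P(A | B) x P(B)        : {via_b:.3f}")
print(f"P(B | A) x P(A)        : {via_a:.3f}")
```

All three printed values should agree up to floating-point rounding, since each is the same count ratio written a different way.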
These two forms are equal; setting them equal gives Bayes' theorem:
P(A | B) × P(B) = P(B | A) × P(A)
P(A | B) = P(B | A) × P(A) / P(B)
Chain Rule for Sequences
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
More generally:
P(X₁, X₂, ..., Xₙ) = Π P(Xᵢ | X₁, ..., Xᵢ₋₁)
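The three-variable form can be checked numerically. The sketch below builds an arbitrary, randomly generated joint distribution over three binary variables (purely hypothetical data) and confirms that the product of the three factors reproduces every joint probability. The helper `marginal` is written for this example and is not a library function.

```python
import itertools
import random

random.seed(0)

# Hypothetical joint distribution over three binary variables (A, B, C):
# random positive weights, normalised so they sum to 1.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() + 0.1 for _ in outcomes]
total = sum(weights)
joint = {abc: w / total for abc, w in zip(outcomes, weights)}


def marginal(fixed: dict[int, int]) -> float:
    """Sum the joint over every outcome that agrees with `fixed` (position -> value)."""
    return sum(
        p for abc, p in joint.items()
        if all(abc[pos] == val for pos, val in fixed.items())
    )


max_error = 0.0
for a, b, c in outcomes:
    p_a = marginal({0: a})                                     # P(A)
    p_b_given_a = marginal({0: a, 1: b}) / p_a                 # P(B | A)
    p_c_given_ab = joint[(a, b, c)] / marginal({0: a, 1: b})   # P(C | A, B)
    product = p_a * p_b_given_a * p_c_given_ab
    max_error = max(max_error, abs(product - joint[(a, b, c)]))

print(f"largest |product - joint| over all 8 outcomes: {max_error:.2e}")
```

The identity holds for any joint distribution with nonzero conditioning events, so the reported error should be at the level of floating-point rounding.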
The chain rule is how autoregressive language models assign a probability to a sequence:
P("The cat sat") = P("The") × P("cat" | "The") × P("sat" | "The cat")
# Language model probability under chain rule
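def sequence_probability(
    tokens: list[str],
    model,  # language model with predict_next_token(context) -> {token: prob}
) -> float:
    prob = 1.0
    # Start at i = 0 so the first factor, P(tokens[0]), is included.
    # This assumes the model accepts an empty context for the first token.
    for i in range(len(tokens)):
        context = tokens[:i]
        next_token = tokens[i]
        # Floor missing tokens at 1e-10 so a single miss does not zero the product
        p_next = model.predict_next_token(context).get(next_token, 1e-10)
        prob *= p_next
    return prob
Conditional Independence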
A and B are conditionally independent given C if:
P(A | B, C) = P(A | C)
Equivalently: P(A, B | C) = P(A | C) × P(B | C)
In words: once you know C, knowing B adds no information about A.
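A small numerical sketch can make the definition concrete. The probabilities below are made up for illustration: C is a binary variable, and A and B each depend only on C, so they are conditionally independent given C by construction. The helpers `bernoulli` and `prob` are written for this example only.

```python
import itertools

# Hypothetical numbers: C is a binary "cause"; A and B each depend only on C.
p_c = {0: 0.5, 1: 0.5}
p_a1_given_c = {0: 0.1, 1: 0.8}   # P(A=1 | C=c)
p_b1_given_c = {0: 0.2, 1: 0.9}   # P(B=1 | C=c)


def bernoulli(p_one: float, value: int) -> float:
    """Probability of `value` (0 or 1) when P(value = 1) is `p_one`."""
    return p_one if value == 1 else 1.0 - p_one


# Joint built as P(c) x P(a | c) x P(b | c).
joint = {
    (a, b, c): p_c[c] * bernoulli(p_a1_given_c[c], a) * bernoulli(p_b1_given_c[c], b)
    for a, b, c in itertools.product([0, 1], repeat=3)
}


def prob(**fixed: int) -> float:
    """Probability that the named variables (a, b, c) take the given values."""
    names = ("a", "b", "c")
    return sum(
        p for abc, p in joint.items()
        if all(abc[names.index(k)] == v for k, v in fixed.items())
    )


for c in (0, 1):
    lhs = prob(a=1, b=1, c=c) / prob(b=1, c=c)   # P(A=1 | B=1, C=c)
    rhs = prob(a=1, c=c) / prob(c=c)             # P(A=1 | C=c)
    print(f"C={c}: P(A=1 | B=1, C) = {lhs:.3f}, P(A=1 | C) = {rhs:.3f}")

# Without conditioning on C, B does carry information about A.
p_a1 = prob(a=1)
p_a1_given_b1 = prob(a=1, b=1) / prob(b=1)
print(f"P(A=1) = {p_a1:.3f}, P(A=1 | B=1) = {p_a1_given_b1:.3f}")
```

The last line illustrates that conditional independence given C does not imply ordinary independence: in this setup, observing B shifts the probability of A because both are driven by C.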
Example (Naive Bayes assumption):
Symptoms are conditionally independent given disease diagnosis
P(fever, cough | flu) = P(fever | flu) × P(cough | flu)
This is usually not exactly true (fever and cough can remain correlated even among flu patients)
but it's a useful approximation that makes computation tractable.
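To show how the factorization is used, here is a minimal Naive Bayes sketch for the flu example. Every number in it is invented for illustration (not a clinical estimate), the two-class setup ("flu" / "no flu") is an assumption, and the function `naive_bayes_posterior` is a hypothetical helper rather than a library API.

```python
# All numbers below are made up for illustration; they are not clinical estimates.
prior = {"flu": 0.1, "no flu": 0.9}                  # P(class)
likelihood = {                                       # P(symptom present | class)
    "flu":    {"fever": 0.9, "cough": 0.8},
    "no flu": {"fever": 0.1, "cough": 0.2},
}


def naive_bayes_posterior(observed: list[str]) -> dict[str, float]:
    """P(class | observed symptoms), assuming symptoms are independent given the class.

    Only symptoms listed in `observed` are used; absent symptoms are ignored
    in this simplified sketch.
    """
    unnormalised = {}
    for label, p_label in prior.items():
        score = p_label
        for symptom in observed:
            score *= likelihood[label][symptom]      # factorised likelihood
        unnormalised[label] = score
    evidence = sum(unnormalised.values())            # P(observed symptoms)
    return {label: score / evidence for label, score in unnormalised.items()}


posterior = naive_bayes_posterior(["fever", "cough"])
for label, p in posterior.items():
    print(f"P({label} | fever, cough) = {p:.3f}")
```

The tractability gain is that each class needs one probability per symptom, rather than a separate probability for every combination of symptoms.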
Example (Markov property):
Future state depends only on current state, not history
P(Xₜ₊₁ | X₁, X₂, ..., Xₜ) = P(Xₜ₊₁ | Xₜ)
Used in: HMMs, reinforcement learning, time series models
Classifier Output as Conditional Probability
import torch
import torch.nn as nn


# A classifier outputs an estimate of P(class | input features)
class SimpleClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.net(x)
        # For training with nn.CrossEntropyLoss, pass the raw logits instead:
        # that loss applies log-softmax internally.
        return torch.softmax(logits, dim=-1)  # P(class | x) for each class

    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.forward(x)


# The output satisfies:
# P(class=1 | x) + P(class=2 | x) + ... + P(class=K | x) = 1
# (conditional probability sums to 1 over mutually exclusive, exhaustive classes)


# Binary classifier threshold
def predict_with_threshold(model, x, threshold=0.5):
    prob_class1 = model(x)[:, 1]  # column index 1 holds P(class=1 | x)
    return (prob_class1 >= threshold).int()
Interview Answer
"P(A|B) = P(A,B)/P(B): restrict the sample space to B, then compute what fraction of those outcomes satisfy A. The multiplication rule follows directly: P(A,B) = P(A|B)×P(B). The chain rule extends this to sequences: P(X₁,...,Xₙ) = Π P(Xᵢ|X₁,...,Xᵢ₋₁), which is the mathematical foundation of autoregressive language models. A softmax classifier's output is an estimate of the conditional probability P(y|x). Conditional independence (P(A|B,C) = P(A|C)) underlies the Naive Bayes assumption: given the class label, features are independent of each other, a simplification that makes inference tractable."