Conditional Probability

The Definition

P(A | B) = P(A ∩ B) / P(B)    where P(B) > 0

"The probability of A, given that B has occurred"
= probability of both A and B happening
  divided by probability of B happening

Geometric Intuition

Original sample space Ω (all 1000 patients):
  ┌─────────────────────────────────┐
  │  AF (300)      No AF (700)      │
  └─────────────────────────────────┘

Condition on Warfarin: we zoom in on the 350 warfarin patients only
  ┌──────────────────────────┐
  │  AF (200)   No AF (150)  │   [warfarin patients only]
  └──────────────────────────┘

P(AF | Warfarin) = 200/350 = 0.571

We "restricted" the sample space to B (warfarin patients),
then computed how many of those have A (AF).
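
The "zoom in" step can be made concrete by filtering records. The sketch below is only an illustration: it rebuilds one record per patient from the counts used in this lesson (the record layout and variable names are assumptions, not part of the original example), restricts to warfarin patients, and then counts AF within that subset.

```python
# Counts from this lesson: (has_af, on_warfarin) -> number of patients
counts = {
    (True, True): 200,
    (True, False): 100,
    (False, True): 150,
    (False, False): 550,
}

# Expand the counts into one hypothetical record per patient (1000 in total).
patients = [
    {"af": af, "warfarin": warfarin}
    for (af, warfarin), k in counts.items()
    for _ in range(k)
]

# Step 1: restrict the sample space to B (warfarin patients only).
warfarin_patients = [p for p in patients if p["warfarin"]]

# Step 2: inside the restricted space, keep those that also satisfy A (AF).
af_and_warfarin = [p for p in warfarin_patients if p["af"]]

p_af_given_warfarin = len(af_and_warfarin) / len(warfarin_patients)
print(f"P(AF | Warfarin) = {p_af_given_warfarin:.3f}")

# The definition P(A ∩ B) / P(B) on the full sample space gives the same value,
# because the total count n cancels in the ratio.
n = len(patients)
p_joint = len(af_and_warfarin) / n
p_b = len(warfarin_patients) / n
p_from_definition = p_joint / p_b
print(f"P(A ∩ B) / P(B)  = {p_from_definition:.3f}")
```

Both routes give 200/350: filtering first and dividing by the subset size is the same computation as dividing the joint probability by P(B).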

Computing from a Contingency Table

Python
import pandas as pd

# Counts table
table = pd.DataFrame(
    [[200, 100], [150, 550]],
    index=["AF", "No AF"],
    columns=["Warfarin", "No Warfarin"],
)
n = table.values.sum()  # 1000

print("Joint probabilities:")
joint = table / n
print(joint)

# Conditional: P(A | B) for each cell in a column
def conditional_given_column(joint_df: pd.DataFrame, given_col: str) -> pd.Series:
    """P(row | given_col) for each row."""
    col_sum = joint_df[given_col].sum()  # P(B) = marginal for that column
    return joint_df[given_col] / col_sum

print("\nP(diagnosis | Warfarin):")
print(conditional_given_column(joint, "Warfarin"))
# AF: 0.571, No AF: 0.429

# All conditionals at once
def all_conditionals_given_column(joint_df: pd.DataFrame) -> pd.DataFrame:
    return joint_df.div(joint_df.sum(axis=0), axis=1)  # divide each column by its sum

print("\nAll P(row | column):")
print(all_conditionals_given_column(joint))

The Multiplication Rule

From the definition of conditional probability:

P(A | B) = P(A ∩ B) / P(B)

Rearranging:
P(A ∩ B) = P(A | B) × P(B)    [multiplication rule]

Also:
P(A ∩ B) = P(B | A) × P(A)

These two forms are equal → setting them equal gives Bayes' theorem:
P(A | B) × P(B) = P(B | A) × P(A)
P(A | B) = P(B | A) × P(A) / P(B)
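
As a numerical check, the identities above can be verified with the AF/warfarin counts from this lesson. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import math

# Counts from the contingency table in this lesson (1000 patients).
n = 1000
n_af_and_warfarin = 200
n_af = 300        # 200 + 100
n_warfarin = 350  # 200 + 150

p_a = n_af / n                      # P(AF)
p_b = n_warfarin / n                # P(Warfarin)
p_a_and_b = n_af_and_warfarin / n   # P(AF ∩ Warfarin)

p_a_given_b = p_a_and_b / p_b       # P(AF | Warfarin)
p_b_given_a = p_a_and_b / p_a       # P(Warfarin | AF)

# Multiplication rule: both factorizations recover the same joint probability.
assert math.isclose(p_a_given_b * p_b, p_a_and_b)
assert math.isclose(p_b_given_a * p_a, p_a_and_b)

# Bayes' theorem: recover P(AF | Warfarin) from P(Warfarin | AF).
p_a_given_b_bayes = p_b_given_a * p_a / p_b
assert math.isclose(p_a_given_b_bayes, p_a_given_b)

print(f"P(AF | Warfarin) = {p_a_given_b:.3f}")
print(f"P(Warfarin | AF) = {p_b_given_a:.3f}")
```

Note that P(AF | Warfarin) = 200/350 and P(Warfarin | AF) = 200/300 differ: swapping the conditioning event changes the denominator, which is exactly what Bayes' theorem corrects for.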

Chain Rule for Sequences

P(A, B, C) = P(A) × P(B | A) × P(C | A, B)

More generally:
P(X₁, X₂, ..., Xₙ) = Π P(Xᵢ | X₁, ..., Xᵢ₋₁)    (product over i = 1, ..., n; the i = 1 factor is just P(X₁))

This is how autoregressive language models work:
  P("The cat sat") = P("The") × P("cat" | "The") × P("sat" | "The cat")

Python
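# Language model probability under the chain rule
def sequence_probability(
    tokens: list[str],
    model,  # language model with predict_next_token(context) -> {token: prob}
) -> float:
    prob = 1.0
    # Start at i = 0 so the first factor P(tokens[0]) is included.
    # Its context is the empty list (assumes the model accepts an empty
    # context, e.g. by treating it as a start-of-sequence state).
    for i in range(len(tokens)):
        context = tokens[:i]
        next_token = tokens[i]
        # Floor unseen tokens at a tiny probability instead of zero.
        p_next = model.predict_next_token(context).get(next_token, 1e-10)
        prob *= p_next
    # The raw product underflows for long sequences; sum log-probabilities in practice.
    return prob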
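
A product of many small probabilities underflows quickly, so sequence scores are usually accumulated in log space. The sketch below illustrates this with a hypothetical `ToyModel` whose probabilities are made up; it looks only at the previous token, which is a simplification chosen to keep the example short (a real autoregressive model conditions on the whole context).

```python
import math


class ToyModel:
    """Hypothetical stand-in for a language model; all probabilities are made up."""

    def __init__(self):
        # Keyed by the previous token; "" stands for the start of the sequence.
        self.table = {
            "": {"The": 0.5, "A": 0.5},
            "The": {"cat": 0.4, "dog": 0.6},
            "cat": {"sat": 0.25, "ran": 0.75},
        }

    def predict_next_token(self, context):
        prev = context[-1] if context else ""
        return self.table.get(prev, {})


def sequence_log_probability(tokens, model, floor=1e-10):
    """Sum of log P(token_i | tokens before i), including the first token."""
    log_prob = 0.0
    for i, token in enumerate(tokens):
        # Unseen tokens get a small floor probability so log() stays defined.
        p = model.predict_next_token(tokens[:i]).get(token, floor)
        log_prob += math.log(p)
    return log_prob


log_p = sequence_log_probability(["The", "cat", "sat"], ToyModel())
# exp(log_p) equals the chain-rule product 0.5 * 0.4 * 0.25
print(log_p, math.exp(log_p))
```

Summing logs gives the same ranking of sequences as multiplying probabilities, and stays numerically stable for sequences with hundreds of tokens.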

Conditional Independence

A and B are conditionally independent given C if:
  P(A | B, C) = P(A | C)
  Equivalently: P(A, B | C) = P(A | C) × P(B | C)

In words: once you know C, knowing B adds no information about A.

Example (Naive Bayes assumption):
  Symptoms are conditionally independent given disease diagnosis
  P(fever, cough | flu) = P(fever | flu) × P(cough | flu)
  
  This is usually not exactly true (fever and cough are correlated)
  but it's a useful approximation that makes computation tractable.
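
A minimal sketch of how this factorization is used for classification follows. All priors and likelihoods below are invented for illustration, not clinical estimates, and the function name is hypothetical.

```python
import math

# Hypothetical priors P(class) and per-symptom likelihoods P(symptom | class).
priors = {"flu": 0.1, "no_flu": 0.9}
likelihoods = {
    "flu": {"fever": 0.8, "cough": 0.7},
    "no_flu": {"fever": 0.1, "cough": 0.2},
}


def naive_bayes_posterior(observed_symptoms):
    """P(class | symptoms), assuming symptoms are independent given the class."""
    unnormalized = {}
    for cls, prior in priors.items():
        # Naive Bayes assumption: the joint likelihood is a product of
        # per-symptom likelihoods, P(s1, s2 | cls) = P(s1 | cls) * P(s2 | cls).
        joint_likelihood = math.prod(
            likelihoods[cls][s] for s in observed_symptoms
        )
        unnormalized[cls] = prior * joint_likelihood
    evidence = sum(unnormalized.values())  # P(symptoms), by total probability
    return {cls: value / evidence for cls, value in unnormalized.items()}


posterior = naive_bayes_posterior(["fever", "cough"])
print(posterior)
```

Without the independence assumption, the model would need a separate likelihood for every combination of symptoms; with it, one number per symptom per class is enough.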

Example (Markov property):
  Future state depends only on current state, not history
  P(Xₜ₊₁ | X₁, X₂, ..., Xₜ) = P(Xₜ₊₁ | Xₜ)
  Used in: HMMs, reinforcement learning, time series models
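
Under the Markov property, the chain rule collapses to a product of one-step transition probabilities. The sketch below uses a made-up two-state weather chain; the states and numbers are assumptions for illustration only.

```python
import math

# Hypothetical initial distribution P(X1) and transitions P(X_{t+1} | X_t).
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def markov_sequence_probability(states):
    """P(X1, ..., Xn) = P(X1) * product of P(X_{t+1} | X_t)."""
    if not states:
        return 1.0  # empty product
    prob = initial[states[0]]
    # Each factor conditions only on the previous state, not the full history.
    for prev, nxt in zip(states, states[1:]):
        prob *= transition[prev][nxt]
    return prob


p = markov_sequence_probability(["sunny", "sunny", "rainy"])
# 0.6 * 0.8 * 0.2
print(p)
```

Compared with the general chain rule, each factor here needs only a lookup on the previous state, so the model is fully described by the initial distribution and one transition table.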

Classifier Output as Conditional Probability

Python
import torch
import torch.nn as nn

# A softmax classifier outputs the model's estimate of P(class | input features)
class SimpleClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )
    
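    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.net(x)
        # Softmax turns raw scores into an estimate of P(class | x) for each class.
        # Note: nn.CrossEntropyLoss expects raw logits, so compute the training
        # loss on self.net(x) rather than on this softmax output.
        return torch.softmax(logits, dim=-1)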
    
    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.forward(x)

# The output satisfies (classes indexed 0..K-1, matching the tensor columns):
# P(class=0 | x) + P(class=1 | x) + ... + P(class=K-1 | x) = 1
# (conditional probability sums to 1 over mutually exclusive classes)

# Binary classifier threshold
def predict_with_threshold(model, x, threshold=0.5):
    prob_class1 = model(x)[:, 1]  # P(class=1 | x)
    return (prob_class1 >= threshold).int()
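
The normalization and thresholding steps can also be checked without PyTorch. The following is a small dependency-free sketch; the logits are made-up numbers and the helper names are chosen here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
probs = softmax(logits)
print(probs, sum(probs))  # the entries sum to 1 up to floating-point error


def binary_decision(p_class1, threshold=0.5):
    """Predict class 1 when the estimated P(class=1 | x) reaches the threshold."""
    return int(p_class1 >= threshold)


binary_probs = softmax([0.0, 1.0])  # [P(class=0 | x), P(class=1 | x)]
print(binary_decision(binary_probs[1]))                 # default threshold 0.5
print(binary_decision(binary_probs[1], threshold=0.9))  # stricter threshold
```

Raising the threshold trades recall for precision: the same estimated probability can produce a positive prediction at 0.5 and a negative one at 0.9.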

Interview Answer

"P(A|B) = P(A,B)/P(B), defined when P(B) > 0: restrict the sample space to B, then compute what fraction of those outcomes satisfy A. The multiplication rule follows directly: P(A,B) = P(A|B)×P(B). The chain rule extends this to sequences: P(X₁,...,Xₙ) = Π P(Xᵢ|X₁,...,Xᵢ₋₁), which is the mathematical foundation of autoregressive language models. A neural network classifier with a softmax output produces an estimate of the conditional probability P(y|x). Conditional independence, P(A|B,C) = P(A|C), underlies the Naive Bayes assumption: given the class label, features are independent of each other, a simplification that makes inference tractable."