Joint, Marginal, and Conditional Probability
The three types of probability and how they relate — joint P(A,B), marginal P(A), and conditional P(A|B) — with medical examples and ML applications.
Three Types of Probability
Joint probability P(A, B) = P(A ∩ B):
Probability that BOTH A and B occur
"Patient has AF AND is on warfarin"
Marginal probability P(A):
Probability that A occurs regardless of B
"Patient has AF" (ignoring warfarin status)
Conditional probability P(A | B):
Probability that A occurs GIVEN that B has already occurred
"Given the patient is on warfarin, what's the probability they have AF?"

The Relationships
P(A | B) = P(A, B) / P(B) [definition of conditional probability]
Rearranged:
P(A, B) = P(A | B) × P(B) [multiplication rule]
Marginalisation (summing over the other variable):
P(A) = Σ_b P(A, B = b) [for discrete B]
P(A) = ∫ p(A, b) db [for continuous B, where p is a density]

Worked Example: Clinical Lab Results
Population of 1000 patients:
        | Warfarin | No Warfarin | Total
AF      | 200      | 100         | 300
No AF   | 150      | 550         | 700
Total   | 350      | 650         | 1000
Joint probabilities:
P(AF, Warfarin) = 200/1000 = 0.20
P(AF, No Warfarin) = 100/1000 = 0.10
P(No AF, Warfarin) = 150/1000 = 0.15
P(No AF, No Warfarin) = 550/1000 = 0.55
Marginal probabilities (sum rows/columns):
P(AF) = (200 + 100) / 1000 = 0.30
P(Warfarin) = (200 + 150) / 1000 = 0.35
Conditional probabilities:
P(AF | Warfarin) = P(AF, Warfarin) / P(Warfarin) = 0.20 / 0.35 = 0.571
→ 57% of patients on warfarin have AF
P(Warfarin | AF) = P(AF, Warfarin) / P(AF) = 0.20 / 0.30 = 0.667
→ 67% of AF patients are on warfarin

Python: Computing All Three
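As a quick check on the worked example, the multiplication rule should recover the same joint probability whichever variable is conditioned on. The sketch below uses plain NumPy; the array layout and variable names are assumptions made for illustration, not part of the original example.

```python
import numpy as np

# Counts from the worked example.
# Rows = (AF, No AF), columns = (Warfarin, No Warfarin).
counts = np.array([[200, 100],
                   [150, 550]])
joint = counts / counts.sum()              # joint probabilities

p_af = joint[0, :].sum()                   # marginal P(AF)
p_warfarin = joint[:, 0].sum()             # marginal P(Warfarin)

p_af_given_w = joint[0, 0] / p_warfarin    # P(AF | Warfarin)
p_w_given_af = joint[0, 0] / p_af          # P(Warfarin | AF)

# Multiplication rule, applied in both directions.
via_warfarin = p_af_given_w * p_warfarin   # P(AF | Warfarin) x P(Warfarin)
via_af = p_w_given_af * p_af               # P(Warfarin | AF) x P(AF)

print(f"P(AF | Warfarin) = {p_af_given_w:.3f}")
print(f"P(Warfarin | AF) = {p_w_given_af:.3f}")
print(f"P(AF, Warfarin) via either route: {via_warfarin:.2f}, {via_af:.2f}")
```

The two conditionals differ (0.571 vs 0.667) because they divide the same joint probability by different marginals, yet both factorisations multiply back to the same joint value.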
import numpy as np
import pandas as pd
# Patient-level data built from the worked example's counts:
# (AF=1, W=1): 200, (AF=1, W=0): 100, (AF=0, W=1): 150, (AF=0, W=0): 550
data = pd.DataFrame({
    "AF":       [1] * 200 + [1] * 100 + [0] * 150 + [0] * 550,
    "Warfarin": [1] * 200 + [0] * 100 + [1] * 150 + [0] * 550,
})
# Cross-tabulation
ct = pd.crosstab(data["AF"], data["Warfarin"])
n = len(data)
# Joint probabilities
joint = ct / n
print("Joint P(AF, Warfarin):")
print(joint)
# Marginal probabilities
p_af = joint.sum(axis=1) # sum over Warfarin columns
p_warfarin = joint.sum(axis=0) # sum over AF rows
print(f"\nP(AF=1) = {p_af[1]:.3f}")
print(f"P(Warfarin=1) = {p_warfarin[1]:.3f}")
# Conditional probability P(AF | Warfarin)
def conditional(joint_df, target_row, given_col_val):
    """P(row variable == target_row | column variable == given_col_val)."""
    marginal = joint_df[given_col_val].sum()  # P(column variable = given_col_val)
    joint_val = joint_df.loc[target_row, given_col_val]
    return float(joint_val / marginal)

p_af_given_warfarin = conditional(joint, target_row=1, given_col_val=1)
print(f"\nP(AF=1 | Warfarin=1) = {p_af_given_warfarin:.3f}")

Marginalisation: Summing Over Unknowns
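Marginalising the joint and applying the law of total probability are two routes to the same marginal: each joint term P(A, B = b) can be expanded as P(A | B = b) × P(B = b) by the multiplication rule. A minimal sketch using the AF and warfarin figures from the worked example; the dictionary names are illustrative assumptions.

```python
# Joint probabilities P(AF, B = b) from the worked example, keyed by warfarin status.
p_af_and = {"warfarin": 0.20, "no_warfarin": 0.10}
# Marginal probabilities P(B = b).
p_b = {"warfarin": 0.35, "no_warfarin": 0.65}

# Route 1: marginalise the joint directly.
p_af_from_joint = sum(p_af_and.values())

# Route 2: expand each joint term as P(AF | B = b) * P(B = b), then sum.
p_af_given_b = {b: p_af_and[b] / p_b[b] for b in p_b}
p_af_from_conditionals = sum(p_af_given_b[b] * p_b[b] for b in p_b)

print(f"P(AF) from joint:        {p_af_from_joint:.2f}")
print(f"P(AF) from conditionals: {p_af_from_conditionals:.2f}")
```

Route 2 is the form that matters in practice, because the conditionals and the marginal of B are often what is known (for example, test sensitivity and disease prevalence) while the joint is not.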
# Law of Total Probability:
# P(A) = Σ P(A | B = b) × P(B = b)
#
# Example: P(positive test)
# = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease)
def total_probability(
    p_b: dict,          # {b_value: P(B = b_value)}
    p_a_given_b: dict,  # {b_value: P(A | B = b_value)}
) -> float:
    """P(A) = sum over b of P(A | B = b) * P(B = b)."""
    return sum(p_b[b] * p_a_given_b[b] for b in p_b)
# P(positive test) for a disease with:
# prevalence = 5%, sensitivity = 90%, specificity = 95%
p_disease = {"has_disease": 0.05, "no_disease": 0.95}
p_positive_given = {"has_disease": 0.90, "no_disease": 0.05}
p_positive = total_probability(p_disease, p_positive_given)
print(f"P(positive test) = {p_positive:.4f}") # 0.0925

In Neural Networks: P(y | x)
A classification neural network with a softmax output layer models P(y | x):
P(class = k | input features = x)
This is a conditional probability:
"Given these pixel values, what is the probability that the image is a cat?"
The softmax function ensures these probabilities sum to 1 across classes:
P(y = 1 | x) + P(y = 2 | x) + ... + P(y = K | x) = 1
This is the normalisation constraint: for a fixed input x, the conditional
probabilities over all class values sum to 1 (normalisation axiom).
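A minimal sketch of that normalisation, assuming a single input with hypothetical logits for K = 3 classes. The `softmax` helper here is written out for illustration rather than taken from any particular framework.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map a vector of real-valued scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for one input x and three classes (e.g. cat, dog, bird).
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)

print(probs)         # estimated P(y = k | x) for each class k
print(probs.sum())   # sums to 1 up to floating-point error
```

Subtracting the maximum logit leaves the result unchanged, since the common factor cancels in the ratio, and avoids overflow in `np.exp` for large scores.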
In generative models (e.g., GPT):
P(next_token | previous_tokens)
Joint probability of a sequence:
P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ...
(chain rule of probability)

Interview Answer
"Joint probability P(A,B) is the probability that both A and B occur. Marginal probability P(A) is obtained by summing the joint over all values of B, 'marginalising out' B. Conditional probability P(A|B) = P(A,B)/P(B) is the probability of A given that B occurred. The three are connected through P(A,B) = P(A|B) × P(B): knowing any two of the joint, the conditional, and the marginal of the conditioning variable lets you compute the third. In ML: a classifier outputs a conditional probability P(y|x); the law of total probability gives the class prior P(y) by marginalising over inputs, P(y) = Σ_x P(y|x) × P(x); and the chain rule of probability underpins language model training, where P(sequence) is the product of conditional token probabilities."