AI Systems | Beginner

Joint, Marginal, and Conditional Probability

The three types of probability and how they relate — joint P(A,B), marginal P(A), and conditional P(A|B) — with medical examples and ML applications.

Asma Hafeez Khan | May 21, 2026 | 4 min read
Tags: Probability, Joint, Marginal, Conditional, Interview

Three Types of Probability

Joint probability P(A, B) = P(A ∩ B):
  Probability that BOTH A and B occur
  "Patient has AF AND is on warfarin"

Marginal probability P(A):
  Probability that A occurs regardless of B
  "Patient has AF" (ignoring warfarin status)

Conditional probability P(A | B):
  Probability that A occurs GIVEN that B has already occurred
  "Given the patient is on warfarin, what's the probability they have AF?"

The Relationships

P(A | B) = P(A, B) / P(B)           [definition of conditional probability]

Rearranged:
P(A, B) = P(A | B) × P(B)           [multiplication rule]

Marginalisation (summing over the other variable):
P(A) = Σ_b P(A, B = b)              [for discrete B]
p(a) = ∫ p(a, b) db                  [for continuous B; p is a probability density]
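
The identities above can be checked numerically. The following is a minimal sketch, assuming a small made-up joint distribution over two binary variables; the numbers and function names are illustrative and not taken from any dataset.

```python
# Hypothetical joint distribution P(A=a, B=b) over two binary variables.
# The four entries sum to 1.
joint = {
    (0, 0): 0.40,
    (0, 1): 0.10,
    (1, 0): 0.20,
    (1, 1): 0.30,
}

def marginal_a(a):
    """P(A=a): sum the joint over every value of B."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)

def marginal_b(b):
    """P(B=b): sum the joint over every value of A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)

def conditional_a_given_b(a, b):
    """P(A=a | B=b) = P(A=a, B=b) / P(B=b)."""
    return joint[(a, b)] / marginal_b(b)

p_a1 = marginal_a(1)
p_b1 = marginal_b(1)
p_a1_given_b1 = conditional_a_given_b(1, 1)

print(f"P(A=1)       = {p_a1:.2f}")
print(f"P(B=1)       = {p_b1:.2f}")
print(f"P(A=1 | B=1) = {p_a1_given_b1:.2f}")

# Multiplication rule: P(A=1 | B=1) * P(B=1) should give back P(A=1, B=1).
print(f"P(A=1 | B=1) * P(B=1) = {p_a1_given_b1 * p_b1:.2f}")
```

Storing the joint as a dictionary keyed by (a, b) keeps marginalisation a one-line sum, at the cost of scanning all entries on each call.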

Worked Example: Clinical Lab Results

Population of 1000 patients:

              | Warfarin | No Warfarin | Total
AF            |   200    |     100     |  300
No AF         |   150    |     550     |  700
Total         |   350    |     650     | 1000

Joint probabilities:
  P(AF, Warfarin)    = 200/1000 = 0.20
  P(AF, No Warfarin) = 100/1000 = 0.10
  P(No AF, Warfarin) = 150/1000 = 0.15
  P(No AF, No Warfarin) = 550/1000 = 0.55

Marginal probabilities (sum rows/columns):
  P(AF)       = (200 + 100) / 1000 = 0.30
  P(Warfarin) = (200 + 150) / 1000 = 0.35

Conditional probabilities:
  P(AF | Warfarin) = P(AF, Warfarin) / P(Warfarin) = 0.20 / 0.35 = 0.571
  → 57% of patients on warfarin have AF
  
  P(Warfarin | AF) = P(AF, Warfarin) / P(AF) = 0.20 / 0.30 = 0.667
  → 67% of AF patients are on warfarin
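
As a quick consistency check on the numbers above, the multiplication rule should recover the same joint probability whichever variable is conditioned on. A minimal sketch using only the counts from the table (the variable names are illustrative):

```python
# Counts copied from the 2x2 table above.
counts = {
    ("AF", "Warfarin"): 200,
    ("AF", "No Warfarin"): 100,
    ("No AF", "Warfarin"): 150,
    ("No AF", "No Warfarin"): 550,
}
n = sum(counts.values())

p_joint = counts[("AF", "Warfarin")] / n
p_af = (counts[("AF", "Warfarin")] + counts[("AF", "No Warfarin")]) / n
p_warfarin = (counts[("AF", "Warfarin")] + counts[("No AF", "Warfarin")]) / n

p_af_given_warfarin = p_joint / p_warfarin
p_warfarin_given_af = p_joint / p_af

# Both factorisations give back P(AF, Warfarin).
via_warfarin = p_af_given_warfarin * p_warfarin
via_af = p_warfarin_given_af * p_af

print(f"P(AF | Warfarin) * P(Warfarin) = {via_warfarin:.2f}")
print(f"P(Warfarin | AF) * P(AF)       = {via_af:.2f}")
```

The two conditionals differ (about 0.571 versus 0.667) because they are divided by different marginals, even though both come from the same joint.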

Python: Computing All Three

Python
import numpy as np
import pandas as pd

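# Patient-level data built from the contingency-table counts above.
# Keys are (AF, Warfarin); values are the number of patients in that cell.
counts = {(1, 1): 200, (1, 0): 100, (0, 1): 150, (0, 0): 550}
data = pd.DataFrame(
    [(af, warfarin) for (af, warfarin), c in counts.items() for _ in range(c)],
    columns=["AF", "Warfarin"],
)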

# Cross-tabulation
ct = pd.crosstab(data["AF"], data["Warfarin"])
n = len(data)

# Joint probabilities
joint = ct / n
print("Joint P(AF, Warfarin):")
print(joint)

# Marginal probabilities
p_af      = joint.sum(axis=1)  # sum over Warfarin columns
p_warfarin = joint.sum(axis=0)  # sum over AF rows
print(f"\nP(AF=1) = {p_af[1]:.3f}")
print(f"P(Warfarin=1) = {p_warfarin[1]:.3f}")

# Conditional probability P(AF | Warfarin)
def conditional(joint_df, target_row, given_col_val):
    """P(row variable == target_row | column variable == given_col_val)."""
    marginal = joint_df[given_col_val].sum()  # P(column variable = given_col_val)
    joint_val = joint_df.loc[target_row, given_col_val]
    return float(joint_val / marginal)

p_af_given_warfarin = conditional(joint, target_row=1, given_col_val=1)
print(f"\nP(AF=1 | Warfarin=1) = {p_af_given_warfarin:.3f}")
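
pandas can also produce the conditional tables directly through the `normalize` argument of `pd.crosstab`. The sketch below is self-contained and rebuilds the patient rows from the worked-example counts; the `patients` frame and variable names are assumptions made for illustration.

```python
import pandas as pd

# Rebuild one row per patient from the worked-example counts.
# Keys are (AF, Warfarin); values are patient counts.
counts = {(1, 1): 200, (1, 0): 100, (0, 1): 150, (0, 0): 550}
patients = pd.DataFrame(
    [(af, warfarin) for (af, warfarin), c in counts.items() for _ in range(c)],
    columns=["AF", "Warfarin"],
)

# normalize="all": every cell divided by the grand total -> joint P(AF, Warfarin)
joint_tab = pd.crosstab(patients["AF"], patients["Warfarin"], normalize="all")

# normalize="columns": each column sums to 1 -> P(AF | Warfarin)
af_given_warfarin = pd.crosstab(
    patients["AF"], patients["Warfarin"], normalize="columns"
)

# normalize="index": each row sums to 1 -> P(Warfarin | AF)
warfarin_given_af = pd.crosstab(
    patients["AF"], patients["Warfarin"], normalize="index"
)

print(f"P(AF=1, Warfarin=1)   = {joint_tab.loc[1, 1]:.3f}")
print(f"P(AF=1 | Warfarin=1)  = {af_given_warfarin.loc[1, 1]:.3f}")
print(f"P(Warfarin=1 | AF=1)  = {warfarin_given_af.loc[1, 1]:.3f}")
```

In each table the rows are AF values and the columns are Warfarin values, so `.loc[1, 1]` always selects the AF=1, Warfarin=1 cell; only the denominator changes.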

Marginalisation: Summing Over Unknowns

Python
# Law of Total Probability:
# P(A) = Σ P(A | B = b) × P(B = b)
#
# Example: P(positive test)
# = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease)

def total_probability(
    p_b: dict,             # {b_value: P(B = b_value)}
    p_a_given_b: dict,     # {b_value: P(A | B = b_value)}
) -> float:
    return sum(p_b[b] * p_a_given_b[b] for b in p_b)

# P(positive test) for a disease with:
#   prevalence = 5%, sensitivity = 90%, specificity = 95%
p_disease = {"has_disease": 0.05, "no_disease": 0.95}
p_positive_given = {"has_disease": 0.90, "no_disease": 0.05}  # 0.05 = 1 - specificity

p_positive = total_probability(p_disease, p_positive_given)
print(f"P(positive test) = {p_positive:.4f}")  # 0.0925
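
The marginal P(positive test) is also the denominator needed to reverse the conditioning with the definition P(A | B) = P(A, B) / P(B). The sketch below reuses the same prevalence, sensitivity, and specificity; the variable names are illustrative.

```python
# Same test characteristics as in the example above.
prevalence = 0.05    # P(disease)
sensitivity = 0.90   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)

false_positive_rate = 1 - specificity  # P(positive | no disease)

# Law of total probability: marginalise over disease status.
p_positive = (
    sensitivity * prevalence
    + false_positive_rate * (1 - prevalence)
)

# Joint P(disease, positive) via the multiplication rule.
p_disease_and_positive = sensitivity * prevalence

# Conditional P(disease | positive) = joint / marginal.
p_disease_given_positive = p_disease_and_positive / p_positive

print(f"P(positive)           = {p_positive:.4f}")
print(f"P(disease | positive) = {p_disease_given_positive:.4f}")
```

With these inputs fewer than half of positive results correspond to actual disease, because the 5% prevalence makes false positives from the healthy majority nearly as common as true positives.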

In Neural Networks: P(y | x)

A classification network with a softmax output layer produces an estimate of P(y | x):
  P(class = k | input features = x)
  
  This is a conditional probability:
  "Given these pixel values, what is the probability that the image is a cat?"

The softmax function ensures these probabilities sum to 1 across classes:
  P(y = 1 | x) + P(y = 2 | x) + ... + P(y = K | x) = 1

This is the normalisation requirement for a conditional distribution:
for a fixed input x, the probabilities of all K classes must sum to 1.
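
A minimal NumPy sketch of softmax illustrates the normalisation. The logits here are made-up numbers standing in for the raw outputs of a network's final layer.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits for K = 3 classes given one input x.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print("P(y = k | x) for each class:", np.round(probs, 3))
print("Sum over classes:", round(float(probs.sum()), 6))
```

Subtracting the maximum logit does not change the result, since the common factor cancels in the ratio, but it avoids overflow when logits are large.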

In autoregressive language models (e.g., GPT), the model predicts:
  P(next_token | previous_tokens)
  The joint probability of a sequence then factorises as:
  P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ...
  (chain rule of probability)
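
The chain rule can be sketched in a few lines. The per-token conditional probabilities below are invented for illustration; in a real model they would come from the network's output at each position.

```python
import math

# Hypothetical conditional probabilities for a 3-token sequence:
#   P(w1), P(w2 | w1), P(w3 | w1, w2)
step_probs = [0.20, 0.50, 0.10]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(step_probs)

# In practice the sum of log-probabilities is used instead, because a
# product of many small numbers underflows to zero in floating point.
log_prob = sum(math.log(p) for p in step_probs)

print(f"P(w1, w2, w3)     = {joint_prob:.4f}")
print(f"log P(w1, w2, w3) = {log_prob:.4f}")
print(f"exp(log P)        = {math.exp(log_prob):.4f}")
```

Summing log-probabilities gives the same quantity as the product, just on a log scale, which is why training losses for language models are usually written as a sum of per-token terms.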

Interview Answer

"Joint probability P(A,B) is the probability that both A and B occur. Marginal probability P(A) is obtained by summing the joint over all values of B, which is called 'marginalising out' B. Conditional probability P(A|B) = P(A,B)/P(B) is the probability of A given that B occurred. The three are connected: the joint determines every marginal and conditional, and a conditional combined with the matching marginal recovers the joint, P(A,B) = P(A|B) × P(B). The two marginals alone are not enough to recover the joint. In ML: a classifier estimates the conditional probability P(y|x); the law of total probability gives P(y) = Σ_x P(y|x) × P(x) by marginalising over the inputs; and the chain rule of probability underpins language model training, where P(sequence) is the product of conditional token probabilities."
