Statistics & Math for AI/ML Interviews · Lesson 27 of 30
Correlation vs Causation
The Classic Warning
"Correlation does not imply causation" means: just because two variables move together does not mean one causes the other.
Strong correlation (r ≈ 0.99):
Ice cream sales and drowning rates (by month)
Does ice cream cause drowning? No.
Confounding variable: summer (warm weather → more swimming + more ice cream)
Controlling for the confounder (season) removes most of the correlation.
Why Correlations Arise Without Causation
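The common-cause pattern behind the ice cream example can be checked with a small simulation. This is a sketch under assumed numbers: the coefficients, noise levels, and the helper `residualise` are illustrative, not taken from real sales or drowning data. Temperature drives both variables, neither affects the other, and the partial correlation (after regressing temperature out of both) is compared with the raw one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months = 5000  # many simulated "months" so the estimates are stable

# Common cause: temperature drives both variables; neither affects the other.
temperature = rng.normal(20.0, 8.0, n_months)
ice_cream = 3.0 * temperature + rng.normal(0.0, 10.0, n_months)
drownings = 0.5 * temperature + rng.normal(0.0, 2.0, n_months)


def residualise(y, x):
    """Return the part of y not linearly explained by x (with an intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


raw_r = np.corrcoef(ice_cream, drownings)[0, 1]
partial_r = np.corrcoef(
    residualise(ice_cream, temperature),
    residualise(drownings, temperature),
)[0, 1]

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```

With these assumed parameters the raw correlation is strongly positive, while the partial correlation sits close to zero, which is what "controlling for the confounder" means in practice.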
1. Common cause (confounding):
A → X and A → Y, so X correlates with Y
Example: poverty → poor diet AND poor health → diet correlates with health
(diet doesn't necessarily cause health outcomes directly; poverty drives both)
2. Reverse causation:
Y causes X, not X causes Y
Example: hospitalised patients have more medication use
Correlation: medication ~ hospitalisation
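Naive reading: medication leads to hospitalisation
Reality: being admitted leads to more prescriptions, so the arrow runs from hospitalisation to medication (illness severity can also drive both, which is case 1)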
3. Coincidence (spurious correlation):
No causal mechanism at all — just happened to correlate in this dataset
Example: number of Nicolas Cage films per year correlates with pool drownings
Usually weakens or disappears with more data or external validation
4. Selection bias:
The way data was collected creates a correlation
Example: Berkson's paradox — in hospitalised patients, two diseases may
appear negatively correlated even if independent in the population
Clinical ML Examples
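Berkson's paradox is a useful first clinical case because it arises purely from who ends up in a hospital dataset. The sketch below uses made-up prevalences and a hypothetical admission rule (each disease leads to admission half the time, plus a small rate of unrelated admissions). The two diseases are generated independently, then correlated once in the whole population and once among admitted patients only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two diseases that are independent in the general population.
disease_a = rng.random(n) < 0.10
disease_b = rng.random(n) < 0.10

# Hypothetical admission rule: each disease leads to admission half the time,
# and 2% of people are admitted for unrelated reasons.
admitted = (
    (disease_a & (rng.random(n) < 0.5))
    | (disease_b & (rng.random(n) < 0.5))
    | (rng.random(n) < 0.02)
)


def correlation(x, y):
    """Pearson correlation between two boolean indicator arrays."""
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]


population_r = correlation(disease_a, disease_b)
hospital_r = correlation(disease_a[admitted], disease_b[admitted])

print(f"population correlation: {population_r:.3f}")
print(f"in-hospital correlation: {hospital_r:.3f}")
```

Admission is a common effect of both diseases, so conditioning on it (by only sampling hospitalised patients) induces a negative correlation that does not exist in the population.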
Correlation: patients on more medications have worse outcomes
Naive interpretation: medications cause worse outcomes → prescribe less
Reality: sicker patients receive more medications → confounding by illness severity
Fix: control for comorbidity score (e.g., Charlson index)
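The effect of controlling for severity can be sketched with simulated data. Everything here is assumed for illustration: `severity` stands in for a comorbidity score, the coefficients are invented, and medications are set up to slightly improve the outcome so that the naive and adjusted estimates disagree in sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Assumed data-generating process: severity drives both variables, and each
# extra medication slightly improves the outcome (higher score = worse).
severity = rng.normal(0.0, 1.0, n)  # stand-in for a comorbidity score
n_meds = 5.0 + 2.0 * severity + rng.normal(0.0, 1.0, n)
bad_outcome = 3.0 * severity - 0.3 * n_meds + rng.normal(0.0, 1.0, n)


def ols_coefficients(y, *columns):
    """Least-squares coefficients for y ~ intercept + columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Coefficient on n_meds without and with severity in the regression.
naive = ols_coefficients(bad_outcome, n_meds)[1]
adjusted = ols_coefficients(bad_outcome, n_meds, severity)[1]

print(f"naive medication coefficient:    {naive:+.2f}")
print(f"adjusted medication coefficient: {adjusted:+.2f}")
```

The naive coefficient is positive (more medications, worse outcomes), while the adjusted coefficient recovers the assumed protective effect of about -0.3. Adjustment only works for confounders that are measured and included.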
Correlation: hospitals with more ICU beds have higher mortality rates
Naive interpretation: ICU beds cause death
Reality: sicker patients are referred to hospitals with more ICU capacity
Fix: risk-adjust for case mix before comparing hospitals
Correlation: model trained on EHR data → high body mass index predicts cancer
Naive interpretation: BMI causes cancer
Reality: overweight patients may receive more cancer screening → detection bias
BMI correlates with cancer diagnoses, not necessarily cancer incidence
How to Investigate Causation
Randomised Controlled Trial (RCT): gold standard
Randomly assign subjects to treatment vs control
Randomisation balances confounders, measured and unmeasured, across groups on average
Limitation: often impractical, expensive, or unethical in medicine
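A short simulation shows why random assignment matters. This is a sketch with an assumed treatment effect and an invented assignment rule for the observational case; the same outcome model is used twice, once with sicker patients more likely to be treated and once with a coin flip.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
true_effect = -1.0  # assumed: treatment lowers the (bad) outcome score by 1

severity = rng.normal(0.0, 1.0, n)


def simulate_outcome(treated):
    """Outcome depends on severity and on treatment (higher = worse)."""
    return 2.0 * severity + true_effect * treated + rng.normal(0.0, 1.0, n)


# Observational assignment: sicker patients are more likely to be treated.
obs_treated = (severity + rng.normal(0.0, 1.0, n) > 0).astype(float)
obs_outcome = simulate_outcome(obs_treated)

# Randomised assignment: a coin flip, unrelated to severity.
rct_treated = rng.integers(0, 2, n).astype(float)
rct_outcome = simulate_outcome(rct_treated)


def mean_difference(outcome, treated):
    """Difference in mean outcome, treated minus untreated."""
    return outcome[treated == 1].mean() - outcome[treated == 0].mean()


obs_estimate = mean_difference(obs_outcome, obs_treated)
rct_estimate = mean_difference(rct_outcome, rct_treated)

print(f"observational estimate: {obs_estimate:+.2f}")
print(f"randomised estimate:    {rct_estimate:+.2f}")
```

Under these assumptions the observational comparison makes the treatment look harmful, while the randomised comparison lands close to the assumed effect of -1.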
Instrumental Variables:
Find a variable (the instrument) that affects X, influences Y only through X, and shares no unmeasured causes with Y
Use it to isolate the variation in X that is free of confounding
Example: distance to hospital as instrument for treatment (affects treatment access
but is assumed not to affect the health outcome directly)
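The distance example can be sketched with the simplest instrumental-variable estimator, the ratio of two regression slopes (the Wald estimator). The data-generating process below is entirely assumed: treatment is modelled as a continuous intensity, severity is treated as unobserved, and distance is constructed to satisfy the instrument conditions by design, which real data cannot guarantee.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
true_effect = -1.0  # assumed causal effect of treatment on the outcome

unobserved_severity = rng.normal(0.0, 1.0, n)  # confounder we cannot adjust for
distance = rng.normal(0.0, 1.0, n)             # instrument (standardised distance)

# Treatment intensity falls with distance and rises with severity.
treatment = -0.8 * distance + 1.0 * unobserved_severity + rng.normal(0.0, 1.0, n)
# Outcome depends on treatment and severity, but not directly on distance.
outcome = true_effect * treatment + 2.0 * unobserved_severity + rng.normal(0.0, 1.0, n)


def slope(y, x):
    """Slope of the simple linear regression of y on x."""
    cov = np.cov(x, y)
    return cov[0, 1] / cov[0, 0]


naive_estimate = slope(outcome, treatment)
# Wald estimator: effect of instrument on outcome / effect of instrument on treatment.
iv_estimate = slope(outcome, distance) / slope(treatment, distance)

print(f"naive regression estimate: {naive_estimate:+.2f}")
print(f"instrumental estimate:     {iv_estimate:+.2f}")
```

The naive slope is pulled toward zero by the hidden confounder, while the instrumental estimate is close to the assumed effect. The estimate is only as good as the instrument assumptions, which cannot be fully tested from the data.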
Difference-in-Differences:
Compare before/after in treated group vs control group over same period
Removes time-constant confounders, assuming both groups would have followed parallel trends without treatment
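The arithmetic is a double subtraction, sketched here on simulated group means. The baselines, the shared trend, and the effect size are assumptions chosen so that the treated group starts from a worse level and both groups drift by the same amount over time.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
true_effect = -2.0  # assumed effect of treatment in the post period

# Time-constant confounder: the treated group starts from a worse baseline.
baseline = {"treated": 12.0, "control": 8.0}
common_trend = 1.5  # both groups drift by the same amount (parallel trends)


def sample(group, period):
    """Simulated outcomes for one group in one period (0 = before, 1 = after)."""
    mean = baseline[group] + common_trend * period
    if group == "treated" and period == 1:
        mean += true_effect
    return mean + rng.normal(0.0, 1.0, n)


means = {(g, t): sample(g, t).mean() for g in ("treated", "control") for t in (0, 1)}

# Change in treated group minus change in control group.
did = (means["treated", 1] - means["treated", 0]) - (
    means["control", 1] - means["control", 0]
)
# For contrast: comparing the groups after treatment only.
naive_post_gap = means["treated", 1] - means["control", 1]

print(f"difference-in-differences: {did:+.2f}")
print(f"naive post-period gap:     {naive_post_gap:+.2f}")
```

The post-period gap alone is positive because of the baseline difference, whereas the double difference cancels both the baseline gap and the shared trend and recovers roughly the assumed -2.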
Propensity Score Matching:
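In observational data, estimate each patient's probability of treatment from measured confounders, then match treated and untreated patients with similar scores
Then compare within matched pairs (unmeasured confounders are not addressed)
In Machine Learning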
# ML models learn correlations, not causation
# This has practical consequences:
# Example: A model predicts hospital readmission
# Feature: number_of_medications (highly correlated with readmission)
#
# If a hospital reduces medications to lower predicted readmission:
# → Model accuracy may drop (distribution shift)
# → Patient outcomes may worsen (medications were there for a reason)
# → The intervention acted on a correlated feature, not a causal mechanism
# Causal ML approaches:
# 1. Include domain knowledge: don't use features that are caused by the outcome
# 2. Structural causal models: explicit DAG of causal relationships
# 3. Counterfactual reasoning: "what would have happened if we hadn't intervened?"
# Detecting confounding in feature importance:
# If a feature is important at the training hospital but not in a different hospital's data:
# → It may be acting as a proxy for a confounder specific to the training hospital's population
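def check_feature_stability(
    model,
    X_site1: "pd.DataFrame",
    y_site1,
    X_site2: "pd.DataFrame",
    y_site2,
    feature_names: list[str],
) -> dict:
    """Compare feature importances across sites; instability suggests confounding."""
    from sklearn.inspection import permutation_importance

    # Permutation importance replaces model.feature_importances_, which only
    # reflects the training data. It needs labels, so y_site1 and y_site2 are
    # added here as an assumption. Columns of X_site1 and X_site2 are assumed
    # to match feature_names, in the order the model was trained on.
    imp1 = permutation_importance(model, X_site1, y_site1, random_state=0).importances_mean
    imp2 = permutation_importance(model, X_site2, y_site2, random_state=0).importances_mean
    return {
        "features": feature_names,
        "site1_importance": imp1.tolist(),
        "site2_importance": imp2.tolist(),
        "note": "High importance in site1 but not site2 = possible confounder",
    }
A/B Testing: Establishing Causation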
The standard way to establish causal effect of an ML system change:
A/B test = randomised experiment in software
Randomly assign users to control (A) or treatment (B)
Measure outcome difference
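The three steps above can be sketched as a simulated experiment with a binary "satisfied" outcome. The satisfaction rates are invented, and the significance check is a standard two-proportion z-test written by hand; `two_proportion_z_test` is a helper defined for this example, not a library function.

```python
import math

import numpy as np

rng = np.random.default_rng(6)
n_users = 20_000

# Randomised assignment: each user gets A (0) or B (1) by coin flip.
arm = rng.integers(0, 2, n_users)

# Assumed ground truth: B raises the satisfaction rate from 40% to 45%.
p_satisfied = np.where(arm == 1, 0.45, 0.40)
satisfied = rng.random(n_users) < p_satisfied


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (difference, z statistic, two-sided p-value) under a pooled null."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p_b - p_a, z, p_value


n_a, n_b = int((arm == 0).sum()), int((arm == 1).sum())
diff, z, p_value = two_proportion_z_test(
    int(satisfied[arm == 0].sum()), n_a, int(satisfied[arm == 1].sum()), n_b
)

print(f"difference in satisfaction rate (B - A): {diff:.3f}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Because assignment is random, the difference between arms estimates the causal effect of the change; the test only says whether that difference is larger than chance variation would explain.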
"The new RAG system increased user satisfaction by 12%"
→ If this comes from an A/B test, it IS a causal claim (randomised assignment balances confounders across arms)
"Users who clicked on recommendation X had higher engagement"
→ This is correlation (users who click on X may differ from those who don't)
→ Cannot claim X caused engagement without A/B test
Always design experiments for causal inference when making product decisions.
Always treat ML model outputs as correlational unless validated with an experiment.
Interview Answer
"Correlation measures how two variables move together; causation means one variable directly influences the other. The main reasons variables correlate without causation are common cause (a third variable drives both), reverse causation (Y causes X, not vice versa), spurious coincidence, and selection bias in how the data was collected. In clinical ML, this matters critically: a model seeing that medication count correlates with poor outcomes should not lead to reducing medications, because the confounder is illness severity, which causes both. Establishing causation requires randomisation (RCT or A/B test) or causal inference techniques (instrumental variables, propensity score matching, difference-in-differences). ML models learn correlations; deploying them to intervene on a correlated but non-causal feature can cause harm."