Learnixo

Statistics & Math for AI/ML Interviews · Lesson 27 of 30

Correlation vs Causation

The Classic Warning

"Correlation does not imply causation" means: just because two variables move together does not mean one causes the other.

Strong correlation (r ≈ 0.99):
  Ice cream sales and drowning rates (by month)
  
  Does ice cream cause drowning? No.
  Confounding variable: summer (warm weather → more swimming + more ice cream)
  
  Controlling for the confounder (comparing within the same season or temperature) largely removes the correlation.
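The ice cream example can be reproduced with a small simulation. The sketch below is illustrative only: the temperature distribution, the coefficients, and the helper names `pearson` and `residuals` are assumptions, not real data. It compares the raw correlation with the partial correlation after temperature has been regressed out of both variables.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Residuals of y after a least-squares fit on the single predictor x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)
n = 5000
# Hypothetical data: temperature drives both quantities; neither affects the other.
temperature = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson(ice_cream, drownings)
# Partial correlation: correlate what is left once temperature is removed from each.
partial_r = pearson(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```

Under these assumptions the raw correlation is strongly positive while the partial correlation sits near zero, which is what "controlling for the confounder" means in practice.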

Why Correlations Arise Without Causation

1. Common cause (confounding):
   A → X and A → Y, so X correlates with Y
   Example: poverty → poor diet AND poverty → poor health, so diet correlates with health
   (diet doesn't necessarily cause health outcomes directly; poverty drives both)

2. Reverse causation:
   Y causes X, not X causes Y
   Example: patients who take more medication tend to be sicker
   Correlation: medication use ~ illness
   Reality: illness leads to medication being prescribed, not medication to illness

3. Coincidence (spurious correlation):
   No causal mechanism at all — just happened to correlate in this dataset
   Example: number of Nicolas Cage films per year correlates with pool drownings
   Usually weakens or disappears with more data or validation on an independent dataset

4. Selection bias:
   The way data was collected creates a correlation
   Example: Berkson's paradox — in hospitalised patients, two diseases may
   appear negatively correlated even if independent in the population
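Berkson's paradox from item 4 is easy to demonstrate with simulated data. In this sketch the 10% prevalences and the admission probabilities are made-up assumptions, and `phi` is a hypothetical helper that computes the correlation between two yes/no variables.

```python
import random


def phi(a, b):
    """Pearson correlation between two equal-length 0/1 sequences."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x and y for x, y in zip(a, b)) / n
    return (pab - pa * pb) / (pa * (1 - pa) * pb * (1 - pb)) ** 0.5


rng = random.Random(1)
n = 20000
# Hypothetical population: two independent diseases, each with 10% prevalence.
disease_a = [rng.random() < 0.10 for _ in range(n)]
disease_b = [rng.random() < 0.10 for _ in range(n)]

# Assumed admission rule: 50% chance with at least one disease, 2% otherwise.
admitted = [
    rng.random() < (0.50 if (a or b) else 0.02)
    for a, b in zip(disease_a, disease_b)
]

# Keep only the hospitalised patients (the selected sample).
hosp_a = [a for a, adm in zip(disease_a, admitted) if adm]
hosp_b = [b for b, adm in zip(disease_b, admitted) if adm]

population_r = phi(disease_a, disease_b)
hospital_r = phi(hosp_a, hosp_b)

print(f"correlation in population:   {population_r:.2f}")
print(f"correlation among admitted:  {hospital_r:.2f}")
```

The diseases are independent by construction, yet among admitted patients they appear clearly negatively correlated, because having one disease already "explains" the admission.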

Clinical ML Examples

Correlation: patients on more medications have worse outcomes
  Naive interpretation: medications cause worse outcomes → prescribe less
  Reality: sicker patients receive more medications → confounding by illness severity
  Fix: control for comorbidity score (e.g., Charlson index)

Correlation: hospitals with more ICU beds have higher mortality rates
  Naive interpretation: ICU beds cause death
  Reality: sicker patients are referred to hospitals with more ICU capacity
  Fix: risk-adjust for case mix before comparing hospitals

Correlation: model trained on EHR data → high body mass index predicts cancer
  Naive interpretation: BMI causes cancer
  Reality: overweight patients may receive more cancer screening → detection bias
  BMI correlates with cancer diagnoses, not necessarily cancer incidence
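The medication example above can be sketched as a simulation in which medication actually helps but looks harmful until severity is controlled for. Everything here is hypothetical: `severity` is a three-level stand-in for a comorbidity score such as the Charlson index, and the probabilities are chosen only to make the effect visible.

```python
import random
from collections import defaultdict

rng = random.Random(2)

# Hypothetical cohort: (severity, high_meds, bad_outcome) per patient.
rows = []
for _ in range(30000):
    severity = rng.choice([0, 1, 2])
    # Sicker patients are more likely to be on many medications.
    high_meds = rng.random() < (0.1, 0.5, 0.9)[severity]
    # Assumed truth: medication LOWERS the risk of a bad outcome by 10 points.
    p_bad = 0.15 + 0.30 * severity - 0.10 * high_meds
    rows.append((severity, high_meds, rng.random() < p_bad))


def bad_rate(group):
    return sum(bad for _, _, bad in group) / len(group)


def risk_difference(group):
    """Bad-outcome rate with high medication minus rate with low medication."""
    treated = [r for r in group if r[1]]
    control = [r for r in group if not r[1]]
    return bad_rate(treated) - bad_rate(control)


naive = risk_difference(rows)

# Stratify by severity, then average the within-stratum differences.
strata = defaultdict(list)
for r in rows:
    strata[r[0]].append(r)
adjusted = sum(len(g) / len(rows) * risk_difference(g) for g in strata.values())

print(f"naive risk difference:    {naive:+.2f}")
print(f"adjusted risk difference: {adjusted:+.2f}")
```

The naive comparison suggests medication raises risk; the severity-adjusted comparison recovers the assumed protective effect. Real adjustment would use a regression or matching on a measured score rather than three clean strata.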

How to Investigate Causation

Randomised Controlled Trial (RCT): gold standard
  Randomly assign subjects to treatment vs control
  Randomisation balances confounders (measured and unmeasured) across groups on average

  Limitation: often impractical, expensive, or unethical in medicine
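To see why randomisation matters, the sketch below simulates the same treatment under two assignment rules. The effect size, the severity model, and the function name `estimated_effect` are assumptions made for illustration.

```python
import random
import statistics

TRUE_EFFECT = 2.0  # assumed benefit of treatment on the outcome


def estimated_effect(randomised, n=20000, seed=3):
    """Difference in mean outcome, treated minus control, for one simulated study."""
    rng = random.Random(seed)
    treated_y, control_y = [], []
    for _ in range(n):
        severity = rng.gauss(0, 1)  # confounder: lowers the outcome
        if randomised:
            treated = rng.random() < 0.5  # coin flip, unrelated to severity
        else:
            treated = severity + rng.gauss(0, 1) > 0  # sicker patients get treated
        outcome = -3.0 * severity + TRUE_EFFECT * treated + rng.gauss(0, 1)
        (treated_y if treated else control_y).append(outcome)
    return statistics.fmean(treated_y) - statistics.fmean(control_y)


observational = estimated_effect(randomised=False)
rct = estimated_effect(randomised=True)

print(f"observational estimate: {observational:+.2f}")
print(f"randomised estimate:    {rct:+.2f}")
```

In the observational version the treatment appears harmful because treated patients were sicker to begin with; under random assignment the estimate lands close to the assumed true effect.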

Instrumental Variables:
  Find a variable that affects X, affects Y only through X, and is unrelated to the confounders
  Use it to isolate causal variation in X
  Example: distance to hospital as instrument for treatment (affects treatment access
  but is assumed not to affect the health outcome directly)
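A minimal sketch of the instrumental-variable idea, assuming a linear model with one instrument and one unmeasured confounder. The variable names and coefficients are hypothetical; the estimator used is the simple ratio cov(Z, Y) / cov(Z, X), sometimes called the Wald or IV ratio estimator.

```python
import random
import statistics


def cov(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(4)
TRUE_EFFECT = 2.0  # assumed causal effect of X on Y

z_vals, x_vals, y_vals = [], [], []
for _ in range(20000):
    z = rng.gauss(0, 1)  # instrument, e.g. standardised closeness to hospital
    u = rng.gauss(0, 1)  # unmeasured confounder (never used in estimation)
    x = 1.0 * z + 1.0 * u + rng.gauss(0, 1)          # treatment intensity
    y = TRUE_EFFECT * x - 3.0 * u + rng.gauss(0, 1)  # outcome; z enters only via x
    z_vals.append(z)
    x_vals.append(x)
    y_vals.append(y)

# Naive regression slope of Y on X is biased by the confounder u.
naive_slope = cov(x_vals, y_vals) / cov(x_vals, x_vals)
# IV ratio estimator uses only the variation in X that comes from Z.
iv_estimate = cov(z_vals, y_vals) / cov(z_vals, x_vals)

print(f"naive slope: {naive_slope:.2f}")
print(f"IV estimate: {iv_estimate:.2f}")
```

The IV estimate recovers the assumed effect only because the simulation satisfies the instrument conditions by construction; with real data those conditions have to be argued, not tested directly.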

Difference-in-Differences:
  Compare before/after in treated group vs control group over same period
  Removes time-constant confounders, assuming both groups would have followed parallel trends without treatment
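The difference-in-differences calculation is just arithmetic on four group means. The sketch below uses simulated outcomes; the baseline gap, shared trend, and effect size are assumptions, and parallel trends hold by construction.

```python
import random
import statistics

rng = random.Random(5)
TRUE_EFFECT = 1.5  # assumed effect of the intervention


def outcome(treated_group, after):
    """One simulated outcome for a unit in the given group and period."""
    level = 10.0 + (4.0 if treated_group else 0.0)  # time-constant group difference
    trend = 2.0 if after else 0.0                    # change shared by both groups
    effect = TRUE_EFFECT if (treated_group and after) else 0.0
    return level + trend + effect + rng.gauss(0, 1)


def group_mean(treated_group, after, n=5000):
    return statistics.fmean(outcome(treated_group, after) for _ in range(n))


treated_change = group_mean(True, True) - group_mean(True, False)
control_change = group_mean(False, True) - group_mean(False, False)
did = treated_change - control_change

print(f"before/after change in treated group: {treated_change:.2f}")
print(f"before/after change in control group: {control_change:.2f}")
print(f"difference-in-differences estimate:   {did:.2f}")
```

The simple before/after change in the treated group mixes the shared trend with the effect; subtracting the control group's change removes the trend and the fixed group gap.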

Propensity Score Matching:
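  In observational data, estimate each patient's probability of treatment from measured confounders
  Match treated and untreated patients with similar scores, then compare within matched pairs
  Limitation: adjusts only for confounders that were measured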
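A rough sketch of propensity score matching on simulated data. To stay dependency-free, the propensity model is a small hand-written logistic regression fitted by gradient descent; in practice a library model would normally be used. The confounders, coefficients, and function names are all hypothetical.

```python
import math
import random
import statistics

rng = random.Random(6)
TRUE_EFFECT = 1.0  # assumed treatment effect

# Hypothetical patients: (severity, age, treated, outcome), confounders standardised.
patients = []
for _ in range(2000):
    severity, age = rng.gauss(0, 1), rng.gauss(0, 1)
    p_treat = 1 / (1 + math.exp(-(0.8 * severity + 0.5 * age)))
    treated = rng.random() < p_treat
    outcome = TRUE_EFFECT * treated - 2.0 * severity - 1.0 * age + rng.gauss(0, 1)
    patients.append((severity, age, treated, outcome))


def fit_logistic(rows, steps=150, lr=2.0):
    """Fit P(treated | severity, age) by batch gradient descent on the log-loss."""
    w = [0.0, 0.0, 0.0]  # intercept, severity weight, age weight
    for _step in range(steps):
        grad = [0.0, 0.0, 0.0]
        for severity, age, treated, _y in rows:
            x = (1.0, severity, age)
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(3):
                grad[j] += (p - treated) * x[j]
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


w = fit_logistic(patients)


def propensity(severity, age):
    return 1 / (1 + math.exp(-(w[0] + w[1] * severity + w[2] * age)))


treated_group = [(propensity(s, a), y) for s, a, t, y in patients if t]
control_group = [(propensity(s, a), y) for s, a, t, y in patients if not t]

naive = (statistics.fmean(y for _, y in treated_group)
         - statistics.fmean(y for _, y in control_group))

# 1-nearest-neighbour matching with replacement on the propensity score.
pair_diffs = []
for score, y in treated_group:
    _, y_match = min(control_group, key=lambda c: abs(c[0] - score))
    pair_diffs.append(y - y_match)
att = statistics.fmean(pair_diffs)  # effect on the treated

print(f"naive difference: {naive:+.2f}")
print(f"matched estimate: {att:+.2f}")
```

Matching works here because both confounders are measured and included in the propensity model; an unmeasured confounder would leave the matched estimate biased.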

In Machine Learning

Python
# ML models learn correlations, not causation
# This has practical consequences:

# Example: A model predicts hospital readmission
# Feature: number_of_medications (highly correlated with readmission)
# 
# If a hospital reduces medications to lower predicted readmission:
#    Model accuracy may drop (distribution shift)
#    Patient outcomes may worsen (medications were there for a reason)
#    The intervention acted on a correlated feature, not a causal mechanism

# Causal ML approaches:
# 1. Include domain knowledge: don't use features that are caused by the outcome
# 2. Structural causal models: explicit DAG of causal relationships
# 3. Counterfactual reasoning: "what would have happened if we hadn't intervened?"

# Detecting confounding in feature importance:
# If a feature is important at the training hospital but not in a different hospital's data:
#   → It may be acting as a proxy for a confounder specific to the training site's population

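def check_feature_stability(
    model,
    X_site1: "pd.DataFrame",
    y_site1,
    X_site2: "pd.DataFrame",
    y_site2,
    feature_names: list[str],
) -> dict:
    """Compare feature importances across sites; instability suggests confounding."""
    import numpy as np
    from sklearn.base import clone

    # Fit a fresh copy of the same estimator at each site. Assumes a tree model
    # exposing feature_importances_; the y_site* labels are added so that the
    # per-site fits are possible (SHAP values at each site are an alternative).
    imp1 = clone(model).fit(X_site1[feature_names], y_site1).feature_importances_
    imp2 = clone(model).fit(X_site2[feature_names], y_site2).feature_importances_

    return {
        "features": feature_names,
        "site1_importance": imp1.tolist(),
        "site2_importance": imp2.tolist(),
        "abs_difference": np.abs(imp1 - imp2).tolist(),
        "note": "High importance in site1 but not site2 = possible confounder",
    }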

A/B Testing: Establishing Causation

The standard way to establish causal effect of an ML system change:

  A/B test = randomised experiment in software
  Randomly assign users to control (A) or treatment (B)
  Measure outcome difference
  
  "The new RAG system increased user satisfaction by 12%"
  → If measured as the B-vs-A difference, this IS a causal claim (randomised assignment balances confounders)

  "Users who clicked on recommendation X had higher engagement"
  → This is correlation (users who click on X may differ from those who don't)
  → Cannot claim X caused engagement without an A/B test
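Once users have been randomly assigned, the outcome difference still needs a significance check. The sketch below applies a standard two-proportion z-test; the user counts are invented for illustration, and `two_proportion_z_test` is a hypothetical helper rather than a library function.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical results: 1000 users per arm, "satisfied" counted as a success.
lift, z, p_value = two_proportion_z_test(
    success_a=400, n_a=1000,   # control (A)
    success_b=448, n_b=1000,   # treatment (B)
)

print(f"absolute lift: {lift:.3f}")
print(f"z statistic:   {z:.2f}")
print(f"p-value:       {p_value:.3f}")
```

Because assignment was random, a significant difference can be read as the effect of the change itself. The same test applied to self-selected groups, such as users who chose to click, would only describe a correlation.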

Always design experiments for causal inference when making product decisions.
Always treat ML model outputs as correlational unless validated with an experiment.

Interview Answer

"Correlation measures how two variables move together; causation means that changing one variable changes the other. The main reasons variables correlate without causation are a common cause (a third variable drives both), reverse causation (Y causes X, not vice versa), selection bias, and spurious coincidence. In clinical ML this matters critically: a model finding that medication count correlates with poor outcomes should not lead to reducing medications, because the confounder is illness severity, which drives both. Establishing causation requires randomisation (RCT or A/B test) or causal inference techniques (instrumental variables, propensity score matching, difference-in-differences), each of which rests on its own assumptions. ML models learn correlations; deploying them to intervene on a correlated but non-causal feature can cause harm."