Spurious Correlations — Statistics & Math for AI/ML Interviews | Learnixo

What Spurious Means

A spurious correlation is a statistical association between two variables that has no meaningful causal connection — they correlate by coincidence or because of a hidden factor.

Classic examples:
  Per capita cheese consumption vs deaths by bedsheet tangling (r=0.95)
  Nicolas Cage films released vs pool drownings (r≈0.67)
  
  These are real correlations in historical data.
  They are completely meaningless.

Why they occur:
  1. Small datasets: with n=20, sampling noise alone often produces sizable correlations
     (any |r| above about 0.44 already reaches p<0.05)
  2. Many variables tested: with 100 unrelated features, about 5 will correlate with the target at p<0.05 by chance
  3. Time series: variables that both trend over time will correlate
  4. Geographic confounders: variables that correlate because they're measured
     in the same region with a shared social or environmental factor
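
Reason 3 is easy to reproduce. The sketch below is a synthetic illustration, not data from any real study: it builds pairs of random walks from independent noise and compares the average |r| of the raw series with the average |r| of their first differences. The seed, series length, and trial count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n_steps, n_trials = 100, 500

abs_r_levels = []
abs_r_diffs = []
for _ in range(n_trials):
    # Two random walks built from independent noise: no link between them.
    a = np.cumsum(rng.normal(size=n_steps))
    b = np.cumsum(rng.normal(size=n_steps))
    abs_r_levels.append(abs(np.corrcoef(a, b)[0, 1]))
    # First differences recover the independent increments.
    abs_r_diffs.append(abs(np.corrcoef(np.diff(a), np.diff(b))[0, 1]))

mean_abs_r_levels = float(np.mean(abs_r_levels))
mean_abs_r_diffs = float(np.mean(abs_r_diffs))
print(f"mean |r|, raw levels:        {mean_abs_r_levels:.2f}")
print(f"mean |r|, first differences: {mean_abs_r_diffs:.2f}")
```

The raw series correlate far more strongly on average than their differences, which is why differencing or detrending is a common first step before correlating time series.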

Multiple Testing Problem

If you test 100 features for correlation with your target at p < 0.05:

Expected false positives = 100 × 0.05 = 5

Even if no feature is truly related to the target, about five will appear "significantly correlated" purely by chance.
This is why you need multiple testing correction.
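
A quick simulation can confirm the arithmetic above. This is a minimal sketch on synthetic data: every feature is pure noise, so each "significant" result is a false positive. The sample size, number of repeats, and seed are assumptions chosen only to keep the run fast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
n_samples, n_features, n_repeats, alpha = 50, 100, 50, 0.05

false_positive_counts = []
for _ in range(n_repeats):
    X = rng.normal(size=(n_samples, n_features))  # noise features
    y = rng.normal(size=n_samples)                # target independent of X
    count = 0
    for j in range(n_features):
        _, p = stats.pearsonr(X[:, j], y)
        if p < alpha:
            count += 1
    false_positive_counts.append(count)

mean_false_positives = float(np.mean(false_positive_counts))
print(f"average 'significant' noise features per run: {mean_false_positives:.2f}")
```

The average should land close to 100 × 0.05 = 5, although individual runs vary around it.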

Bonferroni correction:
  Use p < α/m instead of p < α
  With 100 tests and α=0.05: threshold = 0.05/100 = 0.0005
  
  Very conservative — may miss real correlations
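
To see what Bonferroni buys, it helps to compute the family-wise error rate (the chance of at least one false positive) with and without the correction. The sketch below assumes the 100 tests are independent and that every null hypothesis is true. That is a simplification: real features are usually correlated.

```python
alpha, m = 0.05, 100

# Probability of at least one false positive across m independent tests
# when every null hypothesis is true.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each hypothesis at alpha / m instead of alpha.
bonferroni_threshold = alpha / m
fwer_bonferroni = 1 - (1 - bonferroni_threshold) ** m

print(f"uncorrected FWER:     {fwer_uncorrected:.3f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print(f"Bonferroni FWER:      {fwer_bonferroni:.3f}")
```

Without correction a false positive is almost guaranteed. With the Bonferroni threshold the family-wise error rate stays just under α.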

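Benjamini-Hochberg (FDR correction):
  Controls the False Discovery Rate (the expected share of false positives among the features flagged)
  rather than the Family-Wise Error Rate (the probability of even one false positive)
  Less conservative, more practical for ML feature selection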
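
The Benjamini-Hochberg step-up rule is short enough to write out by hand, which makes the contrast with Bonferroni concrete. The function name `benjamini_hochberg` and the ten p-values below are hypothetical, chosen only for illustration. In practice a library routine such as statsmodels' `multipletests` is the safer choice.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of discoveries under the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Compare the k-th smallest p-value with its own threshold (k / m) * q.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()  # largest rank that passes
        reject[order[: k_max + 1]] = True   # reject it and every smaller p-value
    return reject

# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_mask = benjamini_hochberg(p_values, q=0.05)
bonferroni_mask = np.asarray(p_values) < 0.05 / len(p_values)

print("BH discoveries:        ", int(bh_mask.sum()))
print("Bonferroni discoveries:", int(bonferroni_mask.sum()))
```

On these values BH keeps the two smallest p-values while Bonferroni keeps only the smallest, a small instance of BH being the less conservative procedure.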

Python

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def test_all_features(
    X: np.ndarray,    # (n_samples, n_features)
    y: np.ndarray,    # (n_samples,)
    alpha: float = 0.05,
) -> list[dict]:
    """Correlate each feature with y, then apply Benjamini-Hochberg FDR correction."""
    results = []
    for i in range(X.shape[1]):
        r, p = stats.pearsonr(X[:, i], y)
        results.append({"feature": i, "r": float(r), "p_value": float(p)})

    # Benjamini-Hochberg correction across all features at once
    p_values = [res["p_value"] for res in results]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")

    for res, p_adj, is_significant in zip(results, p_adjusted, reject):
        res["p_adjusted"] = float(p_adj)
        res["significant_after_correction"] = bool(is_significant)

    # Smallest adjusted p-value first
    return sorted(results, key=lambda res: res["p_adjusted"])

In ML: Spurious Correlations Are Shortcuts

Neural networks can learn spurious correlations as shortcuts:

Classic pattern (chest X-ray classification, illustrative numbers):
  Model achieves 95% accuracy on pneumonia detection
  Discovers shortcut: many pneumonia X-rays come from portable (ICU) machines
  → Model learns "portable device" as a proxy for "sick patient"
  → On external dataset with different imaging mix, accuracy drops to 60%

Clinical examples of shortcut learning:
  Model sees "ordered many labs" → predicts sepsis
  (Lab ordering reflects a clinician's existing suspicion of sepsis: it records the care process, not the patient's physiology)

  Model sees "dictated clinical note in specific format" → predicts ICU admission
  (Certain physicians who dictate in that format work in the ICU)
  
  Model sees "patient seen on weekday" → predicts elective procedure
  (Weekend admissions are more likely to be emergencies, so day of week acts as a temporal confounder)
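
Shortcut learning can be reproduced on synthetic data. In the sketch below everything is an assumption made for illustration: the data generator, the 95% and 50% agreement rates, and the small gradient-descent logistic regression standing in for a real model. The "shortcut" feature agrees with the label 95% of the time at the training site but is uninformative at the external site.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def make_data(n, shortcut_agreement):
    """Binary label, one weak genuine signal, one shortcut feature."""
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(scale=1.5, size=n)           # weakly informative
    agrees = rng.random(n) < shortcut_agreement
    shortcut = np.where(agrees, y, 1 - y).astype(float)  # e.g. a "portable scanner" flag
    return np.column_stack([signal, shortcut]), y

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-descent logistic regression with an intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    return float(((Xb @ w > 0).astype(int) == y).mean())

X_train, y_train = make_data(5000, shortcut_agreement=0.95)
X_internal, y_internal = make_data(5000, shortcut_agreement=0.95)  # same site, held out
X_external, y_external = make_data(5000, shortcut_agreement=0.50)  # shortcut broken

w = fit_logistic(X_train, y_train)
acc_internal = accuracy(w, X_internal, y_internal)
acc_external = accuracy(w, X_external, y_external)
print(f"internal accuracy: {acc_internal:.2f}")
print(f"external accuracy: {acc_external:.2f}")
```

The held-out internal set looks fine because it shares the shortcut. Only the external set, where the shortcut no longer tracks the label, reveals how little the model learned from the genuine signal.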

How to Detect Spurious Correlations

Python

from sklearn.metrics import roc_auc_score

def evaluate_auc(model, X, y):
    # Assumed helper: model is a fitted scikit-learn-style binary classifier
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

# 1. External validation: different hospital, different time period
def evaluate_cross_site(model, site_a_data, site_b_data):
    # Each site's data is an (X, y) tuple
    auc_site_a = evaluate_auc(model, *site_a_data)
    auc_site_b = evaluate_auc(model, *site_b_data)

    if abs(auc_site_a - auc_site_b) > 0.05:
        print("Warning: large performance gap between sites, possible spurious correlation")

# 2. Feature removal analysis
def feature_ablation(model, X, y, feature_names):
    baseline_auc = evaluate_auc(model, X, y)
    for i, name in enumerate(feature_names):
        X_ablated = X.astype(float)  # float copy, so the mean is not truncated
        X_ablated[:, i] = X_ablated[:, i].mean()  # replace the feature with its mean
        ablated_auc = evaluate_auc(model, X_ablated, y)
        drop = baseline_auc - ablated_auc
        if drop > 0.10:
            print(f"Feature '{name}' accounts for an AUC drop of {drop:.3f} when removed: investigate why")

# 3. Subgroup analysis
def check_subgroup_performance(model, X, y, subgroup_mask):
    # subgroup_mask is a boolean array; each group must contain both classes
    subgroup_auc = evaluate_auc(model, X[subgroup_mask], y[subgroup_mask])
    other_auc = evaluate_auc(model, X[~subgroup_mask], y[~subgroup_mask])

    if abs(subgroup_auc - other_auc) > 0.10:
        print(f"Large performance gap: {subgroup_auc:.3f} vs {other_auc:.3f}, check for confounders")

Clinical Importance

A spurious correlation in a consumer recommendation system:
  → Sub-optimal recommendations, minor harm

A spurious correlation in a clinical risk score:
  → Wrong drug doses, wrong triage → patient harm

Red flags for spurious clinical correlations:
  Feature is a proxy for care process, not patient state
    (e.g., labs ordered, beds available, day of week)
  Feature correlates differently across patient subgroups
  Feature doesn't make biological/clinical sense
  Performance drops significantly when tested externally

Mitigation:
  Clinical expert review of top features before deployment
  External validation on held-out institution data
  Prospective validation before clinical use
  Regular post-deployment monitoring for performance drift
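
Post-deployment monitoring can start as simply as recomputing AUC per time period and flagging drops against a baseline. The sketch below is one possible shape for that check, not a prescribed method. The helper names `auc_from_scores` and `flag_drift`, the 0.90 baseline, the 0.05 tolerance, and the synthetic scores are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_scores(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    y_true = np.asarray(y_true).astype(bool)
    n_pos = int(y_true.sum())
    n_neg = int((~y_true).sum())
    ranks = rankdata(scores)
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def flag_drift(y_true, scores, period, baseline_auc, max_drop=0.05):
    """Return {period: auc} for periods whose AUC fell more than max_drop below baseline."""
    flagged = {}
    for p in np.unique(period):
        mask = period == p
        auc = auc_from_scores(y_true[mask], scores[mask])
        if baseline_auc - auc > max_drop:
            flagged[int(p)] = auc
    return flagged

# Synthetic monitoring log: the score is informative in periods 0 and 1,
# then loses its signal in period 2 (as if a shortcut stopped working).
rng = np.random.default_rng(1)
period = np.repeat([0, 1, 2], 1000)
y = rng.integers(0, 2, size=period.size)
scores = 2.0 * y + rng.normal(size=period.size)
late = period == 2
scores[late] = rng.normal(size=int(late.sum()))

flagged = flag_drift(y, scores, period, baseline_auc=0.90)
print(flagged)
```

Each period needs enough cases of both classes for its AUC to be stable, so very short windows tend to produce false alarms.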

Interview Answer

"Spurious correlations are statistical associations with no causal mechanism behind them. They arise from coincidence, small sample sizes, multiple testing without correction, or hidden confounders. In ML, the danger is shortcut learning: models pick up spurious features that correlate with the label in training data but fail at deployment. A clinical example is a chest X-ray model learning 'portable imaging device' as a proxy for 'ICU patient' rather than the actual pathology. For detection, external validation at a different site is the strongest test: an AUC drop of more than about 0.05 to 0.10 suggests the model relies on spurious features. Always apply multiple testing correction (such as Benjamini-Hochberg FDR) when screening many features, and confirm with domain experts that each important feature has a plausible causal mechanism."