Statistics & Math for AI/ML Interviews · Lesson 28 of 30
Spurious Correlations
What Spurious Means
A spurious correlation is a statistical association between two variables that has no meaningful causal connection — they correlate by coincidence or because of a hidden factor.
Classic examples:
Per capita cheese consumption vs deaths by bedsheet tangling (r=0.95)
Nicolas Cage film appearances per year vs pool drownings (r≈0.67)
These are real correlations in historical data.
They are completely meaningless.
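Correlations this strong are easy to manufacture. The sketch below is a synthetic illustration, not the data behind the examples above: it builds two unrelated yearly series that each drift upward and compares their correlation before and after removing the trend. The series names, slopes, and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(20)

# Two unrelated series: each is its own linear trend plus independent noise.
series_a = 2.0 * years + rng.normal(scale=3.0, size=years.size)
series_b = 0.5 * years + rng.normal(scale=1.0, size=years.size)

# Correlation of the raw levels is dominated by the shared upward drift.
r_levels = np.corrcoef(series_a, series_b)[0, 1]

# Year-over-year differences remove the trend and leave only the noise.
r_diffs = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"r on raw levels:       {r_levels:.2f}")
print(f"r on yearly changes:   {r_diffs:.2f}")
```

Comparing the two numbers is a quick sanity check for any pair of time series: if the correlation largely disappears after differencing, the trend was doing most of the work.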
Why they occur:
1. Small datasets: with n=20, any |r| above about 0.44 already reaches p<0.05, so noise alone regularly produces strong-looking correlations
2. Many variables tested: with 100 unrelated features, ~5 will correlate with the target at p<0.05 by chance
3. Time series: variables that both trend over time will correlate
4. Geographic confounders: variables that correlate because they're measured in the same region with a shared social or environmental factor
Multiple Testing Problem
If you test 100 features for correlation with your target at p < 0.05:
Expected false positives = 100 × 0.05 = 5
About five features will appear "significantly correlated" purely by chance, even if none has any real relationship with the target.
This is why you need multiple testing correction.
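The arithmetic above can be checked with a small simulation. This is a minimal sketch on pure noise: the sample size, seed, and variable names are arbitrary assumptions, and the exact count will vary from run to run around the expected value of 5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_samples, n_features = 200, 100

# Features and target are independent noise, so every "hit" is a false positive.
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

p_values = []
for j in range(n_features):
    r, p = stats.pearsonr(X[:, j], y)
    p_values.append(p)
p_values = np.array(p_values)

n_false_positives = int((p_values < 0.05).sum())
print(f"{n_false_positives} of {n_features} noise features pass p < 0.05")
```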
Bonferroni correction:
Use p < α/m instead of p < α
With 100 tests and α=0.05: threshold = 0.05/100 = 0.0005
Very conservative — may miss real correlations
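To make the trade-off concrete, the snippet below computes the chance of at least one false positive across all tests, with and without the Bonferroni threshold. It is a sketch that assumes the tests are independent and that every null hypothesis is true.

```python
alpha, m = 0.05, 100

# Probability of at least one false positive across m independent null tests.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each feature at alpha / m instead of alpha.
bonferroni_threshold = alpha / m
fwer_bonferroni = 1 - (1 - bonferroni_threshold) ** m

print(f"Uncorrected family-wise error rate: {fwer_uncorrected:.3f}")
print(f"Bonferroni per-test threshold:      {bonferroni_threshold:.4f}")
print(f"Bonferroni family-wise error rate:  {fwer_bonferroni:.3f}")
```

Without correction a false positive is almost guaranteed; with Bonferroni the family-wise rate stays just under α, at the cost of a much stricter bar for each individual feature.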
Benjamini-Hochberg (FDR correction):
Controls False Discovery Rate instead of Family-Wise Error Rate
Less conservative, more practical for ML feature selection

from scipy import stats
from statsmodels.stats.multitest import multipletests
import numpy as np

def test_all_features(
    X: np.ndarray,  # (n_samples, n_features)
    y: np.ndarray,  # (n_samples,)
) -> list[dict]:
    n_features = X.shape[1]
    results = []
    for i in range(n_features):
        r, p = stats.pearsonr(X[:, i], y)
        results.append({"feature": i, "r": r, "p_value": p})

    # Benjamini-Hochberg correction
    p_values = [r["p_value"] for r in results]
    _, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")
    for i, result in enumerate(results):
        result["p_adjusted"] = p_adjusted[i]
        result["significant_after_correction"] = p_adjusted[i] < 0.05

    return sorted(results, key=lambda x: x["p_adjusted"])

In ML: Spurious Correlations Are Shortcuts
Neural networks can learn spurious correlations as shortcuts:
Classic example (chest X-ray classification):
Model achieves 95% accuracy on pneumonia detection
Discovers shortcut: many pneumonia X-rays come from portable (ICU) machines
→ Model learns "portable device" as a proxy for "sick patient"
→ On external dataset with different imaging mix, accuracy drops to 60%
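The same failure pattern can be reproduced on synthetic data. The sketch below is a toy illustration, not the data behind the 95% and 60% figures above: the make_site helper, the "pathology" and "portable" features, and all probabilities are hypothetical choices, and it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_site(n, shortcut_agreement):
    """Simulate one site; shortcut_agreement is P(portable flag == label)."""
    y = rng.integers(0, 2, size=n)
    # Weak but genuine signal, standing in for the pathology visible on the image.
    pathology = y + rng.normal(scale=2.0, size=n)
    # Shortcut feature: a "portable scanner" flag that matches the label
    # with the given probability.
    agrees = rng.random(n) < shortcut_agreement
    portable = np.where(agrees, y, 1 - y)
    return np.column_stack([pathology, portable]), y

# Training and internal test data share the shortcut; the external site does not.
X_train, y_train = make_site(5000, shortcut_agreement=0.95)
X_internal, y_internal = make_site(2000, shortcut_agreement=0.95)
X_external, y_external = make_site(2000, shortcut_agreement=0.50)

model = LogisticRegression().fit(X_train, y_train)
acc_internal = model.score(X_internal, y_internal)
acc_external = model.score(X_external, y_external)

print(f"Internal accuracy: {acc_internal:.2f}")
print(f"External accuracy: {acc_external:.2f}")
print(f"Coefficients [pathology, portable]: {model.coef_[0].round(2)}")
```

The model leans almost entirely on the portable flag because it is the easier signal in training, so accuracy collapses toward chance once the flag stops tracking the label.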
Clinical examples of shortcut learning:
Model sees "ordered many labs" → predicts sepsis
(Lab ordering reflects the clinician's existing suspicion of sepsis, not the patient's physiology)
Model sees "dictated clinical note in specific format" → predicts ICU admission
(Certain physicians who dictate in that format work in the ICU)
Model sees "patient seen on weekday" → predicts elective procedure
(Weekend admissions are more likely to be emergencies, so day of week acts as a temporal confounder)
How to Detect Spurious Correlations
from sklearn.metrics import roc_auc_score

# Helper assumed by the checks below (not defined in the original lesson):
# expects a scikit-learn-style binary classifier with predict_proba.
def evaluate_auc(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

# 1. External validation: different hospital, different time period
# Each site's data is assumed to be an (X, y) tuple.
def evaluate_cross_site(model, site_a_data, site_b_data):
    auc_site_a = evaluate_auc(model, *site_a_data)
    auc_site_b = evaluate_auc(model, *site_b_data)
    if abs(auc_site_a - auc_site_b) > 0.05:
        print("Warning: large performance gap between sites (possible spurious correlation)")

# 2. Feature removal analysis
def feature_ablation(model, X, y, feature_names):
    baseline_auc = evaluate_auc(model, X, y)
    for i, name in enumerate(feature_names):
        X_ablated = X.copy()
        X_ablated[:, i] = X_ablated[:, i].mean()  # replace with mean
        ablated_auc = evaluate_auc(model, X_ablated, y)
        drop = baseline_auc - ablated_auc
        if drop > 0.10:
            print(f"Feature '{name}' drives large performance gain: investigate why")

# 3. Subgroup analysis
def check_subgroup_performance(model, X, y, subgroup_mask):
    subgroup_auc = evaluate_auc(model, X[subgroup_mask], y[subgroup_mask])
    other_auc = evaluate_auc(model, X[~subgroup_mask], y[~subgroup_mask])
    if abs(subgroup_auc - other_auc) > 0.10:
        print(f"Large performance gap: {subgroup_auc:.3f} vs {other_auc:.3f} (check for confounders)")

Clinical Importance
A spurious correlation in a consumer recommendation system:
→ Sub-optimal recommendations, minor harm
A spurious correlation in a clinical risk score:
→ Wrong drug doses, wrong triage → patient harm
Red flags for spurious clinical correlations:
Feature is a proxy for care process, not patient state
(e.g., labs ordered, beds available, day of week)
Feature correlates differently across patient subgroups
Feature doesn't make biological/clinical sense
Performance drops significantly when tested externally
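The second red flag, a feature that correlates differently across subgroups, can be checked directly. The sketch below uses a hypothetical helper, correlation_by_subgroup, on synthetic data in which the feature tracks the label in group "A" only; the group names and noise level are illustrative assumptions.

```python
import numpy as np

def correlation_by_subgroup(feature, label, subgroup):
    """Return {subgroup value: Pearson r between feature and label in that subgroup}."""
    feature, label, subgroup = map(np.asarray, (feature, label, subgroup))
    result = {}
    for g in np.unique(subgroup):
        mask = subgroup == g
        result[g.item()] = float(np.corrcoef(feature[mask], label[mask])[0, 1])
    return result

# Synthetic data: the feature carries label information in group "A" only.
rng = np.random.default_rng(1)
n = 1000
subgroup = np.where(rng.random(n) < 0.5, "A", "B")
label = rng.integers(0, 2, size=n)
feature = np.where(subgroup == "A", label, 0) + rng.normal(scale=0.5, size=n)

by_group = correlation_by_subgroup(feature, label, subgroup)
for group, r in by_group.items():
    print(f"group {group}: r = {r:.2f}")
```

A feature whose correlation with the label is strong in one subgroup and near zero (or reversed) in another deserves a closer look before it is trusted as a predictor.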
Mitigation:
Clinical expert review of top features before deployment
External validation on held-out institution data
Prospective validation before clinical use
Regular post-deployment monitoring for performance drift
Interview Answer
"Spurious correlations are statistical associations with no causal mechanism. They arise from coincidence, small sample sizes, multiple testing without correction, or hidden confounders. In ML, the danger is shortcut learning: models discover spurious features that correlate with the label in training data but fail at deployment. Examples in clinical ML include models learning 'portable imaging device' as a proxy for 'ICU patient' rather than the actual pathology. Detection: external validation at a different site is the gold standard test, and an AUC drop of more than about 0.05 to 0.10 suggests spurious features. Always apply multiple testing correction (Benjamini-Hochberg FDR) when evaluating many features, and validate with domain experts that each important feature has a plausible causal mechanism."