Correlation — Statistics & Math for AI/ML Interviews | Learnixo

What Correlation Measures

Correlation quantifies the strength and direction of the linear relationship between two variables:

r = +1.0: perfect positive linear relationship
           As X increases, Y increases proportionally

r = 0.0:  no linear relationship
           X tells you nothing about Y (linearly)

r = -1.0: perfect negative linear relationship
           As X increases, Y decreases proportionally

r ∈ [-1, 1]. In real data r falls strictly between the extremes; exactly ±1 occurs only when Y is an exact linear function of X.
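
The "(linearly)" qualifier on r = 0 matters. As a minimal sketch with made-up data, the snippet below builds a y that is fully determined by x and still has a correlation of essentially zero, because the relationship is symmetric rather than linear.

```python
import numpy as np

# Symmetric x values, so the negative and positive halves cancel out
x = np.arange(-5, 6)
y = x ** 2  # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # effectively zero despite the exact dependence
```

A near-zero r therefore rules out a linear trend only, not a relationship in general.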

Pearson Correlation

Measures linear relationship between two continuous variables.

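r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / [n × σₓ × σᵧ]

Here σₓ and σᵧ are population standard deviations (dividing by n).
With sample standard deviations (dividing by n - 1), replace n with n - 1; r is the same either way.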

Assumptions:
  Both variables are continuous
  Relationship is approximately linear
  No severe outliers (outliers distort r)
  Both roughly normally distributed (for significance testing)

Python

import numpy as np
from scipy import stats
import pandas as pd

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Pearson correlation
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# NumPy
r_matrix = np.corrcoef(x, y)  # 2×2 correlation matrix
r = r_matrix[0, 1]

# DataFrame
df = pd.DataFrame({"x": x, "y": y, "z": [5, 3, 2, 1, 4]})
print(df.corr(method="pearson"))  # correlation matrix
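
The Pearson formula can be checked by hand against the library call. This is a small sketch using the same x and y as above; the variable names `r_manual`, `r_sample`, and `r_numpy` are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

n = len(x)
numerator = np.sum((x - x.mean()) * (y - y.mean()))

# np.std defaults to ddof=0 (population std), which pairs with n in the denominator
r_manual = numerator / (n * x.std() * y.std())

# With sample std (ddof=1), n is replaced by n - 1; the ratio is unchanged
r_sample = numerator / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual = {r_manual:.4f}, sample = {r_sample:.4f}, numpy = {r_numpy:.4f}")
```

All three values should agree, which shows that the choice of n or n - 1 does not affect r as long as it is applied consistently.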

Spearman Rank Correlation

Measures the monotonic relationship between two variables (not necessarily linear). More robust to outliers and non-normality than Pearson.

Spearman ρ = Pearson correlation of the RANKS of x and y

Convert values to ranks, then compute Pearson on the ranks.
Captures: monotonic relationships (as X increases, Y tends to increase)
Misses: non-monotonic relationships (U-shaped, etc.)

Python

# Spearman — when data is ordinal, or when Pearson assumptions are violated
rho, p_value = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p_value:.4f}")

# Example where Spearman > Pearson (non-linear but monotonic)
x_nonlinear = np.array([1, 2, 3, 4, 5])
y_quadratic = np.array([1, 4, 9, 16, 25])  # x² (quadratic, not exponential)

r_pearson, _ = stats.pearsonr(x_nonlinear, y_quadratic)
r_spearman, _ = stats.spearmanr(x_nonlinear, y_quadratic)
print(f"Pearson: {r_pearson:.3f}")   # 0.981: high but not 1.0 (non-linear)
print(f"Spearman: {r_spearman:.3f}") # 1.000: perfect monotonic relationship
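
The definition "Pearson on the ranks" and the robustness claim can both be checked directly. The sketch below uses made-up data with a single extreme value; `rho_manual` is an illustrative name, and `stats.rankdata` is used for the ranking step.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = x.copy()
y[-1] = 100.0  # one extreme value; the ordering of y is unchanged

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)

# Spearman by hand: rank both variables, then apply Pearson to the ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(f"Pearson r = {r:.3f}")
print(f"Spearman rho = {rho:.3f}, via ranks = {rho_manual:.3f}")
```

The outlier pulls Pearson well below 1, while Spearman is unaffected because the ranks of y did not change.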

Interpreting Correlation Strength

|r|         | Interpretation
------------|-----------------------------
0.00–0.10   | Negligible (little or no linear relationship)
0.10–0.30   | Weak
0.30–0.50   | Moderate
0.50–0.70   | Strong
0.70–0.90   | Very strong
0.90–1.00   | Nearly perfect

Context matters:
  r=0.3 in social science: meaningful
  r=0.3 in physics: very weak (physical laws are near-perfect)
  r=0.3 in medical prediction: potentially useful for a feature
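
For reporting, the rule-of-thumb bands above can be wrapped in a small helper. This is only a sketch: `describe_correlation` is a hypothetical name, and treating each band as half-open [low, high) is an assumption, since the table's ranges share endpoints.

```python
import bisect

# Upper bounds and labels copied from the table above.
# Assumption: a value on a shared boundary (e.g. 0.30) falls into the higher band.
_BOUNDS = [0.10, 0.30, 0.50, 0.70, 0.90]
_LABELS = ["Negligible", "Weak", "Moderate", "Strong", "Very strong", "Nearly perfect"]


def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to its rule-of-thumb label, ignoring sign."""
    if not -1.0 <= r <= 1.0:  # also rejects NaN
        raise ValueError(f"correlation must lie in [-1, 1], got {r}")
    return _LABELS[bisect.bisect_right(_BOUNDS, abs(r))]


for value in (0.05, 0.30, -0.75, 0.95):
    print(f"{value:+.2f} -> {describe_correlation(value)}")
```

The labels stay a convention, not a test of importance; the domain context listed above still decides whether a given r is useful.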

Feature Selection with Correlation

Python

# Find highly correlated features (potential multicollinearity)
def get_high_correlation_pairs(
    df: pd.DataFrame,
    threshold: float = 0.9,
) -> list[tuple]:
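    # numeric_only=True skips non-numeric columns, which otherwise raise in pandas 2.x
    corr_matrix = df.corr(method="pearson", numeric_only=True).abs()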
    
    # Upper triangle only (avoid duplicates)
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr_matrix.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], corr_matrix.iloc[i, j]))
    
    return sorted(pairs, key=lambda x: -x[2])


# Filter features by correlation with target
def select_features_by_correlation(
    X: pd.DataFrame,
    y: pd.Series,
    min_correlation: float = 0.1,
    method: str = "spearman",
) -> list[str]:
    correlations = X.apply(lambda col: col.corr(y, method=method)).abs()
    return correlations[correlations >= min_correlation].index.tolist()


# Correlation heatmap (for exploration)
import matplotlib.pyplot as plt
import seaborn as sns

def plot_correlation_matrix(df: pd.DataFrame) -> None:
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        df.corr(numeric_only=True),  # skip non-numeric columns
        annot=True, fmt=".2f",
        cmap="RdBu_r", center=0,
        vmin=-1, vmax=1,
    )
    plt.title("Feature Correlation Matrix")
    plt.tight_layout()
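
As a usage sketch on synthetic data, the same two ideas (redundant feature pairs and a target-correlation filter) can also be written without explicit loops. The column names, noise levels, and thresholds here are invented for illustration; this is a vectorised variant, not a replacement for the functions above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),  # near-duplicate of "a"
    "b": rng.normal(size=n),                       # independent noise
})
target = pd.Series(2 * a + rng.normal(scale=0.5, size=n), name="target")

# Highly correlated pairs: keep only the strict upper triangle, then flatten
corr = df.corr(method="pearson").abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs >= 0.9].sort_values(ascending=False)
print(high)

# Filter by absolute Spearman correlation with the target
target_corr = df.corrwith(target, method="spearman").abs()
selected = target_corr[target_corr >= 0.1].index.tolist()
print(selected)
```

Only the ("a", "a_copy") pair should be flagged as redundant. Whether the pure-noise column "b" passes the 0.1 target filter depends on the random draw, which is a reminder that a low threshold lets noise through on small samples.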

Multicollinearity in Models

High correlation between input features → multicollinearity

Effect on linear models:
  Coefficients become unstable (high variance)
  Hard to interpret which feature drives the prediction

Effect on tree models:
  Less severe — trees can handle correlated features
  But feature importance becomes unreliable

Detection: variance inflation factor (VIF)
  VIF = 1/(1 - R²) where R² is from regressing that feature on all others
  VIF > 5: concerning; VIF > 10: serious multicollinearity

Fix:
  Remove one of the highly correlated features
  PCA to orthogonalise features
  Ridge regression (L2 regularisation handles multicollinearity well)
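
The VIF definition above can be computed directly with NumPy. The following is a minimal sketch on synthetic data; `compute_vif` is a hypothetical helper written for this example (an ordinary least-squares fit with an intercept), not a library function.

```python
import numpy as np
import pandas as pd


def compute_vif(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R²), with R² from regressing it on all other columns."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
features = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=500),  # almost a copy of x1
    "x3": rng.normal(size=500),                  # unrelated feature
})
vif = compute_vif(features)
print(vif.round(2))
```

The two near-duplicate columns should show VIF values far above 10, while the unrelated column stays close to 1. Dropping either x1 or x2 and recomputing should bring the remaining values back near 1.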

Interview Answer

"Pearson correlation measures the linear relationship between two continuous variables (r from -1 to +1). Spearman correlation measures monotonic relationships using ranks, so it is more robust to outliers and non-linearity. In ML, I use correlation for feature selection (filtering out features with near-zero correlation to the target), for detecting multicollinearity (highly correlated feature pairs can make linear model coefficients unstable), and for exploratory data analysis. The critical caveat: correlation ≠ causation. A strong correlation between two features might be because they share a common cause, not because one drives the other."