Statistics & Math for AI/ML Interviews · Lesson 25 of 30
Correlation
What Correlation Measures
Correlation quantifies the strength and direction of the linear relationship between two variables:
r = +1.0: perfect positive linear relationship
As X increases, Y increases proportionally
r = 0.0: no linear relationship
X tells you nothing about Y (linearly)
r = -1.0: perfect negative linear relationship
As X increases, Y decreases proportionally
r ∈ (-1, 1) in practice; exactly ±1 only in deterministic relationships
Pearson Correlation
Measures linear relationship between two continuous variables.
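r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / [n × σₓ × σᵧ]
Here σₓ and σᵧ are population standard deviations (dividing by n). With sample standard deviations (dividing by n - 1), replace n with n - 1 in the denominator.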
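As a quick check of the formula, the sketch below computes r by hand with NumPy and compares it with np.corrcoef. It assumes σₓ and σᵧ are population standard deviations (ddof=0, which is the np.std default); the helper name pearson_manual is illustrative only.

```python
import numpy as np

def pearson_manual(x, y):
    """Pearson r from the definition, using population standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Numerator: sum of products of deviations from the means
    cov_sum = np.sum((x - x.mean()) * (y - y.mean()))
    # Denominator: n * sigma_x * sigma_y (np.std uses ddof=0 by default)
    return cov_sum / (n * x.std() * y.std())

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r_manual = pearson_manual(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual: {r_manual:.4f}, np.corrcoef: {r_numpy:.4f}")
# manual: 0.7746, np.corrcoef: 0.7746
```

Both routes give the same value because the n (or n - 1) factors cancel as long as the covariance and the standard deviations use the same convention.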
Assumptions:
Both variables are continuous
Relationship is approximately linear
No severe outliers (outliers distort r)
Both roughly normally distributed (for significance testing)
import numpy as np
from scipy import stats
import pandas as pd
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Pearson correlation
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
# NumPy
r_matrix = np.corrcoef(x, y) # 2×2 correlation matrix
r = r_matrix[0, 1]
# DataFrame
df = pd.DataFrame({"x": x, "y": y, "z": [5, 3, 2, 1, 4]})
print(df.corr(method="pearson")) # correlation matrix
Spearman Rank Correlation
Measures monotonic relationship (not necessarily linear). Robust to outliers and non-normality.
Spearman ρ = Pearson correlation of the RANKS of x and y
Convert values to ranks, then compute Pearson on the ranks.
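The two-step definition can be checked directly. The sketch below ranks made-up illustrative data with scipy.stats.rankdata, runs Pearson on the ranks, and compares the result with scipy.stats.spearmanr.

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the lesson): monotonic except for one swap
x = np.array([10, 20, 30, 40, 50])
y = np.array([1.2, 3.5, 3.1, 8.0, 20.0])

# Step 1: convert values to ranks (ties would receive average ranks)
x_ranks = stats.rankdata(x)
y_ranks = stats.rankdata(y)

# Step 2: Pearson correlation of the ranks
rho_from_ranks, _ = stats.pearsonr(x_ranks, y_ranks)

# Reference: SciPy's built-in Spearman
rho_scipy, _ = stats.spearmanr(x, y)

print(f"Pearson on ranks: {rho_from_ranks:.3f}")  # 0.900
print(f"stats.spearmanr:  {rho_scipy:.3f}")       # 0.900
```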
Captures: monotonic relationships (as X increases, Y tends to increase)
Misses: non-monotonic relationships (U-shaped, etc.)
# Spearman: use when data is ordinal, or when Pearson assumptions are violated
rho, p_value = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p_value:.4f}")
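To make the robustness claim concrete, here is a small sketch on synthetic data: a perfectly linear series, and a copy in which the last value is replaced by an extreme outlier in the same direction. The data and the value 200 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: a clean increasing linear trend
x = np.arange(1, 11, dtype=float)
y = 2.0 * x + 1.0

# Copy with the last y value replaced by an extreme outlier
y_outlier = y.copy()
y_outlier[-1] = 200.0

results = {}
for label, y_vals in [("clean", y), ("with outlier", y_outlier)]:
    r, _ = stats.pearsonr(x, y_vals)
    rho, _ = stats.spearmanr(x, y_vals)
    results[label] = (r, rho)
    print(f"{label:>12}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

The outlier leaves the rank order intact, so Spearman stays at 1 while Pearson drops noticeably. An outlier that changes the rank order would move Spearman too, though usually by less than it moves Pearson.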
# Example where Spearman > Pearson (non-linear but monotonic)
x_nonlinear = np.array([1, 2, 3, 4, 5])
y_quadratic = np.array([1, 4, 9, 16, 25]) # x²
r_pearson, _ = stats.pearsonr(x_nonlinear, y_quadratic)
r_spearman, _ = stats.spearmanr(x_nonlinear, y_quadratic)
print(f"Pearson: {r_pearson:.3f}") # 0.981: high but not 1.0 (non-linear)
print(f"Spearman: {r_spearman:.3f}") # 1.000: perfect monotonic relationship
Interpreting Correlation Strength
|r| | Interpretation
------------|-----------------------------
0.00–0.10 | Negligible (no relationship)
0.10–0.30 | Weak
0.30–0.50 | Moderate
0.50–0.70 | Strong
0.70–0.90 | Very strong
0.90–1.00 | Nearly perfect
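The table above can be turned into a small lookup helper. This is only a sketch: the function name describe_strength is hypothetical, and because the bands in the table share their boundary values, the code assumes a boundary value (for example exactly 0.30) belongs to the higher band.

```python
def describe_strength(r: float) -> str:
    """Map |r| to the rule-of-thumb labels from the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    magnitude = abs(r)
    # (exclusive upper bound, label); boundary values fall into the higher band
    bands = [
        (0.10, "Negligible"),
        (0.30, "Weak"),
        (0.50, "Moderate"),
        (0.70, "Strong"),
        (0.90, "Very strong"),
    ]
    for upper, label in bands:
        if magnitude < upper:
            return label
    return "Nearly perfect"

for r in (0.05, -0.25, 0.62, -0.95):
    print(f"r = {r:+.2f}: {describe_strength(r)}")
```

The sign is ignored on purpose: the labels describe strength, while the sign of r carries the direction.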
Context matters:
r=0.3 in social science: meaningful
r=0.3 in physics: very weak (physical laws are near-perfect)
r=0.3 in medical prediction: potentially useful for a feature
Feature Selection with Correlation
# Find highly correlated features (potential multicollinearity)
def get_high_correlation_pairs(
    df: pd.DataFrame,
    threshold: float = 0.9,
) -> list[tuple]:
    corr_matrix = df.corr(method="pearson").abs()
    # Upper triangle only (avoid duplicates)
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr_matrix.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], corr_matrix.iloc[i, j]))
    return sorted(pairs, key=lambda x: -x[2])
# Filter features by correlation with target
def select_features_by_correlation(
    X: pd.DataFrame,
    y: pd.Series,
    min_correlation: float = 0.1,
    method: str = "spearman",
) -> list[str]:
    correlations = X.apply(lambda col: col.corr(y, method=method)).abs()
    return correlations[correlations >= min_correlation].index.tolist()
# Correlation heatmap (for exploration)
import matplotlib.pyplot as plt
import seaborn as sns
def plot_correlation_matrix(df: pd.DataFrame) -> None:
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        df.corr(),
        annot=True, fmt=".2f",
        cmap="RdBu_r", center=0,
        vmin=-1, vmax=1,
    )
    plt.title("Feature Correlation Matrix")
    plt.tight_layout()
Multicollinearity in Models
High correlation between input features → multicollinearity
Effect on linear models:
Coefficients become unstable (high variance)
Hard to interpret which feature drives the prediction
Effect on tree models:
Less severe — trees can handle correlated features
But feature importance becomes unreliable
Detection: variance inflation factor (VIF)
VIF = 1/(1 - R²) where R² is from regressing that feature on all others
VIF > 5: concerning; VIF > 10: serious multicollinearity
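The VIF definition can be sketched with NumPy alone: regress each feature on the others (with an intercept), take R², and apply 1/(1 - R²). The helper name compute_vif and the synthetic data are assumptions for illustration, not part of any library.

```python
import numpy as np
import pandas as pd

def compute_vif(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R²) from regressing it on all other columns."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column so R² is measured around the mean
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[col] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Synthetic data: x2 is almost a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = compute_vif(X)
print(vif.round(2))
```

With this setup x1 and x2 should land far above the VIF > 10 threshold, while x3 should stay close to 1.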
Fix:
Remove one of the highly correlated features
PCA to orthogonalise features
Ridge regression (L2 regularisation handles multicollinearity well)
Interview Answer
"Pearson correlation measures linear relationships between continuous variables (r from -1 to +1). Spearman correlation measures monotonic relationships using ranks — more robust to outliers and non-linearity. In ML: I use correlation for feature selection (filter features with near-zero correlation to the target), for detecting multicollinearity (highly correlated feature pairs can make linear model coefficients unstable), and for exploratory data analysis. The critical caveat: correlation ≠ causation. A strong correlation between two features might be because they share a common cause, not because one drives the other."