Pearson Correlation Deep Dive
The mathematical derivation of Pearson correlation, its assumptions, when it fails, and how it connects to linear regression and cosine similarity.
The Formula
r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]
Equivalently:
r = Cov(X, Y) / (σₓ × σᵧ)
Where:
Cov(X, Y) = (1/n) × Σ[(xᵢ - x̄)(yᵢ - ȳ)] (population)
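σₓ = standard deviation of X (population, 1/n)
σᵧ = standard deviation of Y (population, 1/n)
The 1/n factors cancel in the ratio, so r is the same with sample (1/(n-1)) estimates, as long as the covariance and both standard deviations use the same convention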
r is Cov(X,Y) normalised to [-1, 1]
Why the Formula Works
(xᵢ - x̄)(yᵢ - ȳ) is positive when:
both xᵢ and yᵢ are above their means → positive × positive = positive
both xᵢ and yᵢ are below their means → negative × negative = positive
(xᵢ - x̄)(yᵢ - ȳ) is negative when:
xᵢ above mean, yᵢ below mean → positive × negative = negative
xᵢ below mean, yᵢ above mean → negative × positive = negative
Sum of these products:
Mostly positive → positive correlation
Mostly negative → negative correlation
Mix of both → near zero
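The sign argument above can be checked numerically. What follows is a minimal sketch on a small made-up dataset; the variable names (dx, dy, products) are chosen for this illustration only.

```python
import numpy as np

# Made-up data for illustration
x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([1, 3, 2, 5], dtype=float)

dx = x - x.mean()      # deviations of x from its mean
dy = y - y.mean()      # deviations of y from its mean
products = dx * dy     # one cross-product per data point

for xi, yi, p in zip(x, y, products):
    print(f"x={xi:.0f}, y={yi:.0f}, product={p:+.3f}")

# The sign of the summed products is the sign of r
total = products.sum()
r = total / np.sqrt((dx**2).sum() * (dy**2).sum())
print(f"sum of products = {total:.3f}, r = {r:.3f}")
```

In this data two of the four products are negative, but the positive ones dominate, so both the sum and r come out positive.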
Dividing by σₓ × σᵧ normalises to [-1, 1] regardless of scale
Covariance vs Correlation
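To make the scale dependence concrete, here is a small sketch that rescales x by a factor of 100. The metres-to-centimetres framing is a hypothetical example, not something taken from a real dataset.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Hypothetical unit change: metres -> centimetres
x_cm = x * 100

cov_m = np.cov(x, y, ddof=1)[0, 1]
cov_cm = np.cov(x_cm, y, ddof=1)[0, 1]
r_m = np.corrcoef(x, y)[0, 1]
r_cm = np.corrcoef(x_cm, y)[0, 1]

print(f"Cov: {cov_m:.3f} -> {cov_cm:.3f}")  # covariance grows by the scale factor
print(f"r:   {r_m:.3f} -> {r_cm:.3f}")      # correlation is unchanged
```

Covariance picks up the factor of 100, while r stays the same because the same factor appears in σₓ and cancels.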
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
# Covariance (scale-dependent)
cov_xy = np.cov(x, y, ddof=1)[0, 1] # sample covariance
print(f"Cov(X,Y) = {cov_xy:.3f}")
# Correlation (scale-free, -1 to 1)
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
# Manual:
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered**2) * np.sum(y_centered**2)
)
print(f"Manual r = {r_manual:.3f}")
# Covariance matrix (for multiple features)
X = np.stack([x, y], axis=1) # shape (5, 2)
cov_matrix = np.cov(X.T, ddof=1) # shape (2, 2)
# Diagonal: variances; off-diagonal: covariances
Pearson and Cosine Similarity
Pearson correlation = cosine similarity of MEAN-CENTRED vectors
Cosine similarity: cos(a, b) = (a · b) / (‖a‖ × ‖b‖)
Pearson r = cos(x - x̄, y - ȳ)
This is why:
Cosine on raw vectors → measures the angle between them (ignores magnitude, but changes when a constant is added to a vector)
Pearson on raw values → measures linear relationship (subtracts the mean first, so it ignores both scale and offset)
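The identity can be verified directly. This is a minimal sketch; cosine_similarity is a helper defined here for illustration, not a library function.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

r = np.corrcoef(x, y)[0, 1]
cos_raw = cosine_similarity(x, y)
cos_centred = cosine_similarity(x - x.mean(), y - y.mean())

# Adding a constant to x changes the raw cosine but not Pearson r
cos_shifted = cosine_similarity(x + 100, y)
r_shifted = np.corrcoef(x + 100, y)[0, 1]

print(f"Pearson r          = {r:.4f}")
print(f"cosine (centred)   = {cos_centred:.4f}")
print(f"cosine (raw)       = {cos_raw:.4f}")
print(f"cosine (x + 100)   = {cos_shifted:.4f}")
print(f"Pearson r (x+100)  = {r_shifted:.4f}")
```

The centred cosine matches r, while the raw cosine differs and moves again once an offset is added.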
In NLP/RAG:
Cosine on embedding vectors (NOT mean-centred by default)
→ measures semantic similarity in embedding space
In statistics:
Pearson on data values → measures linear relationship between variables
Pearson's Assumptions and When They Fail
Assumption 1: Both variables are continuous
Violated: correlation between a binary categorical variable and a continuous one → use point-biserial r
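For a binary variable coded as 0/1, the point-biserial coefficient is numerically the same as Pearson r. The sketch below illustrates this with made-up data; the names group and score are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up data: a binary group label (0/1) and a continuous score
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score = np.array([2.1, 3.4, 2.9, 3.0, 4.8, 5.1, 3.9, 5.5])

r_pb, p_pb = stats.pointbiserialr(group, score)
r_pearson, p_pearson = stats.pearsonr(group, score)

print(f"point-biserial r = {r_pb:.3f}")
print(f"Pearson r (0/1)  = {r_pearson:.3f}")
```

The dedicated name mainly signals that one variable is dichotomous; a variable with more than two categories needs a different approach.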
Assumption 2: Linear relationship
Violated: y = x² (curved)
r can be 0 even for a perfect quadratic relationship (e.g. when x is symmetric around 0)
Fix: transform variables (log, sqrt), or use Spearman if the relationship is monotonic
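A short sketch of the linearity failure, assuming x values placed symmetrically around zero. It also compares Spearman, which only helps when the relationship is monotonic.

```python
import numpy as np
from scipy import stats

# x values symmetric around zero
x = np.linspace(-3, 3, 13)

# Perfect quadratic relationship: non-linear AND non-monotonic
y_quad = x**2
r_quad = np.corrcoef(x, y_quad)[0, 1]
rho_quad, _ = stats.spearmanr(x, y_quad)

# Exponential relationship: non-linear but monotonic
y_exp = np.exp(x)
r_exp = np.corrcoef(x, y_exp)[0, 1]
rho_exp, _ = stats.spearmanr(x, y_exp)

print(f"quadratic:   Pearson = {r_quad:.3f}, Spearman = {rho_quad:.3f}")
print(f"exponential: Pearson = {r_exp:.3f}, Spearman = {rho_exp:.3f}")
```

For the quadratic case both coefficients are essentially zero, because the relationship is not monotonic. For the exponential case Spearman is 1 while Pearson falls short of 1.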
Assumption 3: No severe outliers
One outlier can dramatically change r
Example:
Without outlier: r = 0.12 (weak)
With one extreme point added: r = 0.85 (strong but artificial)
Fix: use Spearman, or remove confirmed outliers
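The effect of a single extreme point can be reproduced with a small made-up dataset. The numbers below are illustrative and are not the ones quoted in the example above.

```python
import numpy as np
from scipy import stats

# Made-up data with essentially no linear relationship
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([5, 3, 6, 2, 6, 3, 5, 4], dtype=float)

r_without = np.corrcoef(x, y)[0, 1]

# Append one extreme point far from the rest of the data
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)

r_with = np.corrcoef(x_out, y_out)[0, 1]
rho_with, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_without:.3f}")
print(f"Pearson with outlier:    {r_with:.3f}")
print(f"Spearman with outlier:   {rho_with:.3f}")
```

On this data Pearson r jumps from roughly zero to above 0.9, while Spearman stays far lower because it only sees the outlier's rank, not its distance.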
Assumption 4: Homoscedasticity (constant variance)
Violated: variance of Y increases as X increases (funnel shape)
Fix: log-transform Y
Assumption 5: For significance testing — bivariate normal distribution
(Less critical for large samples, where the central limit theorem makes the test approximately valid)
Pearson and Linear Regression
r and the linear regression slope β₁ are related:
β₁ = r × (σᵧ / σₓ)
r = β₁ × (σₓ / σᵧ)
R² in linear regression = r² (for simple linear regression with one predictor)
Example:
r = 0.80 → R² = 0.64
64% of variance in Y is explained by X
r = 0.40 → R² = 0.16
Only 16% of variance explained
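Both identities can be checked against an ordinary least-squares fit. This is a sketch using np.polyfit on a small made-up dataset; any OLS routine should give the same result.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# Ordinary least-squares fit of y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)

# Slope predicted from r and the standard deviations
# (same ddof for both, so the ratio is consistent)
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# R^2 = 1 - SS_res / SS_tot
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope from fit = {b1:.4f}, slope from r = {b1_from_r:.4f}")
print(f"R^2 from fit   = {r_squared:.4f}, r^2 = {r**2:.4f}")
```

The equality R² = r² holds only with a single predictor and an intercept; with several predictors R² is no longer the square of any one pairwise correlation.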
In simple linear regression, R² (the "coefficient of determination") is literally the square of the correlation coefficient.
Significance Testing
import numpy as np
from scipy import stats
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8])
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}")
print(f"p = {p_value:.4f}")
# H0: ρ = 0 (no linear correlation in the population)
# If p < 0.05: reject H0, correlation is statistically significant
# Confidence interval using Fisher's z-transform
def pearson_ci(r: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    from scipy.special import ndtri
    z = np.arctanh(r)  # Fisher z-transform
    se = 1 / np.sqrt(n - 3)
    z_crit = ndtri(1 - alpha / 2)  # e.g., 1.96 for 95%
    z_lower = z - z_crit * se
    z_upper = z + z_crit * se
    return float(np.tanh(z_lower)), float(np.tanh(z_upper))
lower, upper = pearson_ci(r, n=len(x))
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
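The p-value from pearsonr can be reproduced from r and n alone using the statistic t = r × √((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The sketch below assumes that standard t-test formulation; p_value_from_r is a helper written for this illustration, not a SciPy function.

```python
import numpy as np
from scipy import stats

def p_value_from_r(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using a t-distribution with n - 2 df."""
    t_stat = r * np.sqrt((n - 2) / (1 - r**2))
    return float(2 * stats.t.sf(abs(t_stat), df=n - 2))

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8], dtype=float)

r, p_scipy = stats.pearsonr(x, y)
p_manual = p_value_from_r(r, len(x))
print(f"scipy p = {p_scipy:.6f}, manual p = {p_manual:.6f}")

# The same tiny r at two sample sizes: n alone drives the p-value
print(f"r=0.05, n=100:    p = {p_value_from_r(0.05, 100):.4f}")
p_tiny = p_value_from_r(0.05, 10_000)
print(f"r=0.05, n=10,000: p = {p_tiny:.2e}")
```

Because n enters the statistic through √(n − 2), a fixed small r eventually crosses any significance threshold as the sample grows.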
# Note: statistical significance ≠ practical significance
# r=0.05 can be significant with n=10,000 but is practically meaningless
Interview Answer
"Pearson correlation is covariance normalised by the product of standard deviations — it measures the strength and direction of the linear relationship between two continuous variables. It equals the cosine similarity of mean-centred vectors. Key assumption: the relationship must be linear — Pearson r can be near zero for a perfect quadratic relationship. It's also sensitive to outliers. When these assumptions are violated, use Spearman rank correlation. In ML, Pearson r² equals R² in simple linear regression, quantifying the fraction of variance in Y explained by X — a direct connection between correlation and model explanatory power."