Pearson Correlation Deep Dive

The Formula

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Equivalently:
r = Cov(X, Y) / (σₓ × σᵧ)

Where:
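  Cov(X, Y) = (1/n) × Σ[(xᵢ - x̄)(yᵢ - ȳ)]  (population form)
  σₓ = standard deviation of X (same 1/n normalisation)
  σᵧ = standard deviation of Y (same 1/n normalisation)

The 1/n factors cancel, so the sample forms (1/(n-1)) give the same r,
as long as the covariance and both standard deviations use the same convention.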

r is Cov(X,Y) normalised to [-1, 1]
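
As a quick check of the two equivalent forms, here is a minimal sketch that assumes NumPy and reuses a small toy dataset. The helper name pearson_from_cov is hypothetical, introduced only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def pearson_from_cov(x, y, ddof):
    """r = Cov(X, Y) / (sigma_x * sigma_y), with one ddof used throughout."""
    cov_xy = np.cov(x, y, ddof=ddof)[0, 1]
    return cov_xy / (x.std(ddof=ddof) * y.std(ddof=ddof))

r_population = pearson_from_cov(x, y, ddof=0)  # 1/n everywhere
r_sample = pearson_from_cov(x, y, ddof=1)      # 1/(n-1) everywhere
r_reference = np.corrcoef(x, y)[0, 1]

# Mixing conventions (sample covariance, population std) inflates r by n/(n-1)
r_mixed = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=0) * y.std(ddof=0))

print(f"population form: {r_population:.4f}")
print(f"sample form:     {r_sample:.4f}")
print(f"np.corrcoef:     {r_reference:.4f}")
print(f"mixed (wrong):   {r_mixed:.4f}")
```

The first three values should agree; the mixed version does not, which is the usual source of an "r" slightly above the expected value.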

Why the Formula Works

(xᵢ - x̄)(yᵢ - ȳ) is positive when:
  both xᵢ and yᵢ are above their means → positive × positive = positive
  both xᵢ and yᵢ are below their means → negative × negative = positive

(xᵢ - x̄)(yᵢ - ȳ) is negative when:
  xᵢ above mean, yᵢ below mean → positive × negative = negative
  xᵢ below mean, yᵢ above mean → negative × positive = negative

Sum of these products:
  Mostly positive → positive correlation
  Mostly negative → negative correlation
  Mix of both → near zero

Dividing by σₓ × σᵧ normalises to [-1, 1] regardless of scale
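
The sign argument and the scale invariance can be sketched in a few lines. This is a minimal illustration assuming NumPy; the helper pearson_r and the rescaling constants (100 and 7) are hypothetical choices.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def pearson_r(a, b):
    """Pearson r from centred products, following the formula in the text."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(np.sum(a_c * b_c) / np.sqrt(np.sum(a_c**2) * np.sum(b_c**2)))

# Per-point products: their signs drive the sign of r
products = (x - x.mean()) * (y - y.mean())
n_positive = int(np.sum(products > 0))
n_negative = int(np.sum(products < 0))
print("products:", products, "| positive:", n_positive, "| negative:", n_negative)

# Changing units (scale and offset) leaves r unchanged
r_original = pearson_r(x, y)
r_rescaled = pearson_r(100 * x + 7, y)

# A negative scale factor flips the sign but not the magnitude
r_flipped = pearson_r(-x, y)
print(f"original: {r_original:.4f}, rescaled: {r_rescaled:.4f}, flipped: {r_flipped:.4f}")
```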

Covariance vs Correlation

Python

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Covariance (scale-dependent)
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance
print(f"Cov(X,Y) = {cov_xy:.3f}")

# Correlation (scale-free, -1 to 1)
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")

# Manual:
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered**2) * np.sum(y_centered**2)
)
print(f"Manual r = {r_manual:.3f}")

# Covariance matrix (for multiple features)
X = np.stack([x, y], axis=1)  # shape (5, 2)
cov_matrix = np.cov(X.T, ddof=1)   # shape (2, 2)
# Diagonal: variances; off-diagonal: covariances
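
A covariance matrix can be turned into a correlation matrix by dividing each entry by the product of the two matching standard deviations. The sketch below assumes NumPy and the same toy data; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

X = np.stack([x, y], axis=1)        # shape (5, 2): rows are observations
cov_matrix = np.cov(X.T, ddof=1)    # shape (2, 2)

# Standard deviations sit on the diagonal as variances
std = np.sqrt(np.diag(cov_matrix))

# Entry (i, j) is divided by std[i] * std[j]
corr_matrix = cov_matrix / np.outer(std, std)

print(corr_matrix)
```

The result should match np.corrcoef(X.T), with ones on the diagonal.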

Pearson and Cosine Similarity

Pearson correlation = cosine similarity of MEAN-CENTRED vectors

Cosine similarity: cos(a, b) = (a · b) / (‖a‖ × ‖b‖)

Pearson r = cos(x - x̄, y - ȳ)

This is why:
  Cosine on raw vectors → measures the angle between them (ignores magnitude, but a shared offset changes it)
  Pearson on raw values → measures linear relationship (subtracts each mean first, so offsets do not matter)

In NLP/RAG:
  Cosine on embedding vectors (NOT mean-centred by default)
  → measures semantic similarity in embedding space

In statistics:
  Pearson on data values → measures linear relationship between variables
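
The identity can be checked numerically. This is a minimal sketch assuming NumPy; cosine_similarity is a hypothetical helper written for this example, and the offset of 100 is arbitrary.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = float(np.corrcoef(x, y)[0, 1])
cos_raw = cosine_similarity(x, y)
cos_centered = cosine_similarity(x - x.mean(), y - y.mean())

# Rescaling one vector does not change cosine similarity
cos_scaled = cosine_similarity(10 * x, y)

# Adding a shared offset pushes raw cosine towards 1, while Pearson r is unchanged
cos_shifted = cosine_similarity(x + 100, y + 100)
r_shifted = float(np.corrcoef(x + 100, y + 100)[0, 1])

print(f"Pearson r:          {r:.4f}")
print(f"cosine (centred):   {cos_centered:.4f}")
print(f"cosine (raw):       {cos_raw:.4f}")
print(f"cosine (shifted):   {cos_shifted:.4f}")
```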

Pearson's Assumptions and When They Fail

Assumption 1: Both variables are continuous
  Violated: one variable is binary (two categories) and the other is continuous → use point-biserial r

Assumption 2: Linear relationship
  Violated: y = x² (curved)
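  r is 0 for a perfect quadratic relationship when x is symmetric about 0

  Fix: transform variables (log, sqrt), or use Spearman if the relationship is monotonic
  (Spearman does not help with a non-monotonic curve like y = x²)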
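
A short sketch of the nonlinear cases, assuming NumPy and SciPy. The data are synthetic: a symmetric grid for the quadratic case and y = exp(x) as an example of a monotonic curve.

```python
import numpy as np
from scipy import stats

# Grid that is exactly symmetric about 0
x = np.arange(-30, 31) / 10.0

# Non-monotonic: perfect quadratic relationship
y_quad = x**2
r_quad, _ = stats.pearsonr(x, y_quad)
rho_quad, _ = stats.spearmanr(x, y_quad)

# Monotonic but curved: exponential relationship
y_exp = np.exp(x)
r_exp, _ = stats.pearsonr(x, y_exp)
rho_exp, _ = stats.spearmanr(x, y_exp)

# A log transform makes the exponential relationship linear again
r_log, _ = stats.pearsonr(x, np.log(y_exp))

print(f"quadratic:   Pearson = {r_quad:.3f}, Spearman = {rho_quad:.3f}")
print(f"exponential: Pearson = {r_exp:.3f}, Spearman = {rho_exp:.3f}")
print(f"after log transform: Pearson = {r_log:.3f}")
```

Both coefficients miss the quadratic relationship, Spearman recovers the monotonic one, and the transform restores a linear relationship that Pearson can see.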

Assumption 3: No severe outliers
  One outlier can dramatically change r
  Example (illustrative numbers):
    Without outlier: r = 0.12 (weak)
    With one extreme point added: r = 0.85 (strong but artificial)
  Fix: use Spearman, or remove confirmed outliers
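
The outlier effect is easy to reproduce. The sketch below assumes NumPy and SciPy and uses a small made-up dataset; the outlier at (20, 20) is a hypothetical point chosen to be far from the rest.

```python
import numpy as np
from scipy import stats

# Made-up data: six points with almost no linear trend
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 1, 4, 1, 5, 2], dtype=float)
r_without, _ = stats.pearsonr(x, y)

# Add one extreme point far away from the cloud
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
r_with, _ = stats.pearsonr(x_out, y_out)

# Spearman works on ranks, so the outlier counts as just "the largest point"
rho_with, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_without:.3f}")
print(f"Pearson with outlier:    {r_with:.3f}")
print(f"Spearman with outlier:   {rho_with:.3f}")
```

Pearson jumps from weak to very strong because of a single point, while the rank-based Spearman coefficient moves far less.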

Assumption 4: Homoscedasticity (constant variance)
  Violated: variance of Y increases as X increases (funnel shape)
  Fix: a variance-stabilising transform such as log(Y) often helps when Y is positive

Assumption 5: For significance testing — bivariate normal distribution
  (Less important for large samples by CLT)

Pearson and Linear Regression

r and the linear regression slope β₁ are related:
  β₁ = r × (σᵧ / σₓ)
  r = β₁ × (σₓ / σᵧ)

R² in linear regression = r² (for simple linear regression with one predictor)

Example:
  r = 0.80 → R² = 0.64
  64% of variance in Y is explained by X

  r = 0.40 → R² = 0.16
  Only 16% of variance explained

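In simple linear regression, R² (the "coefficient of determination")
is literally the square of the correlation coefficient.
With more than one predictor, R² is instead the squared correlation between Y and the fitted values.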
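
Both relationships can be verified with a least-squares fit. This is a minimal sketch assuming NumPy and the toy data used earlier; np.polyfit stands in for any simple linear regression routine.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Ordinary least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# The slope recovered from r and the two standard deviations (same ddof for both)
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# R² from its definition: 1 - SS_residual / SS_total
y_hat = slope * x + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.4f}, r * (s_y / s_x) = {slope_from_r:.4f}")
print(f"R² = {r_squared:.4f}, r² = {r**2:.4f}")
```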

Significance Testing

Python

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8])

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}")
print(f"p = {p_value:.4f}")

# H0: ρ = 0 (no linear correlation in the population)
# If p < 0.05: reject H0, the correlation is statistically significant at the 5% level

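# Confidence interval using Fisher's z-transform
def pearson_ci(r: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Approximate (1 - alpha) CI for r; assumes bivariate normality and n > 3."""
    if n <= 3:
        raise ValueError("Fisher's z-transform needs n > 3")

    z = np.arctanh(r)                        # Fisher z-transform
    se = 1 / np.sqrt(n - 3)                  # standard error on the z scale
    z_crit = stats.norm.ppf(1 - alpha / 2)   # e.g., 1.96 for 95%

    z_lower = z - z_crit * se
    z_upper = z + z_crit * se

    # Back-transform the bounds from the z scale to the r scale
    return float(np.tanh(z_lower)), float(np.tanh(z_upper))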

lower, upper = pearson_ci(r, n=len(x))
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Note: statistical significance ≠ practical significance
# r=0.05 can be significant with n=10,000 but is practically meaningless
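
The p-value from pearsonr can be reproduced with the standard t-test for a correlation coefficient, t = r × √(n - 2) / √(1 - r²) with n - 2 degrees of freedom. The sketch below assumes NumPy and SciPy; pearson_p_value is a hypothetical helper, and it also puts numbers on the n = 10,000 remark.

```python
import numpy as np
from scipy import stats

def pearson_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using the t distribution with n - 2 df."""
    t_stat = r * np.sqrt((n - 2) / (1 - r**2))
    return float(2 * stats.t.sf(abs(t_stat), df=n - 2))

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8], dtype=float)

r, p_scipy = stats.pearsonr(x, y)
p_manual = pearson_p_value(float(r), len(x))
print(f"scipy p = {p_scipy:.6f}, manual p = {p_manual:.6f}")

# A tiny effect becomes "significant" once n is large enough
p_tiny_effect = pearson_p_value(0.05, 10_000)
print(f"r = 0.05, n = 10,000 -> p = {p_tiny_effect:.2e}, r² = {0.05**2:.4f}")
```

With r = 0.05 the p-value is far below 0.05, yet r² says X explains only 0.25% of the variance in Y.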

Interview Answer

"Pearson correlation is covariance normalised by the product of standard deviations: it measures the strength and direction of the linear relationship between two continuous variables. It equals the cosine similarity of mean-centred vectors. Key assumption: the relationship must be linear; Pearson r can be zero for a perfect quadratic relationship. It's also sensitive to outliers. When the relationship is monotonic but not linear, or outliers are a concern, use Spearman rank correlation. In ML, Pearson r² equals R² in simple linear regression, quantifying the fraction of variance in Y explained by X, which directly connects correlation to model explanatory power."
