Learnixo

Statistics & Math for AI/ML Interviews · Lesson 26 of 30

Pearson Correlation Coefficient

The Formula

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Equivalently:
r = Cov(X, Y) / (σₓ × σᵧ)

Where:
  Cov(X, Y) = (1/n) × Σ[(xᵢ - x̄)(yᵢ - ȳ)]  (population)
  σₓ = standard deviation of X
  σᵧ = standard deviation of Y

r is Cov(X,Y) normalised to [-1, 1]
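
The covariance above is written in its population form (dividing by n). A minimal sketch, assuming a small made-up dataset and a hypothetical helper named pearson_from_cov, suggests that the choice between population (ddof=0) and sample (ddof=1) versions does not change r, as long as the covariance and both standard deviations use the same convention:

```python
import numpy as np

# Small illustrative dataset (an assumption, chosen only for easy arithmetic)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def pearson_from_cov(x: np.ndarray, y: np.ndarray, ddof: int) -> float:
    """r = Cov(X, Y) / (sigma_x * sigma_y), with the same ddof everywhere."""
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - ddof)
    return float(cov / (x.std(ddof=ddof) * y.std(ddof=ddof)))

r_pop = pearson_from_cov(x, y, ddof=0)     # population convention (divide by n)
r_sample = pearson_from_cov(x, y, ddof=1)  # sample convention (divide by n - 1)

print(f"population r = {r_pop:.4f}")
print(f"sample r     = {r_sample:.4f}")
```

The (n - ddof) factors cancel between numerator and denominator, which is why the first formula for r needs no 1/n at all.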

Why the Formula Works

(xᵢ - x̄)(yᵢ - ȳ) is positive when:
  both xᵢ and yᵢ are above their means → positive × positive = positive
  both xᵢ and yᵢ are below their means → negative × negative = positive

(xᵢ - x̄)(yᵢ - ȳ) is negative when:
  xᵢ above mean, yᵢ below mean → positive × negative = negative
  xᵢ below mean, yᵢ above mean → negative × positive = negative

Sum of these products:
  Mostly positive → positive correlation
  Mostly negative → negative correlation
  Mix of both → near zero

Dividing by σₓ × σᵧ normalises to [-1, 1] regardless of scale
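
The sign argument above can be traced point by point. The sketch below uses a small made-up dataset (an assumption, not data from the lesson) and prints the sign of each product of deviations before summing them:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

dx = x - x.mean()    # deviations of x from its mean: [-2, -1, 0, 1, 2]
dy = y - y.mean()    # deviations of y from its mean: [-2, 0, 1, 0, 1]
products = dx * dy   # per-point contributions to the numerator of r

for xi, yi, p in zip(x, y, products):
    sign = "positive" if p > 0 else "negative" if p < 0 else "zero"
    print(f"x={xi:.0f}, y={yi:.0f}: contribution is {sign}")

numerator = products.sum()
r = numerator / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(f"sum of products = {numerator:.1f}, r = {r:.4f}")
```

Here no point contributes a negative product, so the sum is positive and r comes out positive.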

Covariance vs Correlation

Python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Covariance (scale-dependent)
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance
print(f"Cov(X,Y) = {cov_xy:.3f}")

# Correlation (scale-free, -1 to 1)
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")

# Manual:
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered**2) * np.sum(y_centered**2)
)
print(f"Manual r = {r_manual:.3f}")

# Covariance matrix (for multiple features)
X = np.stack([x, y], axis=1)  # shape (5, 2)
cov_matrix = np.cov(X.T, ddof=1)   # shape (2, 2)
# Diagonal: variances; off-diagonal: covariances
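
To make "scale-dependent" concrete, here is a hedged sketch that re-expresses x in different units. The variable names x_m and x_cm and the metre/centimetre framing are assumptions for illustration only:

```python
import numpy as np

x_m = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical lengths in metres
y = np.array([2, 4, 5, 4, 5], dtype=float)
x_cm = x_m * 100                                # same lengths in centimetres

cov_m = np.cov(x_m, y, ddof=1)[0, 1]
cov_cm = np.cov(x_cm, y, ddof=1)[0, 1]
r_m = np.corrcoef(x_m, y)[0, 1]
r_cm = np.corrcoef(x_cm, y)[0, 1]

print(f"Cov with x in m:  {cov_m:.3f}")
print(f"Cov with x in cm: {cov_cm:.3f}")   # 100 times larger
print(f"r with x in m:    {r_m:.4f}")
print(f"r with x in cm:   {r_cm:.4f}")     # unchanged
```

Changing units multiplies the covariance by the conversion factor, while r stays the same, which is why r is the quantity to compare across feature pairs.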

Pearson and Cosine Similarity

Pearson correlation = cosine similarity of MEAN-CENTRED vectors

Cosine similarity: cos(a, b) = (a · b) / (‖a‖ × ‖b‖)

Pearson r = cos(x - x̄, y - ȳ)

This is why:
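  Cosine on raw vectors → measures the angle between them (ignores magnitude, but is affected by a constant offset such as a non-zero mean)
  Pearson on raw values → measures linear relationship (removes the mean first, so it ignores both offset and scale)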

In NLP/RAG:
  Cosine on embedding vectors (NOT mean-centred by default)
  → measures semantic similarity in embedding space

In statistics:
  Pearson on data values → measures linear relationship between variables
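
The identity Pearson r = cos(x - x̄, y - ȳ) can be checked numerically. This is a minimal sketch on made-up data; cosine_similarity is a hypothetical helper written for the example, not a library function:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

pearson_r = float(np.corrcoef(x, y)[0, 1])
cos_raw = cosine_similarity(x, y)
cos_centred = cosine_similarity(x - x.mean(), y - y.mean())

# Adding a constant offset changes raw cosine but not Pearson
cos_raw_shifted = cosine_similarity(x + 100, y + 100)
pearson_shifted = float(np.corrcoef(x + 100, y + 100)[0, 1])

# Multiplying by a positive constant changes neither
cos_raw_scaled = cosine_similarity(10 * x, y)

print(f"Pearson r            = {pearson_r:.4f}")
print(f"cosine (centred)     = {cos_centred:.4f}")
print(f"cosine (raw)         = {cos_raw:.4f}")
print(f"cosine (raw, +100)   = {cos_raw_shifted:.4f}")
print(f"Pearson (+100)       = {pearson_shifted:.4f}")
```

The centred cosine matches Pearson r, while the raw cosine differs and moves when a constant is added to both vectors.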

Pearson's Assumptions and When They Fail

Assumption 1: Both variables are continuous
  Violated: one variable is binary (e.g. a 0/1 label) and the other continuous → use point-biserial r (Pearson r with the binary variable coded 0/1)

Assumption 2: Linear relationship
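  Violated: y = x² (curved)
  r can be 0 even for a perfect quadratic relationship (e.g. when x is symmetric around 0)
  
  Fix: for a monotonic curve, transform variables (log, sqrt) or use Spearman
  Note: Spearman does not help when the curve is non-monotonic, as with y = x² on symmetric x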

Assumption 3: No severe outliers
  One outlier can dramatically change r
  Example:
    Without outlier: r = 0.12 (weak)
    With one extreme point added: r = 0.85 (strong but artificial)
  Fix: use Spearman, or remove confirmed outliers

Assumption 4: Homoscedasticity (constant variance)
  Violated: variance of Y increases as X increases (funnel shape)
  Fix: log-transform Y

Assumption 5: For significance testing — bivariate normal distribution
  (Less important for large samples by CLT)
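
Two of the failure modes above (non-linearity and outliers) are easy to reproduce. The sketch below uses synthetic data invented for the demonstration; pearson and ranks are hypothetical helpers, and ranks is a simplified stand-in for a real rank function that assumes no tied values:

```python
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

def ranks(a: np.ndarray) -> np.ndarray:
    # Rank positions 0..n-1; simplified helper that assumes no ties
    return np.argsort(np.argsort(a)).astype(float)

# Assumption 2: perfect but non-monotonic relationship, x symmetric around 0
x_sym = np.linspace(-3, 3, 61)
r_quadratic = pearson(x_sym, x_sym**2)

# Monotonic but non-linear: Spearman (Pearson on ranks) recovers it
x_pos = np.linspace(0, 5, 50)
y_exp = np.exp(x_pos)
r_exp = pearson(x_pos, y_exp)
rho_exp = pearson(ranks(x_pos), ranks(y_exp))

# Assumption 3: a single extreme point dominates r
x_base = np.arange(10, dtype=float)
y_base = np.array([1, -1] * 5, dtype=float)   # alternating values, no real trend
r_without = pearson(x_base, y_base)
r_with = pearson(np.append(x_base, 100.0), np.append(y_base, 100.0))

print(f"y = x^2:        Pearson r = {r_quadratic:.3f}")
print(f"y = exp(x):     Pearson r = {r_exp:.3f}, Spearman rho = {rho_exp:.3f}")
print(f"no outlier:     Pearson r = {r_without:.3f}")
print(f"one outlier:    Pearson r = {r_with:.3f}")
```

In practice scipy.stats.spearmanr handles ties properly; the rank helper here only keeps the example dependency-free.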

Pearson and Linear Regression

r and the linear regression slope β₁ are related:
  β₁ = r × (σᵧ / σₓ)
  r = β₁ × (σₓ / σᵧ)

R² in linear regression = r² (for simple linear regression with one predictor)

Example:
  r = 0.80 → R² = 0.64
  64% of variance in Y is explained by X

  r = 0.40 → R² = 0.16
  Only 16% of variance explained

This is why the coefficient of determination is written R²:
in simple linear regression it is literally the square of the correlation coefficient r.
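
Both identities can be verified with a least-squares fit. This is a sketch on made-up data, using np.polyfit for the regression line; the variable names are chosen for the example:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Ordinary least-squares line y = intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)

r = float(np.corrcoef(x, y)[0, 1])
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)   # beta_1 = r * (sigma_y / sigma_x)

# R^2 computed from its definition: 1 - SS_res / SS_tot
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope from fit = {slope:.4f}, slope from r = {slope_from_r:.4f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r**2:.4f}")
```

The equality R² = r² holds only with a single predictor and an intercept; with several predictors R² is the squared correlation between y and the fitted values instead.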

Significance Testing

Python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8])

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}")
print(f"p = {p_value:.4f}")

# H0: ρ = 0 (no linear correlation in the population)
# If p < 0.05: reject H0; the correlation is statistically significant at the 5% level

# Confidence interval using Fisher's z-transform
def pearson_ci(r: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    from scipy.special import ndtri
    
    z = np.arctanh(r)   # Fisher z-transform
    se = 1 / np.sqrt(n - 3)   # standard error of z; requires n > 3
    z_crit = ndtri(1 - alpha / 2)  # e.g., 1.96 for 95%
    
    z_lower = z - z_crit * se
    z_upper = z + z_crit * se
    
    return float(np.tanh(z_lower)), float(np.tanh(z_upper))

lower, upper = pearson_ci(r, n=len(x))
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Note: statistical significance ≠ practical significance
# r=0.05 can be significant with n=10,000 but is practically meaningless
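
The closing note can be checked with the usual test statistic t = r√(n - 2) / √(1 - r²). The sketch below is an approximation under a stated assumption: it uses the standard normal distribution in place of the t distribution, which is reasonable for large n and rough for small n. The function name is hypothetical:

```python
import math

def pearson_p_value_normal_approx(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using a normal approximation to t."""
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    # P(|Z| > |t|) for standard normal Z equals erfc(|t| / sqrt(2))
    return math.erfc(abs(t_stat) / math.sqrt(2))

r_small = 0.05
p_large_n = pearson_p_value_normal_approx(r_small, n=10_000)
p_small_n = pearson_p_value_normal_approx(r_small, n=100)
variance_explained = r_small**2

print(f"r = {r_small}, n = 10,000: p ~ {p_large_n:.2e}")
print(f"r = {r_small}, n = 100:    p ~ {p_small_n:.2f}")
print(f"variance explained (r^2): {variance_explained:.4f}")
```

The same r is far from significant at n = 100 and highly significant at n = 10,000, while explaining only a quarter of one percent of the variance in both cases.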

Interview Answer

"Pearson correlation is covariance normalised by the product of the standard deviations. It measures the strength and direction of the linear relationship between two continuous variables, and it equals the cosine similarity of the mean-centred vectors. The key assumption is linearity: Pearson r can be near zero for a perfect quadratic relationship. It is also sensitive to outliers. When the relationship is monotonic but not linear, or when outliers are a concern, use Spearman rank correlation. In ML, r² equals R² in simple linear regression, the fraction of variance in Y explained by X, which directly connects correlation to model explanatory power."