
Pearson Correlation Deep Dive

The mathematical derivation of Pearson correlation, its assumptions, when it fails, and how it connects to linear regression and cosine similarity.

Asma Hafeez Khan | May 21, 2026 | 5 min read
Tags: Statistics, Pearson, Correlation, Linear Regression, Interview

The Formula

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Equivalently:
r = Cov(X, Y) / (σₓ × σᵧ)

Where:
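  Cov(X, Y) = (1/n) × Σ[(xᵢ - x̄)(yᵢ - ȳ)]  (population; the sample version divides by n - 1)
  σₓ = standard deviation of X
  σᵧ = standard deviation of Y

The 1/n (or 1/(n - 1)) factor cancels in r, provided Cov and both σ use the same convention.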

r is Cov(X,Y) normalised to [-1, 1]
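
As a quick check on the covariance form, the sketch below computes r with both the population (divide by n) and sample (divide by n - 1) conventions. The arrays are toy values made up for illustration, and `pearson_from_cov` is a hypothetical helper name, not a library function.

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_from_cov(x: np.ndarray, y: np.ndarray, ddof: int) -> float:
    """r = Cov(X, Y) / (sigma_x * sigma_y), with one ddof used throughout."""
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - ddof)
    return float(cov / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof)))

r_population = pearson_from_cov(x, y, ddof=0)  # divides by n
r_sample = pearson_from_cov(x, y, ddof=1)      # divides by n - 1

print(f"population convention: r = {r_population:.3f}")
print(f"sample convention:     r = {r_sample:.3f}")
```

Both conventions give the same r (here 6 / √60 ≈ 0.775), because the normalising factor appears once in the numerator and once in the denominator. Mixing conventions, for example a sample covariance with population standard deviations, would break that cancellation.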

Why the Formula Works

(xᵢ - x̄)(yᵢ - ȳ) is positive when:
  both xᵢ and yᵢ are above their means → positive × positive = positive
  both xᵢ and yᵢ are below their means → negative × negative = positive

(xᵢ - x̄)(yᵢ - ȳ) is negative when:
  xᵢ above mean, yᵢ below mean → positive × negative = negative
  xᵢ below mean, yᵢ above mean → negative × positive = negative

Sum of these products:
  Mostly positive → positive correlation
  Mostly negative → negative correlation
  Positive and negative products roughly cancel → near zero

Dividing by σₓ × σᵧ removes the units, so r lies in [-1, 1] regardless of scale (the bound follows from the Cauchy-Schwarz inequality)
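
The sign argument and the scale invariance can be checked numerically. This is a minimal sketch on made-up toy data; the `pearson` helper and the rescaling constants (100, 7, 0.5, 3) are arbitrary choices for illustration.

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Per-point products of deviations: their signs drive the sign of r
products = (x - x.mean()) * (y - y.mean())
n_positive = int(np.sum(products > 0))
n_negative = int(np.sum(products < 0))
print(f"{n_positive} positive products, {n_negative} negative products")

# A change of units (positive rescaling plus a shift) leaves r unchanged,
# while the covariance is multiplied by the product of the scale factors.
r = pearson(x, y)
r_rescaled = pearson(100 * x + 7, 0.5 * y - 3)
cov_original = np.cov(x, y)[0, 1]
cov_rescaled = np.cov(100 * x + 7, 0.5 * y - 3)[0, 1]

# Negating one variable flips the sign of r but not its magnitude
r_flipped = pearson(-x, y)

print(f"r = {r:.3f}, rescaled r = {r_rescaled:.3f}, flipped r = {r_flipped:.3f}")
print(f"cov = {cov_original:.3f}, rescaled cov = {cov_rescaled:.3f}")
```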

Covariance vs Correlation

Python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Covariance (scale-dependent)
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance
print(f"Cov(X,Y) = {cov_xy:.3f}")

# Correlation (scale-free, -1 to 1)
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")

# Manual:
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered**2) * np.sum(y_centered**2)
)
print(f"Manual r = {r_manual:.3f}")

# Covariance matrix (for multiple features)
X = np.stack([x, y], axis=1)  # shape (5, 2)
cov_matrix = np.cov(X.T, ddof=1)   # shape (2, 2)
# Diagonal: variances; off-diagonal: covariances
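
A covariance matrix can be turned into a correlation matrix by dividing each entry by the product of the two corresponding standard deviations. The sketch below assumes the same toy arrays as above and uses `rowvar=False` so that columns are treated as variables.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
X = np.stack([x, y], axis=1)  # shape (5, 2): rows are observations

cov_matrix = np.cov(X, rowvar=False, ddof=1)   # columns are variables
std = np.sqrt(np.diag(cov_matrix))             # per-feature standard deviations
corr_matrix = cov_matrix / np.outer(std, std)  # entry (i, j) divided by std_i * std_j

print(corr_matrix)
```

The diagonal of `corr_matrix` is 1 by construction, and the off-diagonal entries should match `np.corrcoef(X, rowvar=False)`.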

Pearson and Cosine Similarity

Pearson correlation = cosine similarity of MEAN-CENTRED vectors

Cosine similarity: cos(a, b) = (a · b) / (‖a‖ × ‖b‖)

Pearson r = cos(x - x̄, y - ȳ)

This is why:
  Cosine on raw vectors → measures the angle between them (ignores magnitude, but a constant offset in the values changes it)
  Pearson on raw values → measures linear relationship (subtracts each mean first, so offsets drop out)

In NLP/RAG:
  Cosine on embedding vectors (NOT mean-centred by default)
  → measures semantic similarity in embedding space

In statistics:
  Pearson on data values → measures linear relationship between variables
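
The identity is easy to verify numerically. This is a small sketch on made-up data; `cosine` is a hypothetical helper written for the example.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = float(np.corrcoef(x, y)[0, 1])
cos_centred = cosine(x - x.mean(), y - y.mean())  # equals Pearson r
cos_raw = cosine(x, y)                            # differs from r in general

# Raw cosine ignores magnitude but reacts to a constant offset
cos_raw_scaled = cosine(10 * x, y)
cos_raw_shifted = cosine(x + 100, y + 100)

print(f"Pearson r            = {r:.4f}")
print(f"cosine (centred)     = {cos_centred:.4f}")
print(f"cosine (raw)         = {cos_raw:.4f}")
print(f"cosine (raw, scaled) = {cos_raw_scaled:.4f}")
print(f"cosine (raw, +100)   = {cos_raw_shifted:.4f}")
```

Adding 100 to every value pushes both vectors toward the all-ones direction, so the raw cosine moves toward 1 even though the linear relationship between the variables has not changed.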

Pearson's Assumptions and When They Fail

Assumption 1: Both variables are continuous
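  Violated: one variable is categorical
  For a binary variable vs a continuous one → use point-biserial r (Pearson computed with the binary variable coded 0/1)
  Pearson is not meaningful for unordered categories with more than two levels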

Assumption 2: Linear relationship
  Violated: y = x² (curved)
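  r is exactly 0 for a perfect quadratic relationship when the x values are symmetric about zero

  Fix: transform variables (log, sqrt), or use Spearman when the relationship is monotonic but curved
  (Spearman does not help with non-monotonic shapes such as y = x²)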

Assumption 3: No severe outliers
  One outlier can dramatically change r
  Example (illustrative numbers):
    Without outlier: r = 0.12 (weak)
    With one extreme point added: r = 0.85 (strong but artificial)
  Fix: use Spearman, or remove confirmed outliers

Assumption 4: Homoscedasticity (constant variance)
  Violated: variance of Y increases as X increases (funnel shape)
  Possible fix: log-transform Y (often stabilises the variance when the spread grows with the level of Y)

Assumption 5 (significance testing only): bivariate normal distribution
  (Less important for large samples, where the CLT makes the test approximately valid)
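
The linearity and outlier failures above can be reproduced on small deterministic datasets. Everything in this sketch is made up for illustration, including the outlier at (100, 100); it compares Pearson with Spearman via `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy import stats

# 1) Non-monotonic: y = x^2 with x symmetric about zero
x_sym = np.arange(-5, 6, dtype=float)
y_quad = x_sym**2
r_quad = np.corrcoef(x_sym, y_quad)[0, 1]
rho_quad, _ = stats.spearmanr(x_sym, y_quad)

# 2) Monotonic but curved: y = exp(x)
x_pos = np.arange(1, 11, dtype=float)
y_exp = np.exp(x_pos)
r_exp = np.corrcoef(x_pos, y_exp)[0, 1]
rho_exp, _ = stats.spearmanr(x_pos, y_exp)
r_exp_logged = np.corrcoef(x_pos, np.log(y_exp))[0, 1]  # after a log transform

# 3) A single extreme point added to weakly related data
x_clean = np.arange(1, 11, dtype=float)
y_clean = np.array([5, 3, 6, 4, 5, 6, 3, 5, 4, 6], dtype=float)
r_clean = np.corrcoef(x_clean, y_clean)[0, 1]

x_out = np.append(x_clean, 100.0)
y_out = np.append(y_clean, 100.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]
rho_out, _ = stats.spearmanr(x_out, y_out)

print(f"quadratic:   Pearson = {r_quad:.3f}, Spearman = {rho_quad:.3f}")
print(f"exponential: Pearson = {r_exp:.3f}, Spearman = {rho_exp:.3f}, "
      f"Pearson after log = {r_exp_logged:.3f}")
print(f"outlier:     clean r = {r_clean:.3f}, with outlier r = {r_out:.3f}, "
      f"Spearman with outlier = {rho_out:.3f}")
```

In the quadratic case both coefficients come out at zero, which is why a rank-based measure is not a cure for non-monotonic relationships. In the exponential case Spearman is 1 and the log transform restores a perfectly linear relationship. In the outlier case Pearson jumps close to 1 while Spearman moves far less, since the extreme point only counts as one rank.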

Pearson and Linear Regression

r and the linear regression slope β₁ are related:
  β₁ = r × (σᵧ / σₓ)
  r = β₁ × (σₓ / σᵧ)

R² in linear regression = r² (for simple linear regression with one predictor)

Example:
  r = 0.80 → R² = 0.64
  64% of variance in Y is explained by X

  r = 0.40 → R² = 0.16
  Only 16% of variance explained

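R², the "coefficient of determination", is literally the square of the correlation coefficient in the one-predictor case.
With several predictors, R² is instead the squared correlation between y and the fitted values ŷ.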
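
Both relationships can be confirmed with an ordinary least-squares fit. The sketch below uses `np.polyfit` for the fit, on example data chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0, 10.0, 8.0])

# Least-squares line y ≈ intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)

# Slope recovered from r and the two standard deviations (same ddof for both)
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# R² from its definition: 1 - SS_res / SS_tot
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.4f}, r * sd_y / sd_x = {slope_from_r:.4f}")
print(f"R² = {r_squared:.4f}, r² = {r**2:.4f}")
```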

Significance Testing

Python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8])

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}")
print(f"p = {p_value:.4f}")

# H0: ρ = 0 (no linear correlation in the population)
# If p < 0.05: reject H0, correlation is statistically significant

# Confidence interval using Fisher's z-transform
def pearson_ci(r: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    from scipy.special import ndtri
    
    z = np.arctanh(r)   # Fisher z-transform (requires |r| < 1)
    se = 1 / np.sqrt(n - 3)  # approximate standard error; requires n > 3
    z_crit = ndtri(1 - alpha / 2)  # e.g., 1.96 for 95%
    
    z_lower = z - z_crit * se
    z_upper = z + z_crit * se
    
    return float(np.tanh(z_lower)), float(np.tanh(z_upper))

lower, upper = pearson_ci(r, n=len(x))
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Note: statistical significance ≠ practical significance
# r=0.05 can be significant with n=10,000 but is practically meaningless
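
The p-value for H0 of zero correlation can also be computed by hand from the t-statistic t = r × √((n - 2) / (1 - r²)) with n - 2 degrees of freedom. The sketch below does that and then reuses the same formula for the hypothetical r = 0.05, n = 10,000 case mentioned above.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 3, 6, 5, 8, 7, 9, 10, 8], dtype=float)

n = len(x)
r = np.corrcoef(x, y)[0, 1]

# Two-sided t-test for zero correlation
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

r_scipy, p_scipy = stats.pearsonr(x, y)
print(f"manual: t = {t_stat:.3f}, p = {p_manual:.6f}")
print(f"scipy:  r = {r_scipy:.3f}, p = {p_scipy:.6f}")

# Hypothetical large-sample case: tiny effect, yet "significant"
r_small, n_large = 0.05, 10_000
t_small = r_small * np.sqrt((n_large - 2) / (1 - r_small**2))
p_small = 2 * stats.t.sf(abs(t_small), df=n_large - 2)
print(f"r = {r_small}, n = {n_large}: p = {p_small:.2e}, r² = {r_small**2:.4f}")
```

With r = 0.05 the p-value falls well below 0.05, yet r² = 0.0025 means X accounts for only a quarter of a percent of the variance in Y.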

Interview Answer

"Pearson correlation is covariance normalised by the product of standard deviations — it measures the strength and direction of the linear relationship between two continuous variables. It equals the cosine similarity of mean-centred vectors. Key assumption: the relationship must be linear — Pearson r can be near zero for a perfect quadratic relationship. It's also sensitive to outliers. When these assumptions are violated, use Spearman rank correlation. In ML, Pearson r² equals R² in simple linear regression, quantifying the fraction of variance in Y explained by X — a direct connection between correlation and model explanatory power."
