Learnixo

Statistics & Math for AI/ML Interviews · Lesson 22 of 30

Normal Distribution

The Formula

f(x) = (1 / (σ√(2π))) × exp(-(x - μ)² / (2σ²))

Parameters:
  μ = mean (centre of the bell curve)
  σ = standard deviation (width of the bell curve)
  σ² = variance

Standard normal: μ = 0, σ = 1  →  denoted N(0, 1)
General normal: X ~ N(μ, σ²)
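
The density formula can be checked numerically. The sketch below writes it out term by term and compares it with SciPy; `normal_pdf` is a hypothetical helper name, not part of any library. Note that `scipy.stats.norm` takes the standard deviation as `scale`, not the variance.

```python
import math
from scipy.stats import norm

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x, written directly from the formula."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Peak of the standard normal is 1 / sqrt(2*pi)
peak = normal_pdf(0.0)
# Same point via SciPy (scale is sigma, not sigma^2)
peak_scipy = float(norm.pdf(0.0, loc=0.0, scale=1.0))
print(f"hand-written: {peak:.4f}, scipy: {peak_scipy:.4f}")  # both 0.3989
```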

The 68-95-99.7 Rule

For X ~ N(μ, σ²):
  P(μ - σ ≤ X ≤ μ + σ)   ≈ 0.683  (68.3%)
  P(μ - 2σ ≤ X ≤ μ + 2σ) ≈ 0.954  (95.4%)
  P(μ - 3σ ≤ X ≤ μ + 3σ) ≈ 0.997  (99.7%)

Practical use:
  INR ~ N(2.5, 0.25)  (mean 2.5, std 0.5)
  68% of readings between 2.0 and 3.0 (therapeutic range)
  95% between 1.5 and 3.5
  Values beyond 3σ from mean (<1.0 or >4.0) are very unusual
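
The three percentages are not something to memorise blindly; they follow from the standard normal CDF. A minimal sketch, assuming SciPy is available (`prob_within` is a hypothetical helper name):

```python
from scipy.stats import norm

def prob_within(k: float) -> float:
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal X."""
    # After standardising, the answer depends only on k.
    return float(norm.cdf(k) - norm.cdf(-k))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")  # 0.6827, 0.9545, 0.9973

# Same check on the INR example: N(2.5, 0.5^2), range 2.0 to 3.0 is mu ± 1 sigma
inr = norm(loc=2.5, scale=0.5)
p_range = float(inr.cdf(3.0) - inr.cdf(2.0))
print(f"P(2.0 <= INR <= 3.0) = {p_range:.4f}")  # 0.6827
```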

Z-Scores and Standardisation

z = (x - μ) / σ

Z-score: how many standard deviations x is from the mean
  z = 0: exactly at the mean
  z = 1: one std above mean
  z = -2: two stds below mean

Any N(μ, σ²) variable can be standardised to N(0, 1):
  If X ~ N(μ, σ²) then Z = (X - μ)/σ ~ N(0, 1)
Python
import numpy as np
from scipy.stats import norm

# INR example
mu_inr, sigma_inr = 2.5, 0.5

def z_score(x: float, mu: float, sigma: float) -> float:
    return (x - mu) / sigma

inr_value = 3.8
z = z_score(inr_value, mu_inr, sigma_inr)
print(f"INR = {inr_value}: z = {z:.2f}")  # z = 2.6

# Probability of seeing INR ≥ 3.8
p_above = norm(mu_inr, sigma_inr).sf(inr_value)
print(f"P(INR ≥ {inr_value}) = {p_above:.4f}")  # 0.0047 → rare

# Feature standardisation in ML
X = np.random.normal(loc=5, scale=2, size=(1000, 10))
X_standardised = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(f"After standardisation: mean≈{X_standardised.mean(axis=0).round(2)}, std≈{X_standardised.std(axis=0, ddof=1).round(2)}")

The Central Limit Theorem (CLT)

The CLT is arguably the most useful theorem in statistics for ML:

If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and finite variance σ²,
then for large n:

  x̄ = (1/n) Σ Xᵢ ~ N(μ, σ²/n)   (approximately)

In words: the SAMPLE MEAN is approximately normally distributed,
whatever the shape of the underlying distribution (as long as its
variance is finite), when n is large enough.

Practical implication:
  You can use normal distribution theory for means and averages
  even when the underlying data is not normally distributed.
  
  Rule of thumb: n ≥ 30 is usually sufficient for CLT to kick in
  (n ≥ 100 for very skewed distributions)
Python
import numpy as np
from scipy.stats import expon, shapiro

# Exponential distribution (not normal): heavily right-skewed
exp_dist = expon(scale=2)  # mean = 2, std = 2

# Means of n=50 draws each: should be approximately normal
n_trials = 10_000
n_per_sample = 50
sample_means = np.array([
    exp_dist.rvs(n_per_sample).mean()
    for _ in range(n_trials)
])

print("Distribution of sample means:")
print(f"  Mean: {sample_means.mean():.3f}")       # ≈ 2.0 (true mean)
print(f"  Std:  {sample_means.std(ddof=1):.3f}")  # ≈ σ/√n = 2/√50 ≈ 0.283

# Check normality with Shapiro-Wilk (SciPy's p-value may be inaccurate for n > 5000)
stat, p = shapiro(sample_means[:1000])
print(f"Shapiro-Wilk test on sample means: p = {p:.4f}")
# p > 0.05 → fail to reject normality.
# Caveat: at n=50 the means still carry some right skew (≈ 2/√50 ≈ 0.28 for
# exponential data), so with 1000 means the test can still reject.
# A small p here means "not exactly normal", not that the CLT failed;
# a histogram or Q-Q plot shows how close the approximation is.
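
The rule of thumb for n can be probed empirically. For exponential data the skewness of the sample mean is 2/√n in theory, so it shrinks slowly as n grows. The sketch below is one way to see that; the seed, trial count, and the helper name `sample_mean_skewness` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

def sample_mean_skewness(n: int, n_trials: int = 10_000) -> float:
    """Empirical skewness of the mean of n Exponential(scale=2) draws."""
    draws = rng.exponential(scale=2.0, size=(n_trials, n))
    return float(skew(draws.mean(axis=1)))

for n in (5, 30, 100, 400):
    empirical = sample_mean_skewness(n)
    theory = 2 / np.sqrt(n)  # skewness of the mean for exponential data
    print(f"n={n:>3}: empirical skew={empirical:.3f}, theory={theory:.3f}")
```

A skewness near 0 indicates the bell shape is a good approximation; at n = 30 the theoretical value is still about 0.37, which is why heavily skewed data may need a larger n.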

Confidence Intervals from CLT

Python
import numpy as np
from scipy.stats import norm

# 95% confidence interval for the population mean (normal approximation via CLT)
def confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95,
) -> tuple[float, float]:
    n = len(data)
    mean = np.mean(data)
    sem = np.std(data, ddof=1) / np.sqrt(n)  # standard error of mean

    # z-score for confidence level (from N(0,1))
    z_crit = norm.ppf((1 + confidence) / 2)   # 1.96 for 95%
    margin = z_crit * sem

    return float(mean - margin), float(mean + margin)

# Example: model accuracy estimated from 100 test samples.
# y_test and y_pred are synthetic placeholders here (a simulated classifier
# that is right about 85% of the time); substitute real labels and predictions.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=100)
y_pred = np.where(rng.random(100) < 0.85, y_test, 1 - y_test)

test_accuracy = (y_pred == y_test).astype(float)  # 1.0 if correct, else 0.0
lower, upper = confidence_interval(test_accuracy, 0.95)
print(f"Accuracy: {test_accuracy.mean():.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
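
What "95%" means can be checked by simulation: if the CLT approximation is good, roughly 95% of intervals built this way should contain the true mean. The sketch below assumes skewed Exponential(scale=2) data with n = 100 per sample; `z_interval` and `empirical_coverage` are hypothetical helper names, and the seed and trial count are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

def z_interval(data: np.ndarray, confidence: float = 0.95) -> tuple[float, float]:
    """CLT-based interval: mean ± z * s / sqrt(n)."""
    sem = float(np.std(data, ddof=1) / np.sqrt(len(data)))
    z_crit = float(norm.ppf((1 + confidence) / 2))
    centre = float(np.mean(data))
    return centre - z_crit * sem, centre + z_crit * sem

def empirical_coverage(n: int = 100, n_trials: int = 5_000) -> float:
    """Fraction of intervals that contain the true mean of Exponential(scale=2)."""
    true_mean = 2.0
    hits = 0
    for _ in range(n_trials):
        sample = rng.exponential(scale=true_mean, size=n)
        lower, upper = z_interval(sample)
        hits += int(lower <= true_mean <= upper)
    return hits / n_trials

coverage = empirical_coverage()
print(f"Empirical coverage of nominal 95% intervals: {coverage:.3f}")
```

On skewed data the coverage typically lands slightly below the nominal level; for small n, a t-based interval (`scipy.stats.t.ppf`) is the usual alternative.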

Why Normal Distribution Is Everywhere in ML

1. Feature distributions:
   Many real measurements are approximately normal (heights, IQ, many lab values)
   → Gaussian Naive Bayes, LDA assume normal features per class

2. Weight initialisation (second argument is the variance):
   Xavier init: W ~ N(0, 2/(fan_in + fan_out))
   He init:     W ~ N(0, 2/fan_in)
   Chosen so activation variance stays roughly constant across layers
   (signals neither vanish nor explode)

3. Noise in regression:
   y = w·x + ε, ε ~ N(0, σ²)
   Maximising the Gaussian likelihood over w = minimising the sum of squared errors (MSE)

4. CLT → gradient statistics:
   Mini-batch gradient = mean of n per-sample gradients
   By CLT, roughly normally distributed → motivates modelling SGD noise as Gaussian

5. Errors and residuals:
   If the Gaussian-noise assumption holds, residuals of a well-specified model look approximately normal
   Checking residual normality (e.g. with a Q-Q plot) tests that assumption

6. Gaussian Processes:
   Every finite collection of outputs ~ multivariate Gaussian
   Enables closed-form Bayesian inference
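
Point 3 in the list above can be verified numerically: with Gaussian noise, the weights that maximise the likelihood coincide with the least-squares solution. This is a minimal sketch on synthetic data; `w_true`, `sigma`, the sample size, and the seed are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])   # hypothetical ground-truth weights
sigma = 0.3                           # hypothetical noise std
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w: np.ndarray) -> float:
    """-log p(y | X, w) under y = X w + eps, eps ~ N(0, sigma^2)."""
    residuals = y - X @ w
    return float(-norm.logpdf(residuals, loc=0.0, scale=sigma).sum())

def nll_gradient(w: np.ndarray) -> np.ndarray:
    """Gradient of the negative log-likelihood with respect to w."""
    return -(X.T @ (y - X @ w)) / sigma**2

# Maximum likelihood: minimise the negative log-likelihood numerically
w_mle = minimize(neg_log_likelihood, x0=np.zeros(d), jac=nll_gradient).x

# Least squares: minimise the sum of squared errors in closed form
w_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("MLE weights:          ", np.round(w_mle, 4))
print("Least-squares weights:", np.round(w_lsq, 4))
print("Match:", np.allclose(w_mle, w_lsq, atol=1e-4))
```

The two agree because the negative log-likelihood equals a constant plus SSE/(2σ²), so σ changes the scale of the objective but not the location of its minimum.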

Interview Answer

"The normal distribution N(μ, σ²) is characterised by its bell shape, with 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations of the mean respectively. Z-scores standardise any normal variable to N(0,1). The Central Limit Theorem explains why it's ubiquitous: the sample mean of any i.i.d. random variables converges to a normal distribution as n grows — enabling normal distribution theory for means regardless of the underlying data shape. In ML: weight initialisation uses small-variance Gaussians to keep activations stable; MSE loss implicitly assumes Gaussian errors; and the CLT justifies treating mini-batch gradient estimates as approximately normally distributed."