The Normal Distribution
The bell curve in depth — its parameters, the 68-95-99.7 rule, the Central Limit Theorem, z-scores, and why the normal distribution is everywhere in ML.
The Formula
f(x) = (1 / (σ√(2π))) × exp(-(x - μ)² / (2σ²))
Parameters:
μ = mean (centre of the bell curve)
σ = standard deviation (width of the bell curve)
σ² = variance
Standard normal: μ = 0, σ = 1 → denoted N(0, 1)
General normal: X ~ N(μ, σ²)
The 68-95-99.7 Rule
For X ~ N(μ, σ²):
P(μ - σ ≤ X ≤ μ + σ) ≈ 0.683 (68.3%)
P(μ - 2σ ≤ X ≤ μ + 2σ) ≈ 0.954 (95.4%)
P(μ - 3σ ≤ X ≤ μ + 3σ) ≈ 0.997 (99.7%)
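These three probabilities can be reproduced from the normal CDF. The sketch below assumes SciPy is available; the values of `mu` and `sigma` are arbitrary illustrations, since the coverage within k standard deviations is the same for every normal distribution.

```python
from scipy.stats import norm

# Arbitrary illustrative parameters; any mu and any sigma > 0 give the same coverage.
mu, sigma = 10.0, 3.0
dist = norm(loc=mu, scale=sigma)

# P(mu - k*sigma <= X <= mu + k*sigma) = CDF(upper) - CDF(lower)
coverage = {
    k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    for k in (1, 2, 3)
}

for k, p in coverage.items():
    print(f"within {k}σ: {p:.4f}")
# within 1σ: 0.6827
# within 2σ: 0.9545
# within 3σ: 0.9973
```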
Practical use:
INR ~ N(2.5, 0.25) (mean 2.5, std 0.5)
68% of readings between 2.0 and 3.0 (therapeutic range)
95% between 1.5 and 3.5
Values beyond 3σ from mean (<1.0 or >4.0) are very unusual
Z-Scores and Standardisation
z = (x - μ) / σ
Z-score: how many standard deviations x is from the mean
z = 0: exactly at the mean
z = 1: one std above mean
z = -2: two stds below mean
Any N(μ, σ²) variable can be standardised to N(0, 1):
If X ~ N(μ, σ²) then Z = (X - μ)/σ ~ N(0, 1)
import numpy as np
from scipy.stats import norm
# INR example
mu_inr, sigma_inr = 2.5, 0.5
def z_score(x: float, mu: float, sigma: float) -> float:
    return (x - mu) / sigma
inr_value = 3.8
z = z_score(inr_value, mu_inr, sigma_inr)
print(f"INR = {inr_value}: z = {z:.2f}") # z = 2.6
# Probability of seeing INR ≥ 3.8
p_above = norm(mu_inr, sigma_inr).sf(inr_value)
print(f"P(INR ≥ {inr_value}) = {p_above:.4f}") # 0.0047 — rare
# Feature standardisation in ML
X = np.random.normal(loc=5, scale=2, size=(1000, 10))
X_standardised = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(f"After standardisation: mean≈{X_standardised.mean(axis=0).round(2)}, std≈{X_standardised.std(axis=0, ddof=1).round(2)}")
The Central Limit Theorem (CLT)
The CLT is the main reason normal-distribution methods apply so widely in statistics and ML:
If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and finite variance σ²,
then for large n:
x̄ = (1/n) Σ Xᵢ ~ N(μ, σ²/n) (approximately)
In words: the SAMPLE MEAN is approximately normally distributed,
regardless of the underlying distribution, when n is large enough.
Practical implication:
You can use normal distribution theory for means and averages
even when the underlying data is not normally distributed.
Rule of thumb: n ≥ 30 is usually sufficient for CLT to kick in
(n ≥ 100 for very skewed distributions)
import numpy as np
from scipy.stats import expon
# Exponential distribution (not normal), heavily right-skewed
exp_dist = expon(scale=2) # mean = 2, std = 2
# Means of 10,000 samples of size n=50; these should be approximately normal
n_trials = 10_000
n_per_sample = 50
sample_means = [
    exp_dist.rvs(n_per_sample).mean()
    for _ in range(n_trials)
]
print("Distribution of sample means:")
print(f" Mean: {np.mean(sample_means):.3f}") # ≈ 2.0 (true mean)
print(f" Std: {np.std(sample_means, ddof=1):.3f}") # ≈ σ/√n = 2/√50 = 0.283
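# Check normality
from scipy.stats import shapiro
stat, p = shapiro(sample_means[:1000]) # Shapiro-Wilk test (n≤5000)
print(f"Shapiro-Wilk test on sample means: p = {p:.4f}")
# p > 0.05 → fail to reject normality. With n = 50 from a source this skewed,
# the sample means keep some skew (≈ 2/√50 ≈ 0.28), so the test may still reject;
# raising n_per_sample moves the distribution closer to normal.
Confidence Intervals from CLT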
# 95% confidence interval for the population mean
def confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95,
) -> tuple[float, float]:
    n = len(data)
    mean = np.mean(data)
    sem = np.std(data, ddof=1) / np.sqrt(n) # standard error of mean
    # z-score for confidence level (from N(0,1))
    z_crit = norm.ppf((1 + confidence) / 2) # 1.96 for 95%
    margin = z_crit * sem
    return float(mean - margin), float(mean + margin)
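# Example: model accuracy estimated from 100 test samples
# (y_test and y_pred are synthetic stand-ins; substitute your own labels and predictions)
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=100)
y_pred = np.where(rng.random(100) < 0.85, y_test, 1 - y_test) # correct about 85% of the time
test_accuracy = (y_pred == y_test).astype(float) # 1.0 if correct, 0.0 otherwise
lower, upper = confidence_interval(test_accuracy, 0.95)
print(f"Accuracy: {test_accuracy.mean():.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
Why Normal Distribution Is Everywhere in ML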
1. Feature distributions:
Many real measurements are approximately normal (heights, IQ, many lab values)
→ Gaussian Naive Bayes, LDA assume normal features per class
2. Weight initialisation:
Xavier init: W ~ N(0, 2/(fan_in + fan_out))
He init: W ~ N(0, 2/fan_in)
Chosen so activations don't vanish/explode
3. Noise in regression:
y = w·x + ε, ε ~ N(0, σ²)
MLE of this model = minimising MSE (the Gaussian log-likelihood is a constant minus a positive multiple of the squared error)
4. CLT → gradient statistics:
Mini-batch gradient = mean of n per-sample gradients
By CLT, roughly normally distributed → motivates modelling gradient noise as Gaussian when analysing SGD
5. Errors and residuals:
If the noise is Gaussian, a well-specified regression model has approximately normally distributed residuals
Checking normality of residuals (e.g. with a Q-Q plot) tests that assumption
6. Gaussian Processes:
Every finite collection of outputs ~ multivariate Gaussian
Enables closed-form Bayesian inference
Interview Answer
"The normal distribution N(μ, σ²) is characterised by its bell shape, with about 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations of the mean respectively. Z-scores standardise any normal variable to N(0,1). The Central Limit Theorem explains why it's ubiquitous: the sample mean of i.i.d. random variables with finite variance is approximately normal for large n, so normal distribution theory applies to means regardless of the underlying data shape. In ML: weight initialisation uses Gaussians with variance scaled to layer width to keep activations stable; minimising MSE is maximum likelihood under Gaussian errors; and the CLT justifies treating mini-batch gradient estimates as approximately normally distributed."
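The link between Gaussian noise and MSE stated in the interview answer (point 3 of the list above) can be checked numerically. The sketch below is illustrative only: the data are synthetic, the slope 1.5 and noise level 0.7 are arbitrary choices, and `gaussian_nll` and `mse` are hypothetical helper names. It verifies the identity NLL(w) = (n/2)·log(2πσ²) + n·MSE(w)/(2σ²) for a one-parameter model y = w·x + ε, which implies that both objectives are minimised by the same w.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data (assumed for illustration): y = 1.5 * x + Gaussian noise
rng = np.random.default_rng(0)
n = 200
sigma = 0.7
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=sigma, size=n)

def gaussian_nll(w: float) -> float:
    """Negative log-likelihood of y under y ~ N(w * x, sigma^2)."""
    return float(-norm.logpdf(y, loc=w * x, scale=sigma).sum())

def mse(w: float) -> float:
    """Mean squared error of the prediction w * x."""
    return float(np.mean((y - w * x) ** 2))

# Evaluate both objectives on a grid of candidate slopes
ws = np.linspace(0.0, 3.0, 301)
nll_vals = np.array([gaussian_nll(w) for w in ws])
mse_vals = np.array([mse(w) for w in ws])

# NLL is a constant plus a positive multiple of MSE
const = 0.5 * n * np.log(2 * np.pi * sigma**2)
assert np.allclose(nll_vals, const + n * mse_vals / (2 * sigma**2))

# So the same slope minimises both
w_nll = ws[nll_vals.argmin()]
w_mse = ws[mse_vals.argmin()]
print(f"argmin NLL: {w_nll:.2f}, argmin MSE: {w_mse:.2f}")
```

Because σ only rescales and shifts the objective, the minimising w does not depend on its value, which is why MSE can be used without knowing the noise level.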