Standard Deviation and Variance
What variance and standard deviation measure, how to compute them, population vs sample formulas, and their role in machine learning.
What They Measure
The mean tells you where the centre of the data is. Variance and standard deviation tell you how spread out the data is around that centre.
Dataset A: [5, 5, 5, 5, 5], mean = 5, population std = 0 (no spread)
Dataset B: [1, 3, 5, 7, 9], mean = 5, population std ≈ 2.83 (moderate spread)
Dataset C: [0, 0, 5, 10, 10], mean = 5, population std ≈ 4.47 (high spread)
Formulas
Population variance (σ²):
σ² = (1/N) × Σ(xᵢ - μ)²
Population standard deviation (σ):
σ = √σ²
Sample variance (s²) — use when data is a sample from a larger population:
s² = (1/(n-1)) × Σ(xᵢ - x̄)²
The (n-1) denominator is Bessel's correction; it makes s² an unbiased estimator of σ²
Sample standard deviation (s):
s = √s²
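The claim that the (n-1) denominator makes s² unbiased can be checked empirically. The sketch below is an illustration, not part of the original material: it draws many small samples from a normal distribution with a known variance and averages both estimates. The helper names `variance` and `average_estimates`, the sample size of 5, the trial count and the seed are all assumptions chosen for the demonstration.

```python
import random

def variance(values, ddof):
    """Average squared deviation from the sample mean, dividing by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

def average_estimates(true_std=2.0, n=5, trials=20000, seed=0):
    """Draw many small samples and average both variance estimates."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_std) for _ in range(n)]
        biased_total += variance(sample, ddof=0)    # divide by n
        unbiased_total += variance(sample, ddof=1)  # divide by n - 1
    return biased_total / trials, unbiased_total / trials

biased_mean, unbiased_mean = average_estimates()
print(f"true variance:         {2.0 ** 2:.2f}")
print(f"mean of n estimates:   {biased_mean:.2f}")
print(f"mean of n-1 estimates: {unbiased_mean:.2f}")
```

With a true variance of 4 and n = 5, the divide-by-n average should settle near (n-1)/n × 4 = 3.2, while the divide-by-(n-1) average should settle near 4. The gap shrinks as n grows.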
Why (n-1)?
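The sample mean x̄ is computed from the same data, so deviations from x̄ are on average smaller than deviations from the true mean μ
Dividing by n therefore underestimates the true population variance, by a factor of (n-1)/n on average
Dividing by n-1 corrects this; the difference matters most for small sample sizes
Step-by-Step Calculation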
Data: [4, 7, 13, 16] (sample)
Step 1: Mean
x̄ = (4 + 7 + 13 + 16) / 4 = 10
Step 2: Deviations from mean
4 - 10 = -6
7 - 10 = -3
13 - 10 = 3
16 - 10 = 6
Step 3: Squared deviations
(-6)² = 36
(-3)² = 9
(3)² = 9
(6)² = 36
Step 4: Sample variance
s² = (36 + 9 + 9 + 36) / (4 - 1) = 90 / 3 = 30
Step 5: Sample std
s = √30 ≈ 5.48
Implementation
import numpy as np
data = [4, 7, 13, 16]
# NumPy: ddof=1 for sample, ddof=0 (default) for population
population_std = np.std(data, ddof=0) # 4.74
sample_std = np.std(data, ddof=1) # 5.48
population_var = np.var(data, ddof=0) # 22.5
sample_var = np.var(data, ddof=1) # 30.0
# Pandas: ddof=1 by default (sample)
import pandas as pd
s = pd.Series(data)
print(s.std()) # 5.48 (sample)
print(s.var()) # 30.0 (sample)
# Manual computation
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
sample_var_manual = sum(squared_diffs) / (len(data) - 1)
sample_std_manual = sample_var_manual ** 0.5
Standard Deviation in Machine Learning
# Feature normalisation (z-score)
# Transforms features to mean=0, std=1
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 200], [2, 150], [3, 300], [4, 250]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Now each column has mean ≈ 0, std ≈ 1
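To show what the StandardScaler call above computes, here is a minimal standard-library sketch of z-score normalisation for a single column. It assumes the scaler divides by the population standard deviation (ddof=0), which is the convention scikit-learn documents; `zscore_column` is a hypothetical helper name used only for this illustration.

```python
import statistics

def zscore_column(values):
    """Standardise one feature column using the population std (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, assumed to match the scaler
    return [(x - mean) / std for x in values]

# Second feature column of the X array above
column = [200, 150, 300, 250]
scaled = zscore_column(column)

print([round(z, 3) for z in scaled])
print("mean:", round(statistics.fmean(scaled), 6))
print("population std:", round(statistics.pstdev(scaled), 6))
print("sample std:", round(statistics.stdev(scaled), 4))
```

The scaled column has a population std of 1, but its sample std (ddof=1) is slightly larger. That explains why calling pandas `.std()` on scaled data does not return exactly 1.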
# Weight initialisation — critical for training stability
# Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
import torch
import torch.nn as nn
layer = nn.Linear(512, 256)
nn.init.xavier_uniform_(layer.weight) # std = sqrt(2 / 768) ≈ 0.051, neither too large nor too small
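The Xavier/Glorot formula can be worked through without PyTorch. The sketch below assumes a gain of 1 and the 512 → 256 layer from the snippet above. The uniform variant samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), and a uniform distribution on that range has std a / √3, which equals sqrt(2 / (fan_in + fan_out)). The function names are hypothetical helpers for illustration only.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Half-width a of the U(-a, a) range for Xavier/Glorot uniform init (gain 1)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def xavier_std(fan_in, fan_out):
    """Target std: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 512, 256  # matches nn.Linear(512, 256)
bound = xavier_uniform_bound(fan_in, fan_out)
target_std = xavier_std(fan_in, fan_out)

# Empirical check: sample weights uniformly and measure their std.
rng = random.Random(0)
samples = [rng.uniform(-bound, bound) for _ in range(50000)]
mean = sum(samples) / len(samples)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in samples) / len(samples))

print(f"bound a       = {bound:.4f}")
print(f"target std    = {target_std:.4f}")
print(f"empirical std = {empirical_std:.4f}")
```

For this layer the target std works out to about 0.051, and the sampled weights should match it closely.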
# Batch normalisation: computes mean and std per feature per batch
# Normalises activations, stabilises training
bn = nn.BatchNorm1d(256)
# Loss variance monitoring — sudden std spike = unstable training
losses = []
def monitor_loss_std(loss, window=100):
    losses.append(float(loss))
    if len(losses) >= window:
        recent_std = np.std(losses[-window:])
        if recent_std > 0.5: # threshold
            print(f"Warning: high loss variance {recent_std:.3f} — check LR")
The 68-95-99.7 Rule (Normal Distribution)
For normally distributed data:
μ ± 1σ contains ~68% of values
μ ± 2σ contains ~95% of values
μ ± 3σ contains ~99.7% of values
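These percentages follow directly from the normal cumulative distribution function. As a quick check, the sketch below uses Python's standard-library `statistics.NormalDist`; the `coverage` helper is a hypothetical name introduced for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")

# The 95% figure corresponds more precisely to about 1.96 sigma.
z_95 = standard_normal.inv_cdf(0.975)
print(f"z for 95% coverage: {z_95:.3f}")
```

The rule only holds for data that is approximately normal. Heavy-tailed or skewed data can put far more than 0.3% of values beyond 3σ.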
Practical use in ML:
Outlier detection: flag values beyond μ ± 3σ
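Confidence intervals: x̄ ± 1.96 × σ/√n gives an approximate 95% CI for the mean; μ ± 1.96σ is the range that covers about 95% of individual values
Anomaly detection thresholds: set alert limits as a multiple of σ from the mean
Interview Answer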
"Variance is the average squared deviation from the mean; standard deviation is its square root, and both measure spread. Use the sample formulas (divide by n-1, Bessel's correction) when working with a sample rather than the full population. In ML, standard deviation drives z-score normalisation (subtract the mean, divide by the std), which equalises feature scales for gradient-based optimisation; weight initialisation schemes set the std from the layer size, for example √(2/(fan_in + fan_out)) for Xavier/Glorot or √(2/fan_in) for He, to keep activations stable through layers; and monitoring the standard deviation of the loss across batches catches training instability early."