Statistics & Math for AI/ML Interviews · Lesson 2 of 30
Standard Deviation vs Variance
What They Measure
Mean tells you where the centre is. Variance and standard deviation tell you how spread out the data is around that centre.
Dataset A: [5, 5, 5, 5, 5], mean=5, population std=0 (no spread)
Dataset B: [1, 3, 5, 7, 9], mean=5, population std=2.83 (moderate spread)
Dataset C: [0, 0, 5, 10, 10], mean=5, population std=4.47 (high spread)
Formulas
Population variance (σ²):
σ² = (1/N) × Σ(xᵢ - μ)²
Population standard deviation (σ):
σ = √σ²
Sample variance (s²) — use when data is a sample from a larger population:
s² = (1/(n-1)) × Σ(xᵢ - x̄)²
The (n-1) denominator is Bessel's correction; it makes s² an unbiased estimator of σ²
Sample standard deviation (s):
s = √s²
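The claim that the (n-1) denominator makes s² unbiased can be checked with a small simulation. The sketch below is only an illustration: the normal population, the true variance of 4, the sample size of 5, the trial count, and the seed are all assumptions chosen for the demo, not values from this lesson. It draws many small samples and averages the two variance estimates.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability

true_var = 4.0     # assumed population variance (sigma = 2)
n = 5              # small sample size, where the bias is easiest to see
trials = 200_000   # number of repeated samples

# Each row is one sample of size n drawn from N(0, true_var).
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

# Average each estimator over all trials.
biased = samples.var(axis=1, ddof=0).mean()    # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print(f"true variance:            {true_var:.3f}")
print(f"mean of 1/n estimate:     {biased:.3f}")    # close to (n-1)/n * 4 = 3.2
print(f"mean of 1/(n-1) estimate: {unbiased:.3f}")  # close to 4.0
```

On average the 1/n estimate lands near (n-1)/n of the true variance, while the 1/(n-1) estimate lands near the true value. Note that s itself (the square root) is still slightly biased; only s² is unbiased.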
Why (n-1)?
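The sample mean x̄ is computed from the same data, so the deviations from x̄ are, on average, smaller than the deviations from the true mean μ
Dividing by n therefore underestimates the true population variance, by a factor of (n-1)/n on average
Dividing by n-1 corrects this; the correction matters most for small samples
Step-by-Step Calculation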
Data: [4, 7, 13, 16] (sample)
Step 1: Mean
x̄ = (4 + 7 + 13 + 16) / 4 = 10
Step 2: Deviations from mean
4 - 10 = -6
7 - 10 = -3
13 - 10 = 3
16 - 10 = 6
Step 3: Squared deviations
(-6)² = 36
(-3)² = 9
(3)² = 9
(6)² = 36
Step 4: Sample variance
s² = (36 + 9 + 9 + 36) / (4 - 1) = 90 / 3 = 30
Step 5: Sample std
s = √30 ≈ 5.48
Implementation
import numpy as np
data = [4, 7, 13, 16]
# NumPy: ddof=1 for sample, ddof=0 (default) for population
population_std = np.std(data, ddof=0) # 4.74
sample_std = np.std(data, ddof=1) # 5.48
population_var = np.var(data, ddof=0) # 22.5
sample_var = np.var(data, ddof=1) # 30.0
# Pandas: ddof=1 by default (sample)
import pandas as pd
s = pd.Series(data)
print(s.std()) # 5.48 (sample)
print(s.var()) # 30.0 (sample)
# Manual computation
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
sample_var_manual = sum(squared_diffs) / (len(data) - 1)
sample_std_manual = sample_var_manual ** 0.5
Standard Deviation in Machine Learning
# Feature normalisation (z-score)
# Transforms features to mean=0, std=1
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 200], [2, 150], [3, 300], [4, 250]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Now each column has mean ≈ 0 and std ≈ 1 (StandardScaler divides by the population std, ddof=0)
# Weight initialisation — critical for training stability
# Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
import torch
import torch.nn as nn
layer = nn.Linear(512, 256)
nn.init.xavier_uniform_(layer.weight) # std ≈ 0.051 = sqrt(2 / (512 + 256)), not too large or small
# Batch normalisation: computes the mean and variance of each feature over the batch
# Normalises activations, stabilises training
bn = nn.BatchNorm1d(256)
# Loss variance monitoring — sudden std spike = unstable training
losses = []
def monitor_loss_std(loss, window=100):
    # Record the latest loss and check the spread of the last `window` values
    losses.append(float(loss))
    if len(losses) >= window:
        recent_std = np.std(losses[-window:])
        if recent_std > 0.5: # illustrative threshold; tune per task
            print(f"Warning: high loss std {recent_std:.3f}, check LR")
The 68-95-99.7 Rule (Normal Distribution)
For normally distributed data:
μ ± 1σ contains ~68% of values
μ ± 2σ contains ~95% of values
μ ± 3σ contains ~99.7% of values
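The three percentages above can be checked empirically. The following sketch assumes a normal distribution with mean 10 and standard deviation 2 (arbitrary illustrative values, as is the seed) and measures the fraction of one million draws that fall within 1, 2, and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for repeatability
mu, sigma = 10.0, 2.0            # illustrative parameters, not from the lesson

x = rng.normal(mu, sigma, size=1_000_000)

coverage = {}
for k in (1, 2, 3):
    # Fraction of draws whose distance from the mean is at most k * sigma.
    within = np.abs(x - mu) <= k * sigma
    coverage[k] = within.mean()
    print(f"within {k} sigma: {coverage[k]:.4f}")
```

The printed fractions should sit close to 0.68, 0.95, and 0.997. The rule holds only for approximately normal data; heavy-tailed or skewed features can deviate from it substantially.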
Practical use in ML:
Outlier detection: flag values beyond μ ± 3σ
Confidence intervals: x̄ ± 1.96 × σ/√n gives an approximate 95% CI for the mean (μ ± 1.96σ is the range that covers 95% of individual values)
Anomaly detection thresholds
Interview Answer
"Variance is the average squared deviation from the mean; standard deviation is its square root. Both measure spread, but standard deviation is in the same units as the data. Use the sample formulas (divide by n-1, Bessel's correction) when working with a sample rather than the full population. In ML, standard deviation drives z-score normalisation (subtract the mean, divide by the std), which equalises feature scales for gradient-based optimisation; weight initialisation schemes set the std from the layer sizes, for example √(2/(fan_in + fan_out)) for Xavier/Glorot and √(2/fan_in) for He, to keep activations stable through layers; and monitoring the standard deviation of the loss across batches catches training instability early."
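The weight-initialisation standard deviations mentioned in this lesson can be sanity-checked numerically. The sketch below uses only NumPy (it does not call PyTorch) and assumes a 512 → 256 layer. It draws weights from the Xavier/Glorot uniform range U(-a, a) with a = √(6/(fan_in + fan_out)), compares the empirical std with √(2/(fan_in + fan_out)), and prints the He value √(2/fan_in) for comparison. The seed is arbitrary.

```python
import numpy as np

fan_in, fan_out = 512, 256  # layer sizes assumed for illustration

# Xavier/Glorot uniform samples from U(-a, a) with this bound.
a = np.sqrt(6.0 / (fan_in + fan_out))

# A uniform distribution on (-a, a) has std a / sqrt(3),
# which simplifies to sqrt(2 / (fan_in + fan_out)).
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))

# He initialisation uses only fan_in.
he_std = np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)  # arbitrary seed
weights = rng.uniform(-a, a, size=(fan_out, fan_in))
empirical_std = weights.std()

print(f"Xavier target std: {xavier_std:.4f}")
print(f"empirical std:     {empirical_std:.4f}")
print(f"He target std:     {he_std:.4f}")
```

For these layer sizes the Xavier target is about 0.051 and the He target is 0.0625, so the two schemes are close but not interchangeable.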