Statistics & Math for AI/ML Interviews · Lesson 2 of 30

Standard Deviation vs Variance

What They Measure

Mean tells you where the centre is. Variance and standard deviation tell you how spread out the data is around that centre.

Dataset A: [5, 5, 5, 5, 5]    mean=5, std=0     (no spread)
Dataset B: [1, 3, 5, 7, 9]    mean=5, std=2.83  (moderate spread)
Dataset C: [0, 0, 5, 10, 10]  mean=5, std=4.47  (high spread)
(std here is the population standard deviation, dividing by N)
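
The figures for the three datasets can be checked with Python's standard library. This is a minimal sketch: the names `datasets` and `summary` are only for illustration, `statistics.pstdev` divides by N (population) and `statistics.stdev` divides by n-1 (sample).

```python
import statistics

datasets = {
    "A": [5, 5, 5, 5, 5],
    "B": [1, 3, 5, 7, 9],
    "C": [0, 0, 5, 10, 10],
}

# Same mean in every case, very different spread.
summary = {
    name: (statistics.mean(values), statistics.pstdev(values), statistics.stdev(values))
    for name, values in datasets.items()
}

for name, (mean, pop_std, sample_std) in summary.items():
    print(f"Dataset {name}: mean={mean}, population std={pop_std:.2f}, sample std={sample_std:.2f}")
```

The population values match the list above (0, 2.83, 4.47); the sample values are slightly larger (0, 3.16, 5.00) because of the n-1 denominator.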

Formulas

Population variance (σ²):
  σ² = (1/N) × Σ(xᵢ - μ)²

Population standard deviation (σ):
  σ = √σ²

Sample variance (s²), used when the data is a sample from a larger population:
  s² = (1/(n-1)) × Σ(xᵢ - x̄)²
  The (n-1) denominator is Bessel's correction; it makes s² an unbiased estimator of σ²

Sample standard deviation (s):
  s = √s²

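Why (n-1)?
  Deviations are measured from the sample mean x̄, not the true mean μ
  x̄ is the value that minimises Σ(xᵢ - x̄)², so this sum is on average smaller than Σ(xᵢ - μ)²
  Dividing by n therefore underestimates the population variance, by a factor of (n-1)/n on average
  Dividing by n-1 removes this bias; the difference matters most for small samples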
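
The bias can be seen in a small simulation. The sketch below is illustrative only: the population (normal, σ = 2), the sample size, the seed, and the helper name `variance_estimates` are all assumptions chosen for the demo.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations from x-bar
    return ss / n, ss / (n - 1)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_sigma = 2.0         # population std, so the true variance is 4.0
n, trials = 5, 20_000

biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, true_sigma) for _ in range(n)]
    biased, unbiased = variance_estimates(sample)
    biased_total += biased
    unbiased_total += unbiased

biased_avg = biased_total / trials      # expected near 4.0 * (n-1)/n = 3.2
unbiased_avg = unbiased_total / trials  # expected near 4.0
print(f"divide by n:   {biased_avg:.2f}")
print(f"divide by n-1: {unbiased_avg:.2f}")
```

Averaged over many small samples, the divide-by-n estimate settles below the true variance, while the divide-by-(n-1) estimate settles close to it.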

Step-by-Step Calculation

Data: [4, 7, 13, 16]  (sample)

Step 1: Mean
  x̄ = (4 + 7 + 13 + 16) / 4 = 10

Step 2: Deviations from mean
  4 - 10 = -6
  7 - 10 = -3
  13 - 10 = 3
  16 - 10 = 6

Step 3: Squared deviations
  (-6)² = 36
  (-3)² = 9
  (3)²  = 9
  (6)²  = 36

Step 4: Sample variance
  s² = (36 + 9 + 9 + 36) / (4 - 1) = 90 / 3 = 30

Step 5: Sample std
  s = √30 ≈ 5.48

Implementation

Python
import numpy as np

data = [4, 7, 13, 16]

# NumPy: ddof=1 for sample, ddof=0 (default) for population
population_std = np.std(data, ddof=0)   # 4.74
sample_std     = np.std(data, ddof=1)   # 5.48

population_var = np.var(data, ddof=0)   # 22.5
sample_var     = np.var(data, ddof=1)   # 30.0

# Pandas: ddof=1 by default (sample)
import pandas as pd
s = pd.Series(data)
print(s.std())   # 5.48 (sample)
print(s.var())   # 30.0 (sample)

# Manual computation
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
sample_var_manual = sum(squared_diffs) / (len(data) - 1)
sample_std_manual = sample_var_manual ** 0.5

Standard Deviation in Machine Learning

Python
# Feature normalisation (z-score)
# Transforms each feature to mean=0, std=1
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 200], [2, 150], [3, 300], [4, 250]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has mean ≈ 0 and std ≈ 1
# (StandardScaler divides by the population std, ddof=0)

# Weight initialisation: critical for training stability
# Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
import torch
import torch.nn as nn

layer = nn.Linear(512, 256)
nn.init.xavier_uniform_(layer.weight)  # std ≈ sqrt(2 / 768) ≈ 0.05: not too large or small

# Batch normalisation: computes mean and variance per feature over each batch,
# normalises activations, and stabilises training
bn = nn.BatchNorm1d(256)

# Loss std monitoring: a sudden spike in std suggests unstable training
losses = []
def monitor_loss_std(loss, window=100, threshold=0.5):
    losses.append(float(loss))
    if len(losses) >= window:
        recent_std = np.std(losses[-window:])
        # The threshold is problem-specific; 0.5 is only a placeholder
        if recent_std > threshold:
            print(f"Warning: high loss std {recent_std:.3f}; check the learning rate")
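
To make the role of standard deviation explicit, the z-score step and the Xavier/Glorot target std can be reproduced without scikit-learn or PyTorch. This is a sketch under stated assumptions, not how either library is implemented: `zscore_columns` and `xavier_std` are hypothetical helper names, and the sampling check relies on the fact that a uniform distribution on (-a, a) has std a/√3.

```python
import math
import random
import statistics

def zscore_columns(rows):
    """Standardise each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # divide by N (ddof=0)
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def xavier_std(fan_in, fan_out):
    """Target std for Xavier/Glorot initialisation."""
    return math.sqrt(2.0 / (fan_in + fan_out))

X = [[1, 200], [2, 150], [3, 300], [4, 250]]
X_scaled = zscore_columns(X)
for col in zip(*X_scaled):
    print(f"mean={statistics.mean(col):.3f}, std={statistics.pstdev(col):.3f}")

# Xavier *uniform* draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
# A uniform on (-a, a) has std a / sqrt(3), which equals xavier_std(fan_in, fan_out).
fan_in, fan_out = 512, 256
a = math.sqrt(6.0 / (fan_in + fan_out))
rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.uniform(-a, a) for _ in range(20_000)]
print(f"target std={xavier_std(fan_in, fan_out):.4f}, sampled std={statistics.pstdev(weights):.4f}")
```

For a 512 → 256 layer the target std is about 0.051, and the sampled weights should land close to it.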

The 68-95-99.7 Rule (Normal Distribution)

For normally distributed data:
  μ ± 1σ  contains ~68% of values
  μ ± 2σ  contains ~95% of values
  μ ± 3σ  contains ~99.7% of values

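Practical use in ML:
  Outlier detection: flag values beyond μ ± 3σ
  Spread of individual values: μ ± 1.96σ covers ~95% of values
  Confidence interval for the mean: x̄ ± 1.96 × s/√n gives an approximate 95% CI (large n)
  Anomaly detection thresholds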
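
The percentages in the rule, and the 1.96 multiplier, come straight from the normal CDF. The sketch below uses `statistics.NormalDist` (Python 3.8+); `coverage`, `flag_outliers`, and the `readings` data are made up for illustration, and the 3σ rule assumes roughly normal data.

```python
from statistics import NormalDist, mean, pstdev

standard_normal = NormalDist()  # mean 0, std 1

def coverage(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")

# 1.96 is the z-value that leaves 2.5% in each tail of the standard normal.
z_95 = standard_normal.inv_cdf(0.975)
print(f"z for 95%: {z_95:.2f}")

def flag_outliers(values, k=3.0):
    """Return the values lying more than k population stds from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]

readings = [9, 10, 11] * 10 + [50]  # made-up data with one obvious outlier
print(flag_outliers(readings))      # [50]
```

Note that an extreme value inflates the mean and std it is compared against, so on very small samples the 3σ rule can fail to flag anything.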

Interview Answer

"Variance is the average squared deviation from the mean; standard deviation is its square root, so it is in the same units as the data. Both measure spread. Use the sample formulas (divide by n-1, Bessel's correction) when working with a sample rather than the full population. In ML, standard deviation drives z-score normalisation (subtract the mean, divide by the std), which equalises feature scales for gradient-based optimisation; weight initialisation schemes set the std from the layer size, for example Xavier/Glorot uses √(2/(fan_in + fan_out)) and He uses √(2/fan_in), to keep activations stable through layers; and monitoring the standard deviation of the loss across batches catches training instability early."