Statistics & Math for AI/ML Interviews · Lesson 3 of 30
Population vs Sample
The Core Distinction
Population: the complete set of all items you care about
All patients with atrial fibrillation worldwide
Every possible image a model might see in deployment
All possible queries a RAG system might receive
Sample: a subset drawn from the population
500 AF patients in a clinical trial
ImageNet (1.2M images — a sample of all possible images)
1000 queries from a production log — used to evaluate the model
Why it matters:
We almost never have the full population — we make inferences from samples
Sample statistics are estimates of population parameters
The estimates have uncertainty (sampling error)
Parameter vs Statistic
Population parameter (fixed, usually unknown):
μ = true population mean
σ = true population standard deviation
p = true population proportion
Sample statistic (computed from data, varies across samples):
x̄ = sample mean (estimates μ)
s = sample standard deviation (estimates σ)
p̂ = sample proportion (estimates p)
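To make "varies across samples" concrete, here is a minimal simulation sketch. The synthetic population, the seeds, the sample sizes, and the helper name `draw_sample_means` are all assumptions chosen for illustration; they are not part of the lesson's data.

```python
import numpy as np

def draw_sample_means(population: np.ndarray, n: int, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw n_samples random samples of size n and return the mean of each one."""
    rng = np.random.default_rng(seed)
    return np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(n_samples)
    ])

# Hypothetical population: synthetic values, fully known, so mu can be computed exactly
rng = np.random.default_rng(42)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)
mu = population.mean()  # the parameter: a single fixed number

# The statistic: a different x-bar for every sample drawn
small = draw_sample_means(population, n=25, n_samples=500)
large = draw_sample_means(population, n=400, n_samples=500)

print(f"mu (fixed): {mu:.2f}")
print(f"n=25:  std of sample means = {small.std(ddof=1):.2f}")
print(f"n=400: std of sample means = {large.std(ddof=1):.2f}")
```

The sample means scatter around μ, and the scatter narrows as n grows, which is the sampling error that a confidence interval quantifies.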
Goal of statistics: use sample statistics to estimate population parameters
with a quantified level of uncertainty (confidence intervals)
Formulas Differ
Population mean: μ = (1/N) × Σxᵢ
Sample mean: x̄ = (1/n) × Σxᵢ
Same formula, different interpretation
Population variance: σ² = (1/N) × Σ(xᵢ - μ)²
Sample variance: s² = (1/(n-1)) × Σ(xᵢ - x̄)²
Different formula — n-1 corrects for bias (Bessel's correction)
Why n-1?
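Using x̄ (sample mean) instead of μ (true mean) introduces bias
The sample mean minimises Σ(xᵢ - x̄)², so the squared deviations from x̄ never total more than the squared deviations from μ
Dividing that sum by n therefore underestimates σ² on average, by a factor of (n-1)/n; dividing by n-1 removes the bias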
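The bias described above can be checked by simulation. This is a sketch under stated assumptions: the normal population with σ² = 4, the sample size n = 5, the trial count, and the seed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0      # sigma^2 of the hypothetical population (sigma = 2)
n = 5               # small sample size, where the bias is most visible
n_trials = 100_000  # number of independent samples to average over

# Each row is one independent sample of size n
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

biased = samples.var(axis=1, ddof=0)    # divide by n
unbiased = samples.var(axis=1, ddof=1)  # divide by n-1 (Bessel's correction)

mean_biased = biased.mean()
mean_unbiased = unbiased.mean()

print(f"true variance:               {true_var:.3f}")
print(f"average of /n estimates:     {mean_biased:.3f}")
print(f"average of /(n-1) estimates: {mean_unbiased:.3f}")
```

Averaged over many samples, the divide-by-n estimate should land near (n-1)/n × σ² = 3.2, while the divide-by-(n-1) estimate should land near the true value of 4.0.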
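For large n: n vs n-1 barely matters (at n = 1000 the two variances differ by about 0.1%)
For small n: the correction is significant (at n = 10 the sample variance is about 11% larger; n < 30 is a common rule of thumb)
In Python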
Python
import numpy as np
data = [12, 15, 14, 10, 13, 16, 11]
# Population statistics (you have ALL the data)
pop_mean = np.mean(data)
pop_std = np.std(data, ddof=0) # ddof=0: divide by N
pop_var = np.var(data, ddof=0)
# Sample statistics (your data is a sample — use these by default in ML)
sample_mean = np.mean(data) # same formula
sample_std = np.std(data, ddof=1) # ddof=1: divide by n-1
sample_var = np.var(data, ddof=1)
print(f"Pop std: {pop_std:.3f}") # 2.000
print(f"Sample std: {sample_std:.3f}") # 2.160
# Pandas default is sample (ddof=1), correct for most ML use; NumPy's default is ddof=0
import pandas as pd
s = pd.Series(data)
print(s.std()) # approx. 2.160 (sample std)
print(s.std(ddof=0)) # 2.0 (population std)
Sampling in Machine Learning
Python
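# Train/test split: sample from your dataset for evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Placeholder data so the snippet runs; substitute your own features X and labels y
X, y = make_classification(n_samples=200, random_state=42)
# This gives you a sample-based estimate of generalisation performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)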
# Bootstrap sampling: resample with replacement to estimate variance
import numpy as np
def bootstrap_confidence_interval(
    data: np.ndarray,
    stat_fn,  # e.g., np.mean
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    bootstrap_stats = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_stats.append(stat_fn(sample))
    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, alpha / 2 * 100)
    upper = np.percentile(bootstrap_stats, (1 - alpha / 2) * 100)
    return float(lower), float(upper)
# Example: 95% CI for model accuracy
accuracies = np.array([0.82, 0.79, 0.84, 0.81, 0.83]) # from k-fold
ci = bootstrap_confidence_interval(accuracies, np.mean)
print(f"Accuracy: {np.mean(accuracies):.3f} (95% CI: {ci[0]:.3f} – {ci[1]:.3f})")
Sampling Bias
Sampling bias: the sample is not representative of the population
Examples in ML:
Selection bias: training on hospital data from a single centre;
model performs poorly on patients from other hospitals
Temporal bias: training on 2020 data, deploying in 2026 — distribution shift
Survivorship bias: training on discharged patients only;
excludes patients who died (different risk profile)
Label bias: crowdsourced labels from a non-representative annotator pool
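The selection-bias case can be illustrated with a small simulation. This is a sketch only: the two "centres", their score distributions, the 30/70 mix, and the sample size are hypothetical values invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population made of two centres with different score distributions
centre_a = rng.normal(loc=60.0, scale=8.0, size=30_000)  # the only centre we can sample from
centre_b = rng.normal(loc=75.0, scale=8.0, size=70_000)  # centres the sample never sees
population = np.concatenate([centre_a, centre_b])
mu = population.mean()  # true population mean

n = 500
random_sample = rng.choice(population, size=n, replace=False)  # representative sample
biased_sample = rng.choice(centre_a, size=n, replace=False)    # single-centre sample

print(f"population mean:       {mu:.2f}")
print(f"random sample mean:    {random_sample.mean():.2f}")
print(f"single-centre mean:    {biased_sample.mean():.2f}")
```

The random sample tracks the population mean to within sampling error, while the single-centre sample is off by a systematic amount. Drawing a larger single-centre sample would tighten the estimate around the wrong value rather than fix it.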
Clinical example:
AF trial: male patients aged 65–80 in a single UK centre
Population: all AF patients globally, including women, younger patients,
different ethnicities, different comorbidities
Model trained on this sample may not generalise to the full population
Interview Answer
"A population is the complete set you care about; a sample is a subset from which you estimate population parameters. The key formula difference: sample variance divides by n-1 (Bessel's correction) rather than n, because using the sample mean as a proxy for the true mean underestimates variance. In ML this matters in two ways: always use sample formulas when computing statistics on training data (it's a sample of all possible data); and be aware of sampling bias — if the training sample is unrepresentative of the deployment population, model performance will not generalise."