Learnixo

Statistics & Math for AI/ML Interviews · Lesson 3 of 30

Population vs Sample

The Core Distinction

Population: the complete set of all items you care about
  All patients with atrial fibrillation worldwide
  Every possible image a model might see in deployment
  All possible queries a RAG system might receive

Sample: a subset drawn from the population
  500 AF patients in a clinical trial
  ImageNet-1k (about 1.28M training images; a sample of all possible images)
  1000 queries from a production log — used to evaluate the model

Why it matters:
  We almost never have the full population — we make inferences from samples
  Sample statistics are estimates of population parameters
  The estimates have uncertainty (sampling error)
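
To make sampling error concrete, here is a minimal simulation sketch. The population is synthetic (100,000 normally distributed values, an assumption made purely for illustration), so its mean is known and each sample mean can be compared against it.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical population: synthetic values, not real data
population = rng.normal(loc=50.0, scale=10.0, size=100_000)
mu = population.mean()  # known here only because we built the population

# Draw five independent samples of n=500 and compute each sample mean
sample_means = np.array([
    rng.choice(population, size=500, replace=False).mean()
    for _ in range(5)
])

print(f"Population mean: {mu:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean: {m:.2f} (error {m - mu:+.2f})")
```

Each sample gives a slightly different mean; the gap between a sample mean and the population mean is the sampling error.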

Parameter vs Statistic

Population parameter (fixed, usually unknown):
  μ = true population mean
  σ = true population standard deviation
  p = true population proportion

Sample statistic (computed from data, varies across samples):
  x̄ = sample mean (estimates μ)
  s = sample standard deviation (estimates σ)
  p̂ = sample proportion (estimates p)

Goal of statistics: use sample statistics to estimate population parameters
with a quantified level of uncertainty (confidence intervals)
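
As a rough sketch of "estimate plus uncertainty", the snippet below computes a sample proportion and a large-sample (Wald) 95% confidence interval. The data are simulated: the 1000 outcomes and the true p of 0.8 are assumptions for illustration, and the normal approximation is only one of several ways to build such an interval.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 1 = correct answer, 0 = incorrect, for n logged queries
n = 1000
outcomes = rng.binomial(1, 0.8, size=n)  # 0.8 is the "true" p, known only in this simulation

p_hat = outcomes.mean()                # sample proportion, estimates p
se = np.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
z = 1.96                               # approx. 97.5th percentile of the standard normal
lower, upper = p_hat - z * se, p_hat + z * se

print(f"p_hat = {p_hat:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

The parameter p stays fixed; p̂ and the interval would change if a different sample of 1000 queries were drawn.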

Formulas Differ

Population mean:     μ = (1/N) × Σxᵢ
Sample mean:         x̄ = (1/n) × Σxᵢ
Same formula, different interpretation

Population variance: σ² = (1/N) × Σ(xᵢ - μ)²
Sample variance:     s² = (1/(n-1)) × Σ(xᵢ - x̄)²
Different formula — n-1 corrects for bias (Bessel's correction)

Why n-1?
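  μ is unknown, so deviations are measured from x̄ (sample mean) instead
  x̄ is the value that minimises Σ(xᵢ - x̄)² for the sample, so these deviations
    are on average smaller than deviations from μ
  Dividing by n therefore underestimates σ² by a factor of (n-1)/n on average;
    dividing by n-1 corrects it
  For large n: n vs n-1 barely matters
  For small n (rule of thumb: n<30): the correction is significant
    (at n=5, dividing by n-1 gives an estimate 25% larger than dividing by n)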
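
The bias can be checked empirically. The following is a simulation sketch under assumed settings (normal population with σ² = 4, sample size n = 5, 100,000 repeated samples); it compares the average of the divide-by-n and divide-by-(n-1) variance estimates against the true variance.

```python
import numpy as np

rng = np.random.default_rng(42)

true_var = 4.0      # sigma^2 of the simulated population (sigma = 2)
n = 5               # deliberately small sample size
n_trials = 100_000  # number of repeated samples

# Each row is one sample of size n drawn from N(0, sigma^2)
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

var_n = samples.var(axis=1, ddof=0)          # divide by n
var_n_minus_1 = samples.var(axis=1, ddof=1)  # divide by n-1

print(f"True variance:            {true_var:.3f}")
print(f"Mean of /n estimates:     {var_n.mean():.3f}")          # theory: (n-1)/n * 4 = 3.2
print(f"Mean of /(n-1) estimates: {var_n_minus_1.mean():.3f}")  # theory: 4.0
```

Averaged over many samples, the divide-by-n estimate should settle near (n-1)/n × σ², while the divide-by-(n-1) estimate should settle near σ².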

In Python

Python
import numpy as np

data = [12, 15, 14, 10, 13, 16, 11]

# Population statistics (you have ALL the data)
pop_mean = np.mean(data)
pop_std  = np.std(data, ddof=0)    # ddof=0: divide by N
pop_var  = np.var(data, ddof=0)

# Sample statistics (your data is a sample; use these by default in ML)
sample_mean = np.mean(data)         # same formula
sample_std  = np.std(data, ddof=1)  # ddof=1: divide by n-1
sample_var  = np.var(data, ddof=1)

# Mean is 13 and the squared deviations sum to 28
print(f"Pop std: {pop_std:.3f}")        # 2.000  (sqrt(28/7))
print(f"Sample std: {sample_std:.3f}")  # 2.160  (sqrt(28/6))

# Pandas defaults to the sample formula (ddof=1), unlike NumPy (ddof=0); correct for most ML use
import pandas as pd
s = pd.Series(data)
print(s.std())        # ≈ 2.160 (sample std)
print(s.std(ddof=0))  # 2.0 (population std)

Sampling in Machine Learning

Python
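# Train/test split: sample from your dataset for evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data so the snippet runs; substitute your own X (features) and y (labels)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# This gives you a sample-based estimate of generalisation performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)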

# Bootstrap sampling: resample with replacement to estimate variance
import numpy as np

def bootstrap_confidence_interval(
    data: np.ndarray,
    stat_fn,      # e.g., np.mean
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    bootstrap_stats = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_stats.append(stat_fn(sample))
    
    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, alpha / 2 * 100)
    upper = np.percentile(bootstrap_stats, (1 - alpha / 2) * 100)
    return float(lower), float(upper)

# Example: 95% CI for mean accuracy (only 5 fold scores, so treat the interval as rough)
accuracies = np.array([0.82, 0.79, 0.84, 0.81, 0.83])  # from k-fold
ci = bootstrap_confidence_interval(accuracies, np.mean)
print(f"Accuracy: {np.mean(accuracies):.3f} (95% CI: {ci[0]:.3f} – {ci[1]:.3f})")

Sampling Bias

Sampling bias: the sample is not representative of the population

Examples in ML:
  Selection bias: training on hospital data from a single centre;
    model performs poorly on patients from other hospitals
  Temporal bias: training on 2020 data, deploying in 2026 — distribution shift
  Survivorship bias: training on discharged patients only;
    excludes patients who died (different risk profile)
  Label bias: crowdsourced labels from a non-representative annotator pool

Clinical example:
  AF trial: male patients aged 65–80 in a single UK centre
  Population: all AF patients globally, including women, younger patients,
  different ethnicities, different comorbidities
  Model trained on this sample may not generalise to the full population
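
A small simulation can illustrate why a biased sample misleads even when it is large. This is a sketch with made-up numbers: a hypothetical population with two subgroups (event rates of 10% and 30%), one sample drawn from the whole population and one drawn from a single subgroup only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population with two subgroups (all numbers are illustrative)
group_a = rng.binomial(1, 0.10, size=60_000)  # subgroup A: 10% event rate
group_b = rng.binomial(1, 0.30, size=40_000)  # subgroup B: 30% event rate
population = np.concatenate([group_a, group_b])
true_rate = population.mean()

# Representative sample: drawn uniformly from the whole population
random_sample = rng.choice(population, size=2_000, replace=False)

# Biased sample: drawn from subgroup A only (e.g. a single centre)
biased_sample = rng.choice(group_a, size=2_000, replace=False)

print(f"True event rate:        {true_rate:.3f}")
print(f"Random sample estimate: {random_sample.mean():.3f}")
print(f"Biased sample estimate: {biased_sample.mean():.3f}")
```

Both samples have the same size, but only the random one tracks the population rate. Collecting more data from subgroup A alone would shrink the variance of the biased estimate without removing the bias.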

Interview Answer

"A population is the complete set you care about; a sample is a subset from which you estimate population parameters. The key formula difference: sample variance divides by n-1 (Bessel's correction) rather than n, because using the sample mean as a proxy for the true mean underestimates variance. In ML this matters in two ways: always use sample formulas when computing statistics on training data (it's a sample of all possible data); and be aware of sampling bias — if the training sample is unrepresentative of the deployment population, model performance will not generalise."