Statistics & Math for AI/ML Interviews · Lesson 1 of 30
Mean, Median, Mode
The Three Measures
Mean: arithmetic average — sum divided by count
x̄ = (1/n) × Σxᵢ
Sensitive to outliers
Median: middle value when sorted
n odd: middle element
n even: average of two middle elements
Robust to outliers
Mode: most frequent value(s)
Can be multiple (bimodal, multimodal)
Most informative for discrete or categorical data; for continuous data it has to be estimated (for example, from a density peak)
Example
Dataset: [2, 4, 4, 5, 6, 100] (hours of sleep, one outlier)
Mean: (2 + 4 + 4 + 5 + 6 + 100) / 6 = 121 / 6 ≈ 20.2
Distorted by the outlier (100)
Median: sorted → [2, 4, 4, 5, 6, 100]
n=6 (even), middle pair = (4, 5), median = (4+5)/2 = 4.5
Not affected by the outlier
Mode: 4 (appears twice; all other values appear once)
Implementation
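Before turning to libraries, the three definitions can be written out in plain Python. This is a minimal sketch rather than a reference implementation; the function names mean, median, and modes are chosen here for illustration and do not come from any library.

```python
from collections import Counter

def mean(values):
    """Arithmetic average: sum divided by count."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data; average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All most-frequent values, so bimodal and multimodal data are handled."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

data = [2, 4, 4, 5, 6, 100]
print(round(mean(data), 2))    # 20.17
print(median(data))            # 4.5
print(modes(data))             # [4]
print(modes([1, 1, 2, 2, 3]))  # [1, 2] (bimodal)
```

The first three printed values match the worked example; the last call shows why returning a list is convenient when more than one value ties for most frequent.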
Python
import numpy as np
from scipy import stats
data = [2, 4, 4, 5, 6, 100]
mean = np.mean(data) # 20.17
median = np.median(data) # 4.5
mode = stats.mode(data).mode # 4 (a scalar in recent SciPy; older versions return a length-1 array)
# Pandas (common in data analysis)
import pandas as pd
s = pd.Series(data)
print(s.mean(), s.median(), s.mode()[0])
# For continuous data: estimate the mode as the peak of a kernel density estimate (KDE)
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
xs = np.linspace(min(data), max(data), 1000)
mode_continuous = xs[np.argmax(kde(xs))]
When to Use Each
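A quick experiment shows why the choice of measure matters. The sketch below uses made-up latency values and class labels (both hypothetical, for illustration only) with Python's standard-library statistics module. It compares how far the mean and the median move when a single slow request is added, and picks the most common label from a categorical list.

```python
import statistics

# Hypothetical request latencies in milliseconds (illustrative values only).
latencies = [100, 102, 98, 101, 99]
with_outlier = latencies + [5000]  # one very slow request

# How far does each measure move when the outlier is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(latencies)
median_shift = statistics.median(with_outlier) - statistics.median(latencies)

print(f"mean shift:   {mean_shift:.1f} ms")    # 816.7 ms
print(f"median shift: {median_shift:.1f} ms")  # 0.5 ms

# Hypothetical predicted labels: the mode is the natural summary for categories.
predicted_labels = ["cat", "dog", "cat", "bird", "cat"]
print(statistics.mode(predicted_labels))       # cat
```

One extreme value moves the mean by hundreds of milliseconds while the median barely changes, which is the reason latency is usually reported as P50 rather than as an average.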
Use mean when:
Data is roughly symmetric (no heavy outliers)
You need a value that accounts for all data points
Summing makes sense (total revenue / n customers)
Examples: model loss averaging, batch metrics, A/B test means
Use median when:
Data has outliers or is skewed
You want the "typical" value
Examples: housing prices, income distributions, latency (P50)
In ML: median imputation for features with outlier values
Use mode when:
Categorical data
You want the most common class
Examples: most common prediction label, most frequent user action
In ML: mode imputation for missing categorical values
In Machine Learning
Python
# Mean in ML: batch loss averaging
batch_losses = [0.45, 0.52, 0.38, 0.91, 0.44]
mean_loss = np.mean(batch_losses) # 0.54 — pulled up by 0.91
# Median loss (a more robust summary when a few batches are noisy)
median_loss = np.median(batch_losses) # 0.45
# Imputation example
import pandas as pd
df = pd.DataFrame({"age": [25, 30, None, 28, 200], "gender": ["M", "F", None, "M", "F"]})
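# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df["age"] = df["age"].fillna(df["age"].median()) # median is 29, robust to the outlier 200
df["gender"] = df["gender"].fillna(df["gender"].mode()[0]) # most common value; "M" and "F" tie here, so the first in sorted order ("F") is used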
# Model evaluation: mean vs median accuracy across k-fold
fold_accuracies = [0.82, 0.79, 0.95, 0.81, 0.80] # fold 3 suspiciously high
print(f"Mean: {np.mean(fold_accuracies):.3f}") # 0.834 — pulled up
print(f"Median: {np.median(fold_accuracies):.3f}") # 0.810 (more representative)
Relationship: Skewed Distributions
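A small simulation can illustrate how skew pulls the three measures apart. The sketch below assumes an exponential sample as a stand-in for right-skewed data and its mirror image for left-skewed data. The summarize helper and the histogram-based mode estimate (50 bins) are illustrative choices made here, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def summarize(sample, bins=50):
    """Return (mean, median, histogram-based mode estimate) for a 1-D sample."""
    counts, edges = np.histogram(sample, bins=bins)
    peak = np.argmax(counts)
    mode_estimate = (edges[peak] + edges[peak + 1]) / 2  # center of the tallest bin
    return np.mean(sample), np.median(sample), mode_estimate

right_skewed = rng.exponential(scale=1.0, size=100_000)  # long right tail
left_skewed = -right_skewed                               # mirror image: long left tail

r_mean, r_median, r_mode = summarize(right_skewed)
l_mean, l_median, l_mode = summarize(left_skewed)

print(f"right-skewed: mode={r_mode:.2f} median={r_median:.2f} mean={r_mean:.2f}")
print(f"left-skewed:  mean={l_mean:.2f} median={l_median:.2f} mode={l_mode:.2f}")
```

With a sample this large, the printed values should come out in increasing order on each line: mode, median, mean for the right-skewed case and mean, median, mode for the left-skewed case.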
Left-skewed (negative skew):
Typically Mean < Median < Mode
Example: test scores where most score high, a few score very low
Symmetric and unimodal (e.g., normal distribution):
Mean = Median = Mode
Right-skewed (positive skew):
Typically Mode < Median < Mean
Example: income, house prices, ML training loss early in training
These orderings are a rule of thumb for unimodal distributions, not a guarantee.
Interview Answer
"Mean is the arithmetic average: sensitive to outliers and appropriate for roughly symmetric distributions. Median is the middle value when sorted: robust to outliers and better for skewed data (income, latency, house prices). Mode is the most frequent value: most useful for discrete or categorical data. In ML, mean is the standard for loss averaging and metric reporting, but I use the median when evaluating across folds with potentially anomalous results, and I use median imputation for numeric features and mode imputation for categorical features with missing values."