Statistics & Math for AI/ML Interviews · Lesson 5 of 30
Range and Dispersion
Why Multiple Measures of Spread
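Standard deviation is the most widely used spread measure, but it is sensitive to outliers and is expressed in the units of the data, which makes it hard to compare across scales. Different dispersion measures suit different data types and purposes: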
Range: simplest, most affected by outliers
IQR (Interquartile Range): robust, ignores extremes
Mean Absolute Deviation: intuitive, less common
Coefficient of Variation: relative spread — compares across scales
Standard Deviation: most common, used in normal distribution theory
Range
Range = max(x) - min(x)
Example: [2, 4, 5, 6, 100]
Range = 100 - 2 = 98
Problem: one outlier (100) dominates the range completely
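As a quick check of the point above, this minimal sketch (plain Python, using the example list from this section; the helper name `value_range` is illustrative) computes the range with and without the single extreme value.

```python
def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)

data = [2, 4, 5, 6, 100]
without_outlier = [x for x in data if x != 100]

print(value_range(data))             # 98
print(value_range(without_outlier))  # 4
```

Removing one point changes the result from 98 to 4, so the range mostly describes that single value rather than the bulk of the data.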
Use: quick check, not for serious analysis
In ML: check for obviously wrong values (a feature value of -999 suggests missing data encoded as a number)
Interquartile Range (IQR)
IQR = Q3 - Q1
Q1 = 25th percentile (median of lower half)
Q3 = 75th percentile (median of upper half)
Example: [2, 4, 5, 6, 100]
Q1 = 3 (median of [2, 4])
Q3 = 53 (median of [6, 100])
IQR = 53 - 3 = 50
Without the outlier [2, 4, 5, 6]:
Q1 = 3 (median of [2, 4]), Q3 = 5.5 (median of [5, 6]), IQR = 2.5
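Quartile values depend on the convention used, which matters when comparing a hand calculation with library output. The sketch below uses Python's standard-library `statistics.quantiles` on the five-point example. For this particular list, `method="exclusive"` reproduces the median-of-halves values computed by hand, while `method="inclusive"` interpolates the same way as NumPy's default `np.percentile` and gives a much smaller IQR. The helper name `iqr` is illustrative, not part of any library.

```python
from statistics import quantiles

def iqr(values, method):
    """Return (Q1, Q3, IQR) under the given quantile convention."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    return q1, q3, q3 - q1

data = [2, 4, 5, 6, 100]

print(iqr(data, "exclusive"))  # (3.0, 53.0, 50.0)
print(iqr(data, "inclusive"))  # (4.0, 6.0, 2.0)
```

With only five points, the outlier makes up half of the upper half under the first convention, so it still pulls Q3 up. Under the second it does not. The robustness of the IQR shows more clearly in larger samples, and it is worth stating which convention was used when reporting quartiles.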
Outlier detection (Tukey's fences):
Lower fence: Q1 - 1.5 × IQR
Upper fence: Q3 + 1.5 × IQR
Values outside the fences are flagged as potential outliers
Implementation
import numpy as np
from scipy import stats
data = [2, 4, 5, 6, 8, 9, 10, 100]
# Range
data_range = np.max(data) - np.min(data) # 98
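# IQR (np.percentile interpolates linearly by default, so Q1/Q3 can differ from the median-of-halves hand method)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1 # equivalent: stats.iqr(data)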
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
# Tukey's outlier fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
# Mean Absolute Deviation (MAD)
mean = np.mean(data)
mad = np.mean(np.abs(np.array(data) - mean))
# Coefficient of Variation (CV = std / mean × 100%)
cv = (np.std(data, ddof=1) / np.mean(data)) * 100
# CV makes sense only for positive data with a meaningful zero
IQR for Outlier Detection in ML
def iqr_outlier_mask(
    X: np.ndarray,
    threshold: float = 1.5,  # 1.5 = Tukey's mild, 3.0 = extreme outliers
) -> np.ndarray:
    """Return boolean mask: True where a row has at least one outlier feature."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return np.any((X < lower) | (X > upper), axis=1)
import pandas as pd
df = pd.DataFrame({
    "age": [25, 30, 28, 27, 200],  # 200 is an outlier
    "systolic_bp": [120, 125, 118, 130, 122],
})
outlier_mask = iqr_outlier_mask(df.values)
print(df[outlier_mask]) # row with age=200
print(df[~outlier_mask]) # clean rows
Coefficient of Variation
CV = (σ / μ) × 100%
Allows comparison of spread across different scales:
Model A test accuracy: mean=0.85, std=0.02 → CV = 2.4%
Model B test accuracy: mean=0.85, std=0.08 → CV = 9.4%
Same mean — B is much more variable across runs
Clinical lab example: INR measurements with mean=2.5, std=0.3 → CV = 12%
If a lab's INR measurements have CV = 30%, their assay is imprecise
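The comparisons above can be reproduced from raw measurements with a small helper. This is a minimal sketch using only the standard library; the function name `coefficient_of_variation` and the per-run accuracies are made up for illustration and are not meant to reproduce the exact figures quoted above.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    mu = mean(values)
    if mu <= 0:
        # The ratio std/mean is not meaningful for a zero or negative mean.
        raise ValueError("CV requires a positive mean")
    return stdev(values) / mu * 100

# Hypothetical test accuracies from five training runs per model (assumed values)
model_a = [0.83, 0.85, 0.86, 0.84, 0.87]
model_b = [0.75, 0.95, 0.80, 0.90, 0.85]

cv_a = coefficient_of_variation(model_a)
cv_b = coefficient_of_variation(model_b)
print(f"Model A CV: {cv_a:.1f}%")
print(f"Model B CV: {cv_b:.1f}%")
```

Both hypothetical models average 0.85, so the difference in CV comes entirely from run-to-run variability.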
Only valid when μ > 0 (ratio-scale data)
Which Measure to Use When
Measure | Best for                          | Robust to outliers
--------|-----------------------------------|-------------------
Range   | Quick sanity check                | No
IQR     | Skewed data, outlier detection    | Yes
Std Dev | Normal-ish data, gradient-based   | No
MAD     | Intuitive absolute spread         | Somewhat
CV      | Comparing across different scales | No
Interview Answer
"Range is the simplest spread measure but dominated by outliers. IQR (Q3 - Q1) is the interquartile range — robust to outliers, used for Tukey's fence outlier detection (flag values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR). Standard deviation is the standard for symmetric, roughly normal data. Coefficient of variation (std/mean × 100%) allows comparing spread across features on different scales. In ML: I use IQR outlier detection during feature engineering to flag and investigate anomalous training samples, and CV to compare model stability across different hyperparameter configurations."
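The robustness claims in this answer can be backed with a quick experiment. The sketch below, standard library only, computes four spread measures on a small list and on the same list with its largest value replaced by 100. The helper name `spread_summary` and the "clean" list are assumptions for illustration. Quartiles use `method="inclusive"`, which interpolates like NumPy's default.

```python
from statistics import mean, stdev, quantiles

def spread_summary(values):
    """Return range, IQR, sample standard deviation, and mean absolute deviation."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    mu = mean(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "std": stdev(values),
        "mean_abs_dev": mean([abs(x - mu) for x in values]),
    }

clean = [2, 4, 5, 6, 8, 9, 10, 11]
contaminated = clean[:-1] + [100]  # largest value replaced by an outlier

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    summary = spread_summary(data)
    print(name, {k: round(v, 2) for k, v in summary.items()})
```

In this sketch the IQR is identical for both lists, while the range, standard deviation, and mean absolute deviation all grow several-fold once the outlier is introduced.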