Range and Dispersion
Measures of spread beyond standard deviation — range, IQR, mean absolute deviation, and coefficient of variation — and when each is appropriate in ML contexts.
Why Multiple Measures of Spread
Standard deviation is the dominant spread measure, but it has weaknesses. Different dispersion measures suit different data types and purposes:
Range: simplest, most affected by outliers
IQR (Interquartile Range): robust, ignores extremes
Mean Absolute Deviation: intuitive, less common
Coefficient of Variation: relative spread — compares across scales
Standard Deviation: most common, used in normal distribution theory
Range
Range = max(x) - min(x)
Example: [2, 4, 5, 6, 100]
Range = 100 - 2 = 98
Problem: one outlier (100) dominates the range completely
Use: quick check, not for serious analysis
In ML: check for obviously wrong values (feature value -999 suggests missing data encoded as number)
Interquartile Range (IQR)
IQR = Q3 - Q1
Q1 = 25th percentile (median of lower half)
Q3 = 75th percentile (median of upper half)
Example: [2, 4, 5, 6, 100]
Q1 = 3 (median of [2, 4])
Q3 = 53 (median of [6, 100])
IQR = 53 - 3 = 50
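Without the outlier [2, 4, 5, 6]:
Q1 = 3 (median of [2, 4]), Q3 = 5.5 (median of [5, 6]), IQR = 2.5
Note: with only five points the outlier still falls in the upper half and drags Q3 upward; in larger samples a single extreme value barely moves the quartiles.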
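The hand calculation above uses the median-of-halves convention, while NumPy's default np.percentile uses linear interpolation, so the two can disagree sharply on small samples. A minimal sketch of the difference, assuming a helper named quartiles_median_of_halves (hypothetical, written here for illustration and not part of NumPy or SciPy):

```python
import numpy as np

def quartiles_median_of_halves(values):
    """Hypothetical helper: Q1/Q3 as medians of the lower and upper halves.

    When the sample size is odd, the middle value is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower, upper = s[:half], s[len(s) - half:]
    return float(np.median(lower)), float(np.median(upper))

data = [2, 4, 5, 6, 100]

q1_halves, q3_halves = quartiles_median_of_halves(data)
# NumPy's default percentile method is linear interpolation between order statistics
q1_interp, q3_interp = np.percentile(data, [25, 75])

print(f"median of halves: Q1={q1_halves}, Q3={q3_halves}, IQR={q3_halves - q1_halves}")
print(f"np.percentile:    Q1={q1_interp}, Q3={q3_interp}, IQR={q3_interp - q1_interp}")
```

On this five-point sample the median-of-halves convention gives IQR = 50 and NumPy's default gives IQR = 2, so it is worth stating which convention a reported IQR uses.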
Outlier detection (Tukey's fences):
Lower fence: Q1 - 1.5 × IQR
Upper fence: Q3 + 1.5 × IQR
Values outside fences are outliers
Implementation
import numpy as np
from scipy import stats
data = [2, 4, 5, 6, 8, 9, 10, 100]
# Range
data_range = np.max(data) - np.min(data) # 98
# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1 # scipy: stats.iqr(data)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
# Tukey's outlier fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
# Mean Absolute Deviation (MAD); not the median absolute deviation, which shares the abbreviation
mean = np.mean(data)
mad = np.mean(np.abs(np.array(data) - mean))
# Coefficient of Variation (CV = std / mean × 100%)
cv = (np.std(data, ddof=1) / np.mean(data)) * 100
# CV makes sense only for positive data with a meaningful zero
IQR for Outlier Detection in ML
def iqr_outlier_mask(
    X: np.ndarray,
    threshold: float = 1.5,  # 1.5 = Tukey's mild, 3.0 = extreme outliers
) -> np.ndarray:
    """Return boolean mask: True where a row has at least one outlier feature."""
    # Quartiles are computed per column (feature)
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    # A row is flagged if any of its features falls outside that feature's fences
    return np.any((X < lower) | (X > upper), axis=1)
import pandas as pd
df = pd.DataFrame({
    "age": [25, 30, 28, 27, 200],  # 200 is an outlier
    "systolic_bp": [120, 125, 118, 130, 122],
})
outlier_mask = iqr_outlier_mask(df.values)
print(df[outlier_mask]) # row with age=200
print(df[~outlier_mask]) # clean rows
Coefficient of Variation
CV = (σ / μ) × 100%
Allows comparison of spread across different scales:
Model A test accuracy: mean=0.85, std=0.02 → CV = 2.4%
Model B test accuracy: mean=0.85, std=0.08 → CV = 9.4%
Same mean — B is much more variable across runs
Clinical lab normal range: INR mean=2.5, std=0.3 → CV = 12%
If a lab's INR measurements have CV = 30%, their assay is imprecise
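Because CV divides by the mean, it is sensitive to where the scale's zero lies. A minimal sketch, assuming a hypothetical helper coefficient_of_variation and made-up temperature readings, shows the same data giving very different CVs in Celsius and in Kelvin:

```python
import numpy as np

def coefficient_of_variation(values, ddof=1):
    """Hypothetical helper: sample CV in percent; rejects non-positive means."""
    arr = np.asarray(values, dtype=float)
    mean = arr.mean()
    if mean <= 0:
        raise ValueError("CV is only meaningful for data with a positive mean")
    return float(arr.std(ddof=ddof) / mean * 100)

# Illustrative (made-up) temperatures: the same readings on two scales
temps_celsius = [10.0, 20.0, 30.0]
temps_kelvin = [t + 273.15 for t in temps_celsius]

cv_celsius = coefficient_of_variation(temps_celsius)  # zero point is arbitrary
cv_kelvin = coefficient_of_variation(temps_kelvin)    # zero point is physical

print(f"CV in Celsius: {cv_celsius:.1f}%")
print(f"CV in Kelvin:  {cv_kelvin:.1f}%")
```

The standard deviation is identical in both cases; only the mean shifts, which is why the ratio changes when the zero point is arbitrary.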
Only valid when μ > 0 (ratio-scale data)
Which Measure to Use When
Measure  | Best for                          | Robust to outliers
---------|-----------------------------------|-------------------
Range    | Quick sanity check                | No
IQR      | Skewed data, outlier detection    | Yes
Std Dev  | Normal-ish data, gradient-based   | No
MAD      | Intuitive absolute spread         | Somewhat
CV       | Comparing across different scales | No
Interview Answer
"Range is the simplest spread measure but dominated by outliers. IQR (Q3 - Q1) is the interquartile range — robust to outliers, used for Tukey's fence outlier detection (flag values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR). Standard deviation is the standard for symmetric, roughly normal data. Coefficient of variation (std/mean × 100%) allows comparing spread across features on different scales. In ML: I use IQR outlier detection during feature engineering to flag and investigate anomalous training samples, and CV to compare model stability across different hyperparameter configurations."