Range and Dispersion

Why Multiple Measures of Spread

Standard deviation is the dominant measure of spread, but it is sensitive to outliers and is most informative for roughly symmetric data. Different dispersion measures suit different data types and purposes:

Range:                     simplest, most affected by outliers
IQR (Interquartile Range): robust, ignores extremes
Mean Absolute Deviation:   intuitive, less common
Coefficient of Variation:  relative spread, compares across scales
Standard Deviation:        most common, used in normal distribution theory

Range

Range = max(x) - min(x)

Example: [2, 4, 5, 6, 100]
  Range = 100 - 2 = 98

Problem: a single outlier (100) determines the range entirely
Use: quick sanity check; too fragile to summarize spread on its own
In ML: scan the min and max for implausible values (for example, a feature minimum of -999 usually means missing data encoded as a sentinel number)
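The range calculation and the sanity-check use case can be sketched as follows. This is a minimal illustration, not a recommended pipeline step: `value_range` is a hypothetical helper name, and the `feature` array with its -999 sentinel is made-up data.

```python
import numpy as np

def value_range(x):
    """Return max(x) - min(x) as a float."""
    arr = np.asarray(x, dtype=float)
    return float(arr.max() - arr.min())

data = [2, 4, 5, 6, 100]
full_range = value_range(data)          # 100 - 2 = 98.0
trimmed_range = value_range(data[:-1])  # 6 - 2 = 4.0 once the outlier is dropped
print(full_range, trimmed_range)

# Hypothetical feature column where -999 was used as a missing-value code.
feature = np.array([0.4, 1.2, -999.0, 0.9, 1.5])
print(feature.min(), feature.max())     # the minimum exposes the sentinel
```

Dropping one value shrinks the range from 98 to 4, which is why the range works as a smoke test but not as a summary of spread.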

Interquartile Range (IQR)

IQR = Q3 - Q1
  Q1 = 25th percentile (median of lower half)
  Q3 = 75th percentile (median of upper half)

Example: [2, 4, 5, 6, 100]
  Q1 = 3 (median of [2, 4])
  Q3 = 53 (median of [6, 100])
  IQR = 53 - 3 = 50
  
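  Without the outlier [2, 4, 5, 6]:
  Q1 = 3 (median of [2, 4]), Q3 = 5.5 (median of [5, 6]), IQR = 2.5

  Note: with only five points the upper half is just [6, 100], so the outlier still pulls Q3 up. Quartile conventions also differ: linear interpolation (NumPy's default) gives Q1 = 4, Q3 = 6, IQR = 2 for the same five values.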

Outlier detection (Tukey's fences):
  Lower fence: Q1 - 1.5 × IQR
  Upper fence: Q3 + 1.5 × IQR
  Values outside the fences are flagged as potential outliers
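The hand calculation above uses the median-of-halves convention, while `np.percentile` interpolates linearly by default. The sketch below puts both next to each other on the five-point example, to show how much the convention matters on tiny samples. The helper names `quartiles_median_of_halves` and `tukey_fences` are illustrative, not library functions.

```python
import numpy as np

def quartiles_median_of_halves(x):
    """Q1 and Q3 as medians of the lower and upper halves.

    For odd-length data the overall median is excluded from both halves.
    """
    s = sorted(x)
    if len(s) < 2:
        raise ValueError("need at least two values")
    half = len(s) // 2
    return float(np.median(s[:half])), float(np.median(s[-half:]))

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [2, 4, 5, 6, 100]

# Convention used in the worked example: Q1 = 3, Q3 = 53.
q1_h, q3_h = quartiles_median_of_halves(data)
fences_h = tukey_fences(q1_h, q3_h)            # (-72.0, 128.0)
flagged_h = [v for v in data if v < fences_h[0] or v > fences_h[1]]

# NumPy default (linear interpolation): Q1 = 4, Q3 = 6.
q1_n, q3_n = (float(q) for q in np.percentile(data, [25, 75]))
fences_n = tukey_fences(q1_n, q3_n)            # (1.0, 9.0)
flagged_n = [v for v in data if v < fences_n[0] or v > fences_n[1]]

print(flagged_h)  # [] because the outlier inflated Q3 and the upper fence
print(flagged_n)  # [100]
```

On five points the median-of-halves fences are wide enough to hide the value 100, while the interpolated quartiles flag it. With larger samples the two conventions converge, so the choice matters mainly for very small datasets.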

Implementation

Python

import numpy as np
from scipy import stats

data = [2, 4, 5, 6, 8, 9, 10, 100]

# Range
data_range = np.max(data) - np.min(data)  # 98

# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1   # scipy: stats.iqr(data)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")

# Tukey's outlier fences
lower_fence = q1 - 1.5 * iqr   # 4.75 - 6.75 = -2.0
upper_fence = q3 + 1.5 * iqr   # 9.25 + 6.75 = 16.0
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(f"Outliers: {outliers}")  # [100]

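# Mean Absolute Deviation (MAD): average absolute distance from the mean
# (the abbreviation MAD is also widely used for the median absolute deviation)
mean = np.mean(data)
mad = np.mean(np.abs(np.array(data) - mean))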

# Coefficient of Variation (CV = std / mean × 100%)
cv = (np.std(data, ddof=1) / np.mean(data)) * 100
# CV makes sense only for positive data with a meaningful zero

IQR for Outlier Detection in ML

Python

def iqr_outlier_mask(
    X: np.ndarray,
    threshold: float = 1.5,   # 1.5 = Tukey's mild, 3.0 = extreme outliers
) -> np.ndarray:
    """Return boolean mask: True where a row has at least one outlier feature."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    
    return np.any((X < lower) | (X > upper), axis=1)


import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 28, 27, 200],    # 200 is an outlier
    "systolic_bp": [120, 125, 118, 130, 122],
})

outlier_mask = iqr_outlier_mask(df.to_numpy())
print(df[outlier_mask])   # row with age=200
print(df[~outlier_mask])  # clean rows
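A row-level mask says that a row is suspicious but not which feature caused it. One possible follow-up, sketched here with a hypothetical helper `iqr_outlier_counts`, is to count flagged values per column so the offending feature can be investigated. It uses the same fences as above, computed with pandas' default (linear) quantile interpolation.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, threshold: float = 1.5) -> pd.Series:
    """Count values outside the IQR fences for each numeric column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    # Comparing a DataFrame with a Series aligns the Series index on columns.
    flagged = (df < lower) | (df > upper)
    return flagged.sum()

df = pd.DataFrame({
    "age": [25, 30, 28, 27, 200],
    "systolic_bp": [120, 125, 118, 130, 122],
})

counts = iqr_outlier_counts(df)
print(counts)  # age has one flagged value, systolic_bp has none
```

Flagging is only the first step: whether to drop, cap, or correct a flagged value depends on whether it is a data-entry error or a genuine extreme.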

Coefficient of Variation

CV = (σ / μ) × 100%

Allows comparison of spread across different scales:

  Model A test accuracy: mean=0.85, std=0.02 → CV = 2.4%
  Model B test accuracy: mean=0.85, std=0.08 → CV = 9.4%
  Same mean — B is much more variable across runs

  Clinical lab example: INR measurements with mean=2.5, std=0.3 → CV = 12%
  If a lab's INR measurements show CV = 30%, the assay is imprecise

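Only meaningful for ratio-scale data with μ > 0: as μ approaches 0 the CV blows up, and on interval scales such as °C it depends on an arbitrary zero point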
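The figures above can be reproduced with a few lines. The sketch below also checks the property that makes CV useful across scales: rescaling the data (here metres to centimetres, on made-up heights) leaves it unchanged. `cv_percent` is a hypothetical helper, and rejecting non-positive means is one reasonable way to enforce the μ > 0 condition.

```python
import numpy as np

def cv_percent(mean: float, std: float) -> float:
    """Coefficient of variation in percent; defined here only for mean > 0."""
    if mean <= 0:
        raise ValueError("CV requires a positive mean (ratio-scale data)")
    return 100.0 * std / mean

cv_a = cv_percent(0.85, 0.02)   # Model A
cv_b = cv_percent(0.85, 0.08)   # Model B
cv_inr = cv_percent(2.5, 0.3)   # INR example
print(round(cv_a, 1), round(cv_b, 1), round(cv_inr, 1))  # 2.4 9.4 12.0

# Scale invariance: the same (made-up) heights in metres and centimetres.
heights_m = np.array([1.60, 1.72, 1.81, 1.68, 1.75])
heights_cm = heights_m * 100
cv_m = cv_percent(heights_m.mean(), heights_m.std(ddof=1))
cv_cm = cv_percent(heights_cm.mean(), heights_cm.std(ddof=1))
print(np.isclose(cv_m, cv_cm))  # True: the unit cancels out
```

Standard deviation would differ by a factor of 100 between the two height arrays, while the CV is identical, which is what "relative spread" means in practice.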

Which Measure to Use When

Measure             | Best for                               | Robust to outliers
--------------------|----------------------------------------|--------------------
Range               | Quick sanity check                     | No
IQR                 | Skewed data, outlier detection         | Yes
Std Dev             | Normal-ish data, gradient-based models | No
Mean Abs Dev (MAD)  | Intuitive absolute spread              | Somewhat
CV                  | Comparing across different scales      | No
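The "Robust to outliers" column can be checked empirically. The sketch below computes every measure on the implementation dataset with and without the value 100. `spread_summary` is a hypothetical helper, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def spread_summary(x):
    """Return the five spread measures discussed in this article."""
    arr = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    mean = arr.mean()
    std = arr.std(ddof=1)  # sample standard deviation
    return {
        "range": float(arr.max() - arr.min()),
        "iqr": float(q3 - q1),
        "std": float(std),
        "mad": float(np.mean(np.abs(arr - mean))),  # mean absolute deviation
        "cv_pct": float(100 * std / mean),
    }

clean = [2, 4, 5, 6, 8, 9, 10]
with_outlier = clean + [100]

clean_stats = spread_summary(clean)
outlier_stats = spread_summary(with_outlier)

for name in clean_stats:
    print(f"{name:>7}: {clean_stats[name]:8.2f} -> {outlier_stats[name]:8.2f}")
# range goes from 8 to 98, while IQR only moves from 4.0 to 4.5
```

A single extreme value multiplies the range and the standard deviation many times over, while the IQR barely moves.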

Interview Answer

"Range is the simplest spread measure but dominated by outliers. IQR (Q3 - Q1) is the interquartile range — robust to outliers, used for Tukey's fence outlier detection (flag values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR). Standard deviation is the standard for symmetric, roughly normal data. Coefficient of variation (std/mean × 100%) allows comparing spread across features on different scales. In ML: I use IQR outlier detection during feature engineering to flag and investigate anomalous training samples, and CV to compare model stability across different hyperparameter configurations."
