Learnixo

Statistics & Math for AI/ML Interviews · Lesson 5 of 30

Range and Dispersion

Why Multiple Measures of Spread

Standard deviation is the most widely used measure of spread, but it is sensitive to outliers and depends on the units of the data. Different dispersion measures suit different data types and purposes:

Range:                     simplest, most affected by outliers
IQR (Interquartile Range): robust, ignores extremes
Mean Absolute Deviation:   intuitive, less common
Coefficient of Variation:  relative spread, compares across scales
Standard Deviation:        most common, used in normal distribution theory

Range

Range = max(x) - min(x)

Example: [2, 4, 5, 6, 100]
  Range = 100 - 2 = 98

Problem: a single outlier (100) determines the range entirely
Use: quick sanity check, not a basis for serious analysis
In ML: scan min/max for obviously wrong values (e.g., a feature value of -999 usually means missing data encoded as a number)
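
The sanity check described above can be sketched as follows. The feature values and the `SENTINEL` constant are made up for illustration; the point is only that an implausible minimum inflates the range and is worth investigating before any modelling.

```python
import numpy as np

# Hypothetical feature column; -999 stands in for a missing-value sentinel.
feature = np.array([12.0, 15.5, 11.2, -999.0, 14.8])

# Range = max - min (np.ptp(feature) computes the same quantity).
feature_range = np.max(feature) - np.min(feature)
print(f"min={feature.min()}, max={feature.max()}, range={feature_range}")

# Assumption for illustration: -999 encodes "missing", so drop it and recompute.
SENTINEL = -999.0
cleaned = feature[feature != SENTINEL]
cleaned_range = np.max(cleaned) - np.min(cleaned)
print(f"range without sentinel={cleaned_range:.1f}")
```

A range that collapses from over a thousand to a few units once one value is removed is a strong hint that the value is an encoding artifact rather than a real measurement.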

Interquartile Range (IQR)

IQR = Q3 - Q1
  Q1 = 25th percentile (median of lower half)
  Q3 = 75th percentile (median of upper half)

Example: [2, 4, 5, 6, 100]
  Q1 = 3 (median of [2, 4])
  Q3 = 53 (median of [6, 100])
  IQR = 53 - 3 = 50
  
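  Without the outlier, [2, 4, 5, 6]:
  Q1 = 3 (median of [2, 4]), Q3 = 5.5 (median of [5, 6]), IQR = 2.5

  Note: with only five points, the outlier falls inside the upper half and inflates Q3.
  The IQR's robustness shows on larger samples, where a few extremes stay outside the middle 50%.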
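
Quartiles have several competing conventions, so hand-computed values may not match library output. The sketch below compares the median-of-halves rule used in the worked example with NumPy's default (linear interpolation). The helper `quartiles_median_of_halves` is a hypothetical name written for this illustration, not a library function.

```python
import numpy as np

def quartiles_median_of_halves(values):
    """Q1/Q3 as medians of the lower/upper halves (middle value excluded when n is odd)."""
    x = np.sort(np.asarray(values, dtype=float))
    half = len(x) // 2
    lower = x[:half]             # smallest half
    upper = x[len(x) - half:]    # largest half
    return float(np.median(lower)), float(np.median(upper))

data = [2, 4, 5, 6, 100]

q1_h, q3_h = quartiles_median_of_halves(data)
q1_np, q3_np = np.percentile(data, [25, 75])   # NumPy default: linear interpolation

print(f"median-of-halves: Q1={q1_h}, Q3={q3_h}, IQR={q3_h - q1_h}")
print(f"np.percentile:    Q1={q1_np}, Q3={q3_np}, IQR={q3_np - q1_np}")
```

On this five-point list the two conventions disagree sharply (IQR of 50 versus 2), because the median-of-halves rule averages the outlier into Q3. In an interview it is worth stating which convention you are using.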

Outlier detection (Tukey's fences):
  Lower fence: Q1 - 1.5 × IQR
  Upper fence: Q3 + 1.5 × IQR
  Values outside the fences are flagged as potential outliers

Implementation

Python
import numpy as np
from scipy import stats

data = [2, 4, 5, 6, 8, 9, 10, 100]

# Range
data_range = np.max(data) - np.min(data)  # 98

# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1   # scipy: stats.iqr(data)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")

# Tukey's outlier fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

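# Mean Absolute Deviation (MAD): average distance from the mean.
# Note: "MAD" often refers to the *median* absolute deviation elsewhere.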
mean = np.mean(data)
mad = np.mean(np.abs(np.array(data) - mean))

# Coefficient of Variation (CV = std / mean × 100%)
cv = (np.std(data, ddof=1) / np.mean(data)) * 100
# CV makes sense only for positive data with a meaningful zero
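
Because "MAD" can mean either the mean or the median absolute deviation, it may help to see how differently the two react to the outlier in the data above. This is a minimal sketch; `mean_abs_dev` and `median_abs_dev` are hypothetical helper names defined here for illustration.

```python
import numpy as np

def mean_abs_dev(x):
    """Mean absolute deviation around the mean."""
    return float(np.mean(np.abs(x - np.mean(x))))

def median_abs_dev(x):
    """Median absolute deviation around the median (unscaled)."""
    return float(np.median(np.abs(x - np.median(x))))

data = np.array([2, 4, 5, 6, 8, 9, 10, 100], dtype=float)
trimmed = data[data < 100]   # same data with the outlier dropped

for label, x in [("with 100", data), ("without 100", trimmed)]:
    print(
        f"{label:>12}: std={np.std(x, ddof=1):.2f}, "
        f"mean abs dev={mean_abs_dev(x):.2f}, "
        f"median abs dev={median_abs_dev(x):.2f}"
    )
```

The mean-based version moves a lot when the outlier is present, while the median-based version barely changes. SciPy's `stats.median_abs_deviation` computes the median-based version.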

IQR for Outlier Detection in ML

Python
import numpy as np


def iqr_outlier_mask(
    X: np.ndarray,
    threshold: float = 1.5,   # 1.5 = Tukey's mild, 3.0 = extreme outliers
) -> np.ndarray:
    """Return boolean mask: True where a row has at least one outlier feature."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    
    return np.any((X < lower) | (X > upper), axis=1)


import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 28, 27, 200],    # 200 is an outlier
    "systolic_bp": [120, 125, 118, 130, 122],
})

outlier_mask = iqr_outlier_mask(df.values)
print(df[outlier_mask])   # row with age=200
print(df[~outlier_mask])  # clean rows
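
A row-level mask says that something is off but not which feature triggered it. One possible follow-up, sketched here with plain NumPy on the same toy values, is to keep the per-cell mask and list the offending (row, feature, value) triples. The names `cell_mask`, `flagged`, and `outliers_per_feature` are illustrative only.

```python
import numpy as np

feature_names = ["age", "systolic_bp"]
X = np.array([
    [25, 120],
    [30, 125],
    [28, 118],
    [27, 130],
    [200, 122],   # age=200 is the implausible value
], dtype=float)

# Per-feature Tukey fences (threshold 1.5), computed column-wise.
q1 = np.percentile(X, 25, axis=0)
q3 = np.percentile(X, 75, axis=0)
iqr = q3 - q1
cell_mask = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)   # shape (n_rows, n_features)

# List every flagged cell as (row index, feature name, value).
flagged = [
    (int(row), feature_names[col], float(X[row, col]))
    for row, col in zip(*np.nonzero(cell_mask))
]
outliers_per_feature = dict(zip(feature_names, cell_mask.sum(axis=0).tolist()))

print(flagged)
print(outliers_per_feature)
```

Listing the flagged cells makes it easier to decide per feature whether to correct, cap, or drop the value, rather than discarding whole rows blindly.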

Coefficient of Variation

CV = (σ / μ) × 100%

Allows comparison of spread across different scales:

  Model A test accuracy: mean=0.85, std=0.02 → CV = 2.4%
  Model B test accuracy: mean=0.85, std=0.08 → CV = 9.4%
  Same mean, but B's scores vary about 4× more across runs

  Clinical lab example: INR measurements with mean=2.5, std=0.3 → CV = 12%
  If a lab's INR measurements have CV = 30%, its assay is imprecise

Only meaningful for ratio-scale data with μ > 0; as μ approaches 0, the CV becomes unstable
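
The "compares across scales" property can be checked directly: changing units rescales the standard deviation but leaves the CV untouched. The sketch below uses made-up height values, and `coefficient_of_variation` is a hypothetical helper that also guards the μ > 0 condition.

```python
import numpy as np

def coefficient_of_variation(x, ddof=1):
    """CV in percent: sample std divided by mean, times 100."""
    x = np.asarray(x, dtype=float)
    mean = np.mean(x)
    if mean <= 0:
        raise ValueError("CV is only meaningful for data with a positive mean")
    return float(np.std(x, ddof=ddof) / mean * 100.0)

# Made-up heights, expressed in two different units.
heights_cm = np.array([160.0, 172.0, 181.0, 168.0, 175.0])
heights_m = heights_cm / 100.0

std_cm, std_m = np.std(heights_cm, ddof=1), np.std(heights_m, ddof=1)
cv_cm, cv_m = coefficient_of_variation(heights_cm), coefficient_of_variation(heights_m)

print(f"std: {std_cm:.3f} cm vs {std_m:.5f} m")   # differs by a factor of 100
print(f"CV:  {cv_cm:.2f}% vs {cv_m:.2f}%")        # identical up to rounding
```

The guard raises an error instead of returning a misleading number when the mean is zero or negative, which matches the restriction stated above.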

Which Measure to Use When

Measure          | Best for                               | Robust to outliers
-----------------|----------------------------------------|-------------------
Range            | Quick sanity check                     | No
IQR              | Skewed data, outlier detection         | Yes
Std Dev          | Normal-ish data, gradient-based models | No
MAD              | Intuitive absolute spread              | Somewhat
CV               | Comparing across different scales      | No

Interview Answer

"Range is the simplest spread measure, but a single outlier dominates it. The interquartile range (IQR = Q3 - Q1) is robust to outliers and underpins Tukey's fences, which flag values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. Standard deviation is the default for symmetric, roughly normal data. The coefficient of variation (std/mean × 100%) lets me compare spread across features on different scales. In ML, I use IQR-based outlier detection during feature engineering to flag and investigate anomalous training samples, and CV to compare model stability across hyperparameter configurations."