IQR and Outlier Detection
How the interquartile range identifies outliers using Tukey's fences, why it's robust to extreme values, and how to apply it to ML feature engineering.
What Is an Outlier?
An outlier is a value that is unusually far from the rest of the data. Outliers can be:
- Genuine: a patient with truly extreme values (INR of 8.0)
- Errors: data entry mistake (age = 999)
- Interesting: fraudulent transaction, rare disease presentation
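Identifying them before training matters: extreme values can distort scaling statistics and fitted parameters, and the right response depends on which kind of outlier you have.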
The IQR Method
Step 1: Find Q1 (25th percentile) and Q3 (75th percentile)
Step 2: Compute IQR = Q3 - Q1
Step 3: Compute Tukey's fences:
Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR
Step 4: Values outside the fences are outliers
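The four steps map onto a few lines of NumPy. The sketch below is illustrative rather than definitive: `tukey_fences` is a name chosen here, and it assumes NumPy's default linear interpolation for percentiles (other quantile conventions give slightly different quartiles on small samples).

```python
import numpy as np


def tukey_fences(values, multiplier=1.5):
    """Return (lower_fence, upper_fence) for a 1-D sequence of numbers."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])  # Step 1: quartiles
    iqr = q3 - q1                           # Step 2: interquartile range
    lower = q1 - multiplier * iqr           # Step 3: Tukey's fences
    upper = q3 + multiplier * iqr
    return lower, upper


inr = np.array([1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0])
lower, upper = tukey_fences(inr)
outliers = inr[(inr < lower) | (inr > upper)]  # Step 4: values outside the fences
print(outliers)  # [8.]
```

Only the fences depend on the data; the 1.5 multiplier is a convention, so it is exposed as a parameter rather than hard-coded.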
Example: INR values [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0]
Quartiles here use linear interpolation (the np.percentile default); other quantile conventions give slightly different values.
Q1 = 2.25
Q3 = 2.65
IQR = 0.40
Lower fence = 2.25 - 0.60 = 1.65
Upper fence = 2.65 + 0.60 = 3.25
8.0 > 3.25 → OUTLIER (genuinely extreme INR, warrants clinical review)
Why IQR Over Std-Based Outlier Detection
Std-based: flag values beyond mean ± 3σ
Problem with outliers present:
Data: [2, 3, 3, 4, 4, 5, 100]
Mean = 17.3, Std = 36.5 (sample standard deviation)
Upper threshold = 17.3 + 3 × 36.5 = 126.8
→ 100 is NOT flagged (the outlier inflated both the mean and the std)
IQR-based: Q1 = 3, Q3 = 4.5, IQR = 1.5, Upper fence = 4.5 + 2.25 = 6.75
→ 100 IS flagged correctly
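The comparison can be reproduced directly. This sketch assumes the sample standard deviation (`ddof=1`) and NumPy's default percentile interpolation; `upper_limits` is an illustrative name, and the second dataset, with the extreme value made far larger, is added here only to show that the IQR fence does not move.

```python
import numpy as np


def upper_limits(values):
    """Return (mean + 3*std, Q3 + 1.5*IQR) for a 1-D sequence."""
    data = np.asarray(values, dtype=float)
    std_limit = data.mean() + 3 * data.std(ddof=1)  # sample std, assumed here
    q1, q3 = np.percentile(data, [25, 75])
    iqr_limit = q3 + 1.5 * (q3 - q1)
    return std_limit, iqr_limit


data = [2, 3, 3, 4, 4, 5, 100]
std_limit, iqr_limit = upper_limits(data)
print(100 > std_limit)  # False: the outlier inflated the std, so it is missed
print(100 > iqr_limit)  # True: the IQR fence (6.75) catches it

# Make the extreme value much larger: the quartiles, and so the fence, stay put.
_, iqr_limit_big = upper_limits([2, 3, 3, 4, 4, 5, 1_000_000])
print(iqr_limit == iqr_limit_big)  # True
```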
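IQR is robust because Q1 and Q3 depend only on the ranks of the middle of the data: an extreme value can grow arbitrarily large without moving them, as long as fewer than 25% of the points on that side are contaminated.
Python Implementation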
from __future__ import annotations  # lets `list[str] | None` parse on Python < 3.10

import numpy as np
import pandas as pd


def iqr_outlier_detection(
    data: np.ndarray,
    multiplier: float = 1.5,  # 1.5 = mild outliers, 3.0 = extreme
) -> dict:
    """Flag values outside Tukey's fences. `data` must be 1-D and NaN-free."""
    data = np.asarray(data, dtype=float)  # also accept lists and pandas Series
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    outlier_mask = (data < lower_fence) | (data > upper_fence)
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outlier_indices": np.where(outlier_mask)[0].tolist(),
        "outlier_values": data[outlier_mask].tolist(),
        "n_outliers": int(outlier_mask.sum()),
        "pct_outliers": float(outlier_mask.mean() * 100),
    }


# DataFrame column-wise
def detect_outliers_df(
    df: pd.DataFrame,
    numeric_cols: list[str] | None = None,
    multiplier: float = 1.5,
) -> pd.DataFrame:
    """Summarise IQR outliers per numeric column, worst column first."""
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    report = []
    for col in numeric_cols:
        # Drop missing values first: np.percentile returns NaN if any input is NaN
        result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
        report.append({
            "column": col,
            "n_outliers": result["n_outliers"],
            "pct_outliers": result["pct_outliers"],
            "lower_fence": result["lower_fence"],
            "upper_fence": result["upper_fence"],
        })
    return pd.DataFrame(report).sort_values("pct_outliers", ascending=False)
Handling Outliers: Options
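Which points reach this stage depends on the `multiplier` chosen during detection. As a rough illustration (the data and the helper name `count_flagged` are made up for this sketch), widening the fences from 1.5 to 3.0 leaves only the most extreme point flagged:

```python
import numpy as np


def count_flagged(values, multiplier):
    """Count values below Q1 - m*IQR or above Q3 + m*IQR."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outside = (data < q1 - multiplier * iqr) | (data > q3 + multiplier * iqr)
    return int(outside.sum())


values = [10, 11, 12, 12, 13, 13, 14, 15, 20, 40]
print(count_flagged(values, 1.5))  # 2: both 20 and 40 are flagged
print(count_flagged(values, 3.0))  # 1: only 40 remains
```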
# 1. Remove outliers (if they're errors)
def remove_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    # Compute fences on non-missing values: np.percentile returns NaN if any input is NaN
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    # Rows where `col` is missing fail both comparisons and are dropped as well
    mask = (df[col] >= result["lower_fence"]) & (df[col] <= result["upper_fence"])
    return df[mask]


# 2. Cap/clip outliers (Winsorisation): keep the row but limit extreme values
def winsorise(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()  # avoid mutating the caller's DataFrame
    df[col] = df[col].clip(lower=result["lower_fence"], upper=result["upper_fence"])
    return df


# 3. Flag outliers as a feature (the extreme value may itself be informative)
def flag_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()
    df[f"{col}_is_outlier"] = (
        (df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])
    ).astype(int)
    return df


# 4. Investigate (don't blindly remove; clinical data anomalies may be real)
def inspect_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    return df[(df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])]
Clinical Consideration
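Before applying any of the handling options above to clinical data, it helps to see what each one does to a flagged value. The sketch below is self-contained and uses a toy `inr` column with a hypothetical `patient_id`; it computes the fences with `Series.quantile`, which by default uses the same linear interpolation as `np.percentile`.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": range(1, 9),  # hypothetical identifiers
    "inr": [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0],
})

q1, q3 = df["inr"].quantile(0.25), df["inr"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["inr"] < lower) | (df["inr"] > upper)

removed = df[~is_outlier]                                          # 1. drop the row
clipped = df.assign(inr=df["inr"].clip(lower=lower, upper=upper))  # 2. cap the value
flagged = df.assign(inr_is_outlier=is_outlier.astype(int))         # 3. keep it, add a flag

print(len(removed))                                    # 7: the INR 8.0 row is gone
print(clipped["inr"].max())                            # the upper fence, about 3.25
print(flagged.loc[is_outlier, ["patient_id", "inr"]])  # 4. rows to investigate
```

Removal and clipping both discard the information that this patient's INR was 8.0; only flagging and inspection preserve it.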
In clinical ML, outliers require domain-expertise review, not automatic removal:
INR = 8.0: might be a genuine supratherapeutic result
    → Important clinical event, should stay in training data
Age = 999: data entry error
    → Remove or impute
Weight = 300 kg: possible, but unusual
    → Investigate: is this lbs vs kg confusion? Or a genuinely obese patient?
Creatinine = 0.01: below plausible physiology
    → Likely a data error, remove
Rule: flag outliers for investigation, don't automatically delete.
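One way to follow this rule is to separate values that are physically implausible from values that are merely unusual, and to label both for review rather than deleting them. The sketch below is illustrative only: the `PLAUSIBLE_RANGE` limits, the `triage` helper, and the status labels are assumptions made for this example, not clinical reference ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical hard limits for illustration; real limits need clinical input.
PLAUSIBLE_RANGE = {"age": (0, 120), "inr": (0.5, 20.0)}


def triage(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.Series:
    """Label each value 'ok', 'review' (outside Tukey's fences) or 'implausible'."""
    lo, hi = PLAUSIBLE_RANGE[col]
    plausible = df[col].between(lo, hi)
    # Compute fences on plausible values only, so obvious errors do not shift them.
    q1, q3 = df.loc[plausible, col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (df[col] < q1 - multiplier * iqr) | (df[col] > q3 + multiplier * iqr)
    status = np.where(~plausible, "implausible", np.where(outside, "review", "ok"))
    return pd.Series(status, index=df.index, name=f"{col}_status")


df = pd.DataFrame({
    "age": [34, 51, 47, 62, 58, 45, 999, 39],
    "inr": [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0],
})
print(triage(df, "age").tolist())  # age 999 -> 'implausible'
print(triage(df, "inr").tolist())  # INR 8.0 -> 'review'
```

Nothing is deleted here: the labels only route rows to a reviewer. Missing values are not handled in this sketch and would be labelled 'implausible', because `between` returns False for NaN.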
Document every decision in your data cleaning pipeline.
Interview Answer
"IQR outlier detection uses Tukey's fences: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged as outliers. IQR is preferred over std-based detection because it's robust: outliers inflate the mean and standard deviation, which can cause std-based methods to miss them. There are three main options for handling outliers: remove (if they are data entry errors), winsorise/clip (to keep the row but cap the extreme value), or flag as a feature (if the extreme value itself is informative). For clinical data, I always investigate before removing, because an INR of 8.0 might be a genuine patient event that's important for training the model to recognise danger cases."