Statistics & Math for AI/ML Interviews · Lesson 8 of 30
IQR and Outlier Detection
What Is an Outlier?
An outlier is a value that is unusually far from the rest of the data. Outliers can be:
- Genuine: a patient with truly extreme values (INR of 8.0)
- Errors: data entry mistake (age = 999)
- Interesting: fraudulent transaction, rare disease presentation
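Identify them before training ML models: a few extreme values can distort means, variances, feature scaling, and any model fitted by minimising squared error.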
The IQR Method
Step 1: Find Q1 (25th percentile) and Q3 (75th percentile)
Step 2: Compute IQR = Q3 - Q1
Step 3: Compute Tukey's fences:
Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR
Step 4: Values outside the fences are outliers
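The four steps can be traced directly in NumPy. This is a minimal sketch, assuming np.percentile's default linear interpolation for the quartiles (other quantile conventions give slightly different Q1 and Q3). The values are the INR readings used in this lesson's worked example.

```python
import numpy as np

inr = np.array([1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0])

# Step 1: quartiles (linear interpolation is np.percentile's default)
q1, q3 = np.percentile(inr, [25, 75])

# Step 2: interquartile range
iqr = q3 - q1

# Step 3: Tukey's fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 4: values outside the fences
outliers = inr[(inr < lower_fence) | (inr > upper_fence)]
print(f"fences = [{lower_fence:.2f}, {upper_fence:.2f}], outliers = {outliers.tolist()}")
```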
Example: INR values [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0]
(quartiles by linear interpolation, the np.percentile default)
Q1 = 2.25
Q3 = 2.65
IQR = 0.40
Lower fence = 2.25 - 0.60 = 1.65
Upper fence = 2.65 + 0.60 = 3.25
8.0 > 3.25 → OUTLIER (genuinely extreme INR, warrants clinical review)
Why IQR Over Std-Based Outlier Detection
Std-based: flag values beyond mean ± 3σ
Problem with outliers present:
Data: [2, 3, 3, 4, 4, 5, 100]
Mean = 17.3, Std = 36.5 (sample std, ddof=1)
Upper limit = 17.3 + 3 × 36.5 = 126.8
→ 100 is NOT flagged (the outlier inflated the std)
IQR-based: Q1=3, Q3=4.5, IQR=1.5, Upper=4.5+2.25=6.75
→ 100 IS flagged correctly
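The comparison can be reproduced in a few lines. This is a sketch assuming the sample standard deviation (ddof=1) for the 3σ rule and np.percentile's default interpolation for the quartiles.

```python
import numpy as np

data = np.array([2, 3, 3, 4, 4, 5, 100], dtype=float)

# Std-based rule: flag values more than 3 sample standard deviations from the mean
mean = data.mean()
std = data.std(ddof=1)
std_flags = np.abs(data - mean) > 3 * std

# IQR-based rule: flag values outside Tukey's fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_flags = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

print("std-based flags:", data[std_flags].tolist())
print("IQR-based flags:", data[iqr_flags].tolist())
```

Switching to the population standard deviation (ddof=0) gives a slightly smaller σ, but 100 still falls inside the 3σ band.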
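IQR is robust because Q1 and Q3 depend on the ranks of the middle values, so a handful of extreme values barely moves them (the fences only break down once roughly a quarter of the data is contaminated).
Python Implementation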
from __future__ import annotations  # lets the `list[str] | None` hint run on Python < 3.10

import numpy as np
import pandas as pd


def iqr_outlier_detection(
    data: np.ndarray,
    multiplier: float = 1.5,  # 1.5 = mild outliers, 3.0 = extreme
) -> dict:
    """Return Tukey's fences and the values they flag. Assumes `data` contains no NaNs."""
    data = np.asarray(data, dtype=float)  # also accepts plain lists
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    outlier_mask = (data < lower_fence) | (data > upper_fence)
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outlier_indices": np.where(outlier_mask)[0].tolist(),
        "outlier_values": data[outlier_mask].tolist(),
        "n_outliers": int(outlier_mask.sum()),
        "pct_outliers": float(outlier_mask.mean() * 100),
    }

# DataFrame column-wise
def detect_outliers_df(
    df: pd.DataFrame,
    numeric_cols: list[str] | None = None,
    multiplier: float = 1.5,
) -> pd.DataFrame:
    """Summarise IQR outliers per numeric column, most affected columns first."""
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    report = []
    for col in numeric_cols:
        # Missing values are dropped before the quartiles are computed
        result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
        report.append({
            "column": col,
            "n_outliers": result["n_outliers"],
            "pct_outliers": result["pct_outliers"],
            "lower_fence": result["lower_fence"],
            "upper_fence": result["upper_fence"],
        })
    return pd.DataFrame(report).sort_values("pct_outliers", ascending=False)
Handling Outliers: Options
# 1. Remove outliers (if they're errors)
def remove_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    # Drop NaNs before computing fences; np.percentile returns NaN if any value is missing
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    # Rows where `col` is missing also fail this comparison and are dropped
    mask = (df[col] >= result["lower_fence"]) & (df[col] <= result["upper_fence"])
    return df[mask]

# 2. Cap/clip outliers (Winsorisation): keep the row but limit extreme values
def winsorise(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()
    df[col] = df[col].clip(lower=result["lower_fence"], upper=result["upper_fence"])
    return df

# 3. Flag outliers as a feature (the extreme value may itself be informative)
def flag_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()
    df[f"{col}_is_outlier"] = (
        (df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])
    ).astype(int)
    return df

# 4. Investigate (don't blindly remove: clinical data anomalies may be real)
def inspect_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    return df[(df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])]
Clinical Consideration
In clinical ML, outliers require domain-expertise review — not automatic removal:
INR = 8.0: might be a genuine supratherapeutic result
→ Important clinical event, should stay in training data
Age = 999: data entry error
→ Remove or impute
Weight = 300 kg: possible, but unusual
→ Investigate: was the value recorded in lb rather than kg, or is it a genuine measurement?
Creatinine = 0.01: below plausible physiology
→ Likely a data error, remove
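One way to encode this triage is to separate hard plausibility limits (values that cannot be real) from statistical flags (values that are unusual but possible). The sketch below is an illustration only: the function name `triage_values` is hypothetical, the limits passed in are placeholders rather than clinical reference ranges, and the age values are made up for the example.

```python
import numpy as np
import pandas as pd


def triage_values(
    values: pd.Series,
    plausible_min: float,
    plausible_max: float,
    multiplier: float = 1.5,
) -> pd.Series:
    """Label each value 'ok', 'review', or 'implausible'. Assumes no missing values."""
    # Hard limits catch entries that cannot be real (e.g. age = 999)
    plausible = values.between(plausible_min, plausible_max)

    # Fences are computed from plausible values only, so an entry error cannot distort them
    q1, q3 = np.percentile(values[plausible].to_numpy(), [25, 75])
    iqr = q3 - q1
    outside_fences = (values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)

    labels = pd.Series("ok", index=values.index, dtype=object)
    labels[outside_fences] = "review"    # unusual but possible: send to a domain expert
    labels[~plausible] = "implausible"   # outside the hard limits: treat as a data error
    return labels


# INR readings from this lesson; the limits 0.5 and 20 are placeholders for the example
inr = pd.Series([1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0])
print(triage_values(inr, plausible_min=0.5, plausible_max=20).tolist())

# Hypothetical ages with one entry error; the limits 0 and 120 are also placeholders
ages = pd.Series([34, 41, 47, 52, 58, 63, 999])
print(triage_values(ages, plausible_min=0, plausible_max=120).tolist())
```

Nothing is deleted here: the labels only route each value to the right kind of review, which keeps the final decision with someone who knows the domain.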
Rule: flag outliers for investigation, don't automatically delete.
Document every decision in your data cleaning pipeline.
Interview Answer
"IQR outlier detection uses Tukey's fences: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are outliers. IQR is preferred over std-based detection because it's robust: outliers inflate the mean and standard deviation, which causes std-based methods to miss them. The main options for handling outliers are to remove them (if they are data entry errors), winsorise/clip them (if keeping the row but not the extreme value), or flag them as a feature (if the extreme value itself is informative). For clinical data, I always investigate before removing, because an INR of 8.0 might be a genuine patient event that's important for training the model to recognise danger cases."