Learnixo

Statistics & Math for AI/ML Interviews · Lesson 8 of 30

IQR and Outlier Detection

What Is an Outlier?

An outlier is a value that is unusually far from the rest of the data. Outliers can be:

  • Genuine: a patient with truly extreme values (INR of 8.0)
  • Errors: data entry mistake (age = 999)
  • Interesting: fraudulent transaction, rare disease presentation

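Identify them before training ML models: a few extreme values can distort means, variances, feature scaling, and the parameters of models fitted with squared-error loss.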


The IQR Method

Step 1: Find Q1 (25th percentile) and Q3 (75th percentile)
Step 2: Compute IQR = Q3 - Q1
Step 3: Compute Tukey's fences:
  Lower fence = Q1 - 1.5 × IQR
  Upper fence = Q3 + 1.5 × IQR
Step 4: Flag values outside the fences as outliers

Example: INR values [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0]
(quartiles by linear interpolation, the np.percentile default; other conventions shift them slightly)
  Q1 = 2.25
  Q3 = 2.65
  IQR = 0.40
  Lower fence = 2.25 - 0.60 = 1.65
  Upper fence = 2.65 + 0.60 = 3.25
  
  8.0 > 3.25 → OUTLIER  (genuinely extreme INR, warrants clinical review)
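The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes NumPy's default linear-interpolation quartiles; other quartile conventions (for example, taking the medians of the lower and upper halves) give slightly different fences, but 8.0 falls outside them either way.

```python
import numpy as np

# INR values from the example above
inr = np.array([1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0])

# Quartiles with NumPy's default (linear interpolation) method
q1, q3 = np.percentile(inr, [25, 75])
iqr = q3 - q1

# Tukey's fences with the conventional 1.5 multiplier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = inr[(inr < lower_fence) | (inr > upper_fence)]

print(f"Q1={q1:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
print(f"fences=({lower_fence:.2f}, {upper_fence:.2f})")
print("outliers:", outliers.tolist())
```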

Why IQR Over Std-Based Outlier Detection

Std-based: flag values beyond mean ± 3σ

Problem with outliers present:
  Data: [2, 3, 3, 4, 4, 5, 100]
  Mean = 17.3, Std = 36.5 (sample std, n - 1 denominator)
  Upper limit = 17.3 + 3 × 36.5 = 126.8
  → 100 is NOT flagged (the outlier inflated the std)

IQR-based: Q1=3, Q3=4.5, IQR=1.5, Upper=4.5+2.25=6.75
  → 100 IS flagged correctly

IQR is robust because Q1 and Q3 depend only on the middle of the sorted data: they barely move unless roughly a quarter or more of the values are extreme.
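The comparison above can be reproduced directly. A minimal sketch, assuming the sample standard deviation (`ddof=1`) for the 3σ rule and NumPy's default quartile interpolation for the fences; the exact std depends on that choice, but the conclusion does not.

```python
import numpy as np

data = np.array([2, 3, 3, 4, 4, 5, 100], dtype=float)

# Std-based rule: flag values more than 3 sample standard deviations from the mean
mean = data.mean()
std = data.std(ddof=1)
z_scores = (data - mean) / std
std_flagged = data[np.abs(z_scores) > 3]

# IQR-based rule: flag values outside Tukey's fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_flagged = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(f"largest |z| = {np.abs(z_scores).max():.2f}")
print("flagged by 3-sigma rule:", std_flagged.tolist())
print("flagged by IQR rule:", iqr_flagged.tolist())
```

With only seven points, the sample z-score of any single value cannot exceed (n - 1)/√n ≈ 2.27, so the 3σ rule cannot flag anything in a sample this small, however extreme the value.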

Python Implementation

Python
import numpy as np
import pandas as pd

def iqr_outlier_detection(
    data: np.ndarray,
    multiplier: float = 1.5,    # 1.5 = mild outliers, 3.0 = extreme
) -> dict:
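    # Accept lists and Series too; NaNs are ignored when computing the quartiles
    data = np.asarray(data, dtype=float)
    q1 = np.nanpercentile(data, 25)
    q3 = np.nanpercentile(data, 75)
    iqr = q3 - q1
    
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    
    # Comparisons with NaN evaluate to False, so missing values are never flagged
    outlier_mask = (data < lower_fence) | (data > upper_fence)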
    
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outlier_indices": np.where(outlier_mask)[0].tolist(),
        "outlier_values": data[outlier_mask].tolist(),
        "n_outliers": int(outlier_mask.sum()),
        "pct_outliers": float(outlier_mask.mean() * 100),
    }


# DataFrame column-wise
def detect_outliers_df(
    df: pd.DataFrame,
    numeric_cols: list[str] | None = None,
    multiplier: float = 1.5,
) -> pd.DataFrame:
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    report = []
    for col in numeric_cols:
        result = iqr_outlier_detection(df[col].dropna().values, multiplier)
        report.append({
            "column": col,
            "n_outliers": result["n_outliers"],
            "pct_outliers": result["pct_outliers"],
            "lower_fence": result["lower_fence"],
            "upper_fence": result["upper_fence"],
        })
    
    return pd.DataFrame(report).sort_values("pct_outliers", ascending=False)
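For wide tables, a similar report can be produced without a Python loop. The following is a sketch of a vectorised alternative, not a replacement for the function above: it relies on `DataFrame.quantile`, which skips NaNs and uses linear interpolation by default. The function name `iqr_outlier_counts` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, multiplier: float = 1.5) -> pd.DataFrame:
    """Per-column fences and outlier counts for all numeric columns."""
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)   # Series indexed by column name
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Column-aligned comparison; NaN cells compare as False and are not counted
    mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return pd.DataFrame({
        "lower_fence": lower,
        "upper_fence": upper,
        "n_outliers": mask.sum(),
        "pct_outliers": mask.mean() * 100,
    }).sort_values("pct_outliers", ascending=False)

# Toy data, made up for illustration
toy = pd.DataFrame({
    "inr": [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0],
    "age": [34, 45, 51, 29, 62, 47, 58, 999],
})
report = iqr_outlier_counts(toy)
print(report)
```

Note that `pct_outliers` here is a share of all rows, including rows where the column is missing, whereas the loop version above drops NaNs first.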

Handling Outliers: Options

Python
# 1. Remove outliers (if they're errors)
def remove_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
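    # Compute fences on non-missing values so NaNs cannot turn the fences into NaN
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    in_range = df[col].between(result["lower_fence"], result["upper_fence"])
    # Keep rows with missing values: missing is not the same as outlying
    return df[in_range | df[col].isna()]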


# 2. Cap/clip outliers (Winsorisation-style): keep the row but limit extreme values to the fences
def winsorise(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    # Fences are computed on non-missing values; NaNs pass through clip unchanged
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()
    df[col] = df[col].clip(lower=result["lower_fence"], upper=result["upper_fence"])
    return df


# 3. Flag outliers as a feature (the extreme value may itself be informative)
def flag_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    # Fences are computed on non-missing values; rows with NaN get a flag of 0
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    df = df.copy()
    df[f"{col}_is_outlier"] = (
        (df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])
    ).astype(int)
    return df


# 4. Investigate (don't blindly remove: clinical data anomalies may be real)
def inspect_outliers(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    result = iqr_outlier_detection(df[col].dropna().to_numpy(), multiplier)
    return df[(df[col] < result["lower_fence"]) | (df[col] > result["upper_fence"])]
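To see how the first three options differ in effect, here is a self-contained sketch on the INR example. It recomputes the fences inline with `Series.quantile` rather than calling the helpers above, so it can be run on its own; the column name `inr` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"inr": [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0]})

# Tukey's fences (pandas uses linear interpolation by default)
q1, q3 = df["inr"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["inr"] < lower) | (df["inr"] > upper)

# Option 1: drop the row
removed = df[~is_outlier]
# Option 2: keep the row, cap the value at the nearest fence
clipped = df.assign(inr=df["inr"].clip(lower=lower, upper=upper))
# Option 3: keep the value, add an indicator column
flagged = df.assign(inr_is_outlier=is_outlier.astype(int))

print(len(df), "rows ->", len(removed), "rows after removal")
print("max after clipping:", round(clipped["inr"].max(), 2))
print("rows flagged:", int(flagged["inr_is_outlier"].sum()))
```

Clipping replaces 8.0 with the upper fence, which hides how extreme the reading was; flagging preserves the original value and lets the model use the indicator.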

Clinical Consideration

In clinical ML, outliers require domain-expertise review — not automatic removal:

INR = 8.0: might be a genuine supratherapeutic result
  → Important clinical event, should stay in training data
  
Age = 999: data entry error
  → Remove or impute

Weight = 300 kg: possible, but unusual
  → Investigate: is this a lbs vs kg mix-up, or a genuinely obese patient?

Creatinine = 0.01: below plausible physiology
  → Likely a data error, remove

Rule: flag outliers for investigation, don't automatically delete.
      Document every decision in your data cleaning pipeline.
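One way to put this rule into practice is to separate physiologically implausible values (likely errors) from statistically unusual but possible ones (candidates for review), and to record the reason alongside each value. The sketch below is illustrative only: the limits in `PLAUSIBLE_RANGE`, the column names, and the `triage` helper are hypothetical and are not clinical guidance.

```python
import pandas as pd

# Hypothetical hard limits agreed with domain experts (illustrative values only)
PLAUSIBLE_RANGE = {"age": (0, 120), "inr": (0.5, 20.0)}

def triage(df: pd.DataFrame, col: str, multiplier: float = 1.5) -> pd.DataFrame:
    """Label each value as 'ok', 'review' (outside IQR fences) or 'implausible'."""
    lo, hi = PLAUSIBLE_RANGE[col]
    plausible = df[col].between(lo, hi)
    # Compute fences on plausible values only, so obvious errors don't distort them
    q1, q3 = df.loc[plausible, col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (df[col] < q1 - multiplier * iqr) | (df[col] > q3 + multiplier * iqr)

    out = df.copy()
    out[f"{col}_status"] = "ok"
    out.loc[outside, f"{col}_status"] = "review"
    # Implausible overrides review; missing values are left as 'ok' here
    out.loc[~plausible & df[col].notna(), f"{col}_status"] = "implausible"
    return out

# Toy data, made up for illustration
patients = pd.DataFrame({
    "age": [34, 45, 51, 29, 62, 47, 58, 999],
    "inr": [1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 8.0],
})
result = triage(triage(patients, "age"), "inr")
print(result[["age", "age_status", "inr", "inr_status"]])
```

Keeping the status columns in the dataset, rather than deleting rows on the spot, leaves an audit trail that a clinician can review before any value is removed or imputed.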

Interview Answer

"IQR outlier detection uses Tukey's fences: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged as outliers. IQR is preferred over std-based detection because it's robust: outliers inflate the mean and standard deviation, which causes std-based methods to miss them. There are three main options for handling outliers: remove (if they are data entry errors), winsorise/clip (to keep the row but not the extreme value), or flag as a feature (if the extreme value itself is informative). For clinical data, I always investigate before removing, because an INR of 8.0 might be a genuine patient event that's important for training the model to recognise danger cases."