Machine Learning Foundations · Lesson 37 of 70

How to Handle Missing Values

Why Missing Values Happen

MCAR — Missing Completely At Random
  Example: lab test not ordered (random clinician variation)
  Impact: dropping rows does not bias results but discards data; imputation keeps the full sample size
  Detectable: missingness is uncorrelated with every other variable (only the observed ones can be tested)

MAR — Missing At Random (conditional on observed data)
  Example: creatinine not measured for young, healthy patients
  Impact: imputation conditioned on age/health status is valid
  Detectable: missingness is correlated with other observed variables

MNAR — Missing Not At Random
  Example: extremely high creatinine values not reported (too sick to draw blood)
  Impact: simple imputation is biased — missingness carries signal
  Fix: model the missingness explicitly, or use a "missingness indicator" feature
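
The three mechanisms above can be made concrete with a small simulation. The sketch below is illustrative only: the cohort is synthetic, and the age/creatinine relationship, the missingness probabilities, and the helper `observed_mean` are all assumptions chosen for the demonstration, not clinical facts. It compares the mean of the observed values with the true mean under each mechanism, then shows that conditioning on age removes most of the bias in the MAR case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical synthetic cohort: creatinine rises with age (assumed relationship).
age = rng.uniform(20, 90, size=n)
creatinine = 0.6 + 0.01 * age + rng.normal(0, 0.2, size=n)
true_mean = creatinine.mean()

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3

# MAR: missingness depends only on an observed variable (age).
# Younger patients are less likely to have the lab drawn.
mar_missing = rng.random(n) < np.where(age < 50, 0.6, 0.1)

# MNAR: missingness depends on the unobserved value itself.
# High creatinine values are the ones that go unrecorded.
mnar_missing = rng.random(n) < np.where(creatinine > 1.4, 0.7, 0.05)


def observed_mean(values, missing):
    """Mean of the values that were actually recorded."""
    return values[~missing].mean()


bias_mcar = observed_mean(creatinine, mcar_missing) - true_mean
bias_mar = observed_mean(creatinine, mar_missing) - true_mean
bias_mnar = observed_mean(creatinine, mnar_missing) - true_mean

# MAR can be repaired by conditioning on the observed driver of missingness:
# fit creatinine ~ age on the observed rows, then predict the missing ones.
slope, intercept = np.polyfit(age[~mar_missing], creatinine[~mar_missing], deg=1)
filled = np.where(mar_missing, intercept + slope * age, creatinine)
bias_mar_conditioned = filled.mean() - true_mean

print(f"MCAR bias (drop rows):        {bias_mcar:+.3f}")
print(f"MAR bias (drop rows):         {bias_mar:+.3f}")
print(f"MAR bias (impute from age):   {bias_mar_conditioned:+.3f}")
print(f"MNAR bias (drop rows):        {bias_mnar:+.3f}")
```

In this setup, dropping rows is harmless only under MCAR. The same regression trick would not rescue the MNAR case, because the variable that drives missingness is the one that is missing.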

Checking for Missing Values

Python

import pandas as pd
import numpy as np

# Example: EHR dataset with clinical labs
df = pd.DataFrame({
    "age":              [45, 67, 32, np.nan, 55],
    "serum_creatinine": [1.2, np.nan, 0.8, 2.1, 1.5],
    "hba1c":            [7.2, 8.1, np.nan, np.nan, 6.9],
    "num_medications":  [5, 12, 3, 8, np.nan],
    "readmitted":       [0, 1, 0, 1, 0],
})

# Overview
print(df.isnull().sum())
print(df.isnull().mean().round(3))   # Fraction missing per feature

# Is missingness correlated with the target?
for col in df.columns[:-1]:
    missing_mask = df[col].isnull()
    if missing_mask.any():
        readmit_rate_missing = df.loc[missing_mask, "readmitted"].mean()
        readmit_rate_present = df.loc[~missing_mask, "readmitted"].mean()
        print(f"{col}: missing readmit_rate={readmit_rate_missing:.2f}, "
              f"present readmit_rate={readmit_rate_present:.2f}")
# If rates differ → missingness is informative (MAR or MNAR; observed data alone
# cannot tell them apart) → consider a missingness indicator feature
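
A complementary check is to compare the other observed features between rows where a column is missing and rows where it is present; a clear difference is evidence against MCAR and points toward MAR. The sketch below uses a synthetic dataset and a hypothetical helper, `missingness_vs_observed`, both introduced here for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000

# Synthetic data (assumption): creatinine is rarely measured in patients under 40.
age = rng.integers(20, 90, size=n).astype(float)
creatinine = 0.6 + 0.01 * age + rng.normal(0, 0.2, size=n)
skipped = (age < 40) & (rng.random(n) < 0.7)
creatinine[skipped] = np.nan

demo = pd.DataFrame({"age": age, "serum_creatinine": creatinine})


def missingness_vs_observed(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Mean of every other numeric column, split by whether `col` is missing."""
    status = (
        frame[col]
        .isnull()
        .map({True: "missing", False: "present"})
        .rename(f"{col}_status")
    )
    return frame.drop(columns=col).groupby(status).mean(numeric_only=True)


summary = missingness_vs_observed(demo, "serum_creatinine")
print(summary.round(1))
```

Here the mean age is much lower in the "missing" group, which is what a MAR mechanism driven by age looks like. A similar table with no visible gap is consistent with MCAR but does not rule out MNAR.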

Imputation Strategies

Simple Imputation

Python

from sklearn.impute import SimpleImputer
import numpy as np

X_train = np.array([
    [45,  1.2, 7.2,  5],
    [67, np.nan, 8.1, 12],
    [32,  0.8, np.nan, 3],
    [np.nan, 2.1, 8.8,  8],
])

# Mean imputation (for symmetric distributions)
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X_train)

# Median imputation (for skewed distributions, outliers)
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X_train)

# Most frequent (for categorical or ordinal features)
mode_imputer = SimpleImputer(strategy="most_frequent")

# Constant (e.g., -1 as "not measured" sentinel)
const_imputer = SimpleImputer(strategy="constant", fill_value=-1)
X_const = const_imputer.fit_transform(X_train)

print("After median imputation:")
print(X_median.round(2))
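
To see what a fitted imputer actually stores, it can help to inspect its `statistics_` attribute and apply it to rows it has not seen. The following is a minimal sketch on a two-column toy array (age and creatinine); the arrays `X_fit` and `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data: columns are age and serum creatinine.
X_fit = np.array([
    [45.0, 1.2],
    [67.0, np.nan],
    [32.0, 0.8],
    [np.nan, 2.1],
])

# New rows that arrive later (for example, a test set).
X_new = np.array([
    [np.nan, np.nan],
    [50.0, 1.0],
])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_fit)

# One learned fill value per column, computed from the non-missing entries.
print(imputer.statistics_)

# transform() reuses the stored medians; it does not recompute them on X_new.
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

The medians come from the observed values in each column of `X_fit` (45 for age, 1.2 for creatinine), and the fully missing first row of `X_new` is filled with exactly those values.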

KNN Imputation

Python

from sklearn.impute import KNNImputer

# Impute based on the k nearest neighbors (using observed features)
# Better for MCAR/MAR when patient similarity is informative

knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X_train)

print("After KNN imputation:")
print(X_knn.round(2))
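# Each missing value is replaced by the mean of that feature over the nearest
# neighbors that have it observed (uniform weights by default; pass
# weights="distance" for a distance-weighted average)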
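
The neighbor averaging is easier to follow on an example small enough to check by hand. The sketch below uses a made-up four-row array in which one value is missing, and compares the default uniform weighting with `weights="distance"`.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 2 is missing its second feature.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [10.0, 100.0],
])

# Distances are computed on the features both rows have observed (here, column 0).
# The two rows closest to row 2 are row 1 (value 20) and row 0 (value 10).
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_toy)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_toy)

# Uniform: plain mean of the two neighbor values.
print(uniform[2, 1])

# Distance: closer neighbors count more (weights proportional to 1 / distance).
print(weighted[2, 1])
```

With uniform weights the imputed value is the plain mean of 20 and 10, which is 15. With distance weights the nearer neighbor (row 1) is twice as close as row 0, so the result is pulled toward 20, giving 50/3 ≈ 16.67.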

Iterative Imputation (MICE)

Python

from sklearn.experimental import enable_iterative_imputer   # required
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Models each feature as a function of all others (MICE-style chained equations)
# By default this returns a single imputed dataset, not multiple imputations
# Best for MAR: uses correlations between features to impute

mice_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
)
X_mice = mice_imputer.fit_transform(X_train)
print("After MICE imputation:")
print(X_mice.round(2))
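
If several imputed datasets are wanted, as in classical multiple imputation, one common approach with scikit-learn is to set `sample_posterior=True` and rerun the imputer with different seeds. The sketch below assumes the default `BayesianRidge` estimator (posterior sampling needs an estimator that can return a predictive standard deviation) and uses synthetic data; the number of imputations `m = 5` is an arbitrary choice.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300

# Synthetic data (assumption): x1 is strongly correlated with x0.
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.5, size=n)
X_missing = np.column_stack([x0, x1])

# Remove about 30% of x1 at random.
missing_rows = rng.random(n) < 0.3
X_missing[missing_rows, 1] = np.nan

# Draw m imputed datasets, each with a different seed.
m = 5
imputations = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputations.append(imputer.fit_transform(X_missing))

stacked = np.stack(imputations)                # shape: (m, n_samples, n_features)

# Pool a simple estimate (the mean of x1) across the imputed datasets.
per_dataset_means = stacked[:, :, 1].mean(axis=1)
pooled_mean = per_dataset_means.mean()
between_variance = per_dataset_means.var(ddof=1)

print(f"pooled mean of x1: {pooled_mean:.3f}")
print(f"between-imputation variance: {between_variance:.5f}")
```

Observed entries are identical in every imputed dataset; only the filled-in values vary. The spread between datasets gives a rough sense of how uncertain the imputations are.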

The MNAR Fix: Missingness Indicators

Python

import pandas as pd
import numpy as np

def add_missingness_indicators(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """
    For features suspected to be MNAR: add a binary flag '<col>_missing' before imputing.
    The flag records that the value was missing, which is a signal in itself.
    """
    df = df.copy()
    for col in cols:
        if df[col].isnull().any():
            df[f"{col}_missing"] = df[col].isnull().astype(int)
    return df

# Clinical example: extreme creatinine values may not be drawn (patient too sick)
# → high creatinine and missingness are correlated
# → add a creatinine_missing indicator before imputing

df_with_flags = add_missingness_indicators(
    df,
    cols=["serum_creatinine", "hba1c"]
)
print(df_with_flags.columns.tolist())
# → [..., 'serum_creatinine_missing', 'hba1c_missing']
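
scikit-learn can also append the indicator columns itself through the `add_indicator` parameter of its imputers, which keeps the flags inside a pipeline. A minimal sketch on a made-up array, shown as one possible alternative to the pandas helper above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: age, serum creatinine. Only creatinine has a missing value.
X_labs = np.array([
    [45.0, 1.2],
    [67.0, np.nan],
    [32.0, 0.8],
    [55.0, 2.1],
])

# add_indicator=True appends one binary column per feature that had
# missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X_labs)

print(X_out)
print(imputer.indicator_.features_)   # indices of the flagged input features
```

The output has three columns: the two imputed features followed by a creatinine indicator that is 1 only for the second row. Age gets no indicator because it had no missing values at fit time.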

Imputation in a Pipeline (No Leakage)

Python

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

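# The imputer must be fitted on training data only
# A Pipeline handles this correctly in cross-validation: it refits every step inside each fold
# X (feature matrix, may contain NaN) and y (binary labels) are assumed to be defined already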

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
    ("model",   LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# WRONG: fit imputer before the train/test split
# → uses validation/test medians in training → data leakage
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)   # DO NOT DO THIS before splitting
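
Since `X` and `y` are not defined in the snippet above, here is a self-contained sketch of the leak-free workflow on synthetic data. The data-generating process, the 15% missingness rate, and the split sizes are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic features and a binary target driven by the first two features.
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Knock out about 15% of the entries completely at random.
X[rng.random(X.shape) < 0.15] = np.nan

# Split first, then fit: the imputer only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_tr, y_tr)

# The stored medians match the training-set medians, not the full-data medians.
train_medians = pipeline.named_steps["imputer"].statistics_
print(np.allclose(train_medians, np.nanmedian(X_tr, axis=0)))

print(f"Held-out accuracy: {pipeline.score(X_te, y_te):.3f}")

# In cross-validation the whole pipeline is refitted on each training fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```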

Imputation Comparison

Python

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler   # needed for the "scaler" step below
from sklearn.model_selection import cross_val_score

imputers = {
    "Mean":      SimpleImputer(strategy="mean"),
    "Median":    SimpleImputer(strategy="median"),
    "Constant":  SimpleImputer(strategy="constant", fill_value=0),
    "KNN-5":     KNNImputer(n_neighbors=5),
    "MICE":      IterativeImputer(max_iter=10, random_state=42),
}

for name, imputer in imputers.items():
    pipe = Pipeline([
        ("imputer", imputer),
        ("scaler",  StandardScaler()),
        ("model",   LogisticRegression(max_iter=1000)),
    ])
    cv = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name:<10}: {cv.mean():.3f} ± {cv.std():.3f}")

Choosing an Imputation Strategy

| Situation | Strategy |
|---|---|
| Feature is roughly symmetric (age, weight) | Mean |
| Feature is skewed (creatinine, CRP) | Median |
| Categorical or ordinal | Most frequent |
| Missingness carries signal (MNAR) | Constant (-1 sentinel) + missingness indicator |
| Features are correlated; MAR mechanism | KNN or MICE |
| Limited time; tree-based model | Constant (-1): trees handle it naturally |
| Neural network | KNN or MICE (clean imputation preferred) |
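
When different columns call for different strategies, one way to apply several rows of this table at once is scikit-learn's `ColumnTransformer`. The sketch below is illustrative: the column names, the `smoker` feature, and the values are made up, and the per-column choices simply follow the table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

patients = pd.DataFrame({
    "age": [45.0, 67.0, np.nan, 55.0],                 # roughly symmetric -> mean
    "serum_creatinine": [1.2, np.nan, 0.8, 9.5],       # skewed -> median
    "smoker": pd.Series(["no", "yes", np.nan, "no"], dtype=object),  # categorical -> mode
})

preprocess = ColumnTransformer([
    ("mean", SimpleImputer(strategy="mean"), ["age"]),
    ("median", SimpleImputer(strategy="median"), ["serum_creatinine"]),
    ("mode", SimpleImputer(strategy="most_frequent"), ["smoker"]),
])

# Output columns follow the order of the transformers listed above.
filled = preprocess.fit_transform(patients)
print(filled)
```

The missing age is filled with the mean of the three observed ages, the missing creatinine with the median 1.2 (unaffected by the outlier 9.5), and the missing smoker status with the most common label, "no".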

Interview Answer Template

Q: How do you handle missing values in a dataset?

The first step is to understand why values are missing: MCAR (completely at random, so dropping rows or simple imputation does not bias the results), MAR (missingness depends on observed data, so impute using the correlated features), or MNAR (missingness depends on the missing value itself, so it is a signal rather than random noise). MNAR cannot be confirmed from the observed data alone, so I combine domain knowledge with a check of whether missingness is associated with the target. For features where I suspect MNAR, I add a binary missingness indicator before imputing, because the fact of missingness is predictive. For MCAR/MAR features, I choose the imputation method based on the feature distribution: median for skewed features (outlier-robust), KNN or MICE when features are correlated and the dataset is large enough. The critical rule: fit the imputer on training data only and apply it to the test data; wrapping it in a scikit-learn Pipeline guarantees this in cross-validation. I never drop rows or features without first checking whether the missingness is informative, because removing those rows could bias the dataset toward less-sick patients.
