Handling Missing Values in ML
Complete guide to missing data: MCAR/MAR/MNAR mechanisms, imputation strategies (mean, median, mode, model-based, KNN), when to use each, and how to avoid data leakage in imputation.
Why Missing Values Happen
MCAR: Missing Completely At Random
Example: lab test not ordered (random clinician variation)
Impact: dropping rows does not bias estimates but discards data; imputation recovers efficiency
Detectable: missingness is uncorrelated with every other variable
MAR: Missing At Random (conditional on observed data)
Example: creatinine not measured for young, healthy patients
Impact: imputation conditioned on age/health status is valid
Detectable: missingness is correlated with other observed variables
MNAR: Missing Not At Random
Example: extremely high creatinine values not reported (too sick to draw blood)
Impact: simple imputation is biased because missingness carries signal
Fix: model the missingness explicitly, or use a "missingness indicator" feature
Checking for Missing Values
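To make the three mechanisms concrete, the sketch below simulates each one on synthetic data and compares the mean of the observed creatinine values with the true mean. Everything in it is an assumption made for illustration: the age range, the linear age-creatinine relationship, and the missingness probabilities are invented, not clinical estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic cohort (illustrative numbers only, not clinical estimates)
age = rng.uniform(20, 80, size=n)
creatinine = 0.6 + 0.01 * age + rng.normal(0, 0.2, size=n)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.30

# MAR: missingness depends only on an observed variable (age)
mar_missing = rng.random(n) < np.where(age < 40, 0.60, 0.10)

# MNAR: missingness depends on the unobserved value itself
high = creatinine > np.quantile(creatinine, 0.90)
mnar_missing = rng.random(n) < np.where(high, 0.70, 0.10)

true_mean = creatinine.mean()
observed_means = {
    "MCAR": creatinine[~mcar_missing].mean(),
    "MAR": creatinine[~mar_missing].mean(),
    "MNAR": creatinine[~mnar_missing].mean(),
}

print(f"true mean: {true_mean:.3f}")
for name, m in observed_means.items():
    print(f"{name}: observed mean={m:.3f}, bias={m - true_mean:+.3f}")
```

Under these assumptions the MCAR mean should stay close to the truth, the MAR mean should drift upward (young, low-creatinine patients are under-observed, which conditioning on age can correct), and the MNAR mean should drift downward in a way the observed columns cannot fully correct.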
import pandas as pd
import numpy as np

# Example: EHR dataset with clinical labs
df = pd.DataFrame({
    "age": [45, 67, 32, np.nan, 55],
    "serum_creatinine": [1.2, np.nan, 0.8, 2.1, 1.5],
    "hba1c": [7.2, 8.1, np.nan, np.nan, 6.9],
    "num_medications": [5, 12, 3, 8, np.nan],
    "readmitted": [0, 1, 0, 1, 0],
})

# Overview
print(df.isnull().sum())            # Count missing per feature
print(df.isnull().mean().round(3))  # Fraction missing per feature

# Is missingness correlated with the target?
for col in df.columns[:-1]:
    missing_mask = df[col].isnull()
    if missing_mask.any():
        readmit_rate_missing = df.loc[missing_mask, "readmitted"].mean()
        readmit_rate_present = df.loc[~missing_mask, "readmitted"].mean()
        print(f"{col}: missing readmit_rate={readmit_rate_missing:.2f}, "
              f"present readmit_rate={readmit_rate_present:.2f}")
# If the rates differ, missingness is informative about the target (possibly MNAR,
# though observed data alone cannot confirm it) → consider a missingness indicator feature

Imputation Strategies
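Dropping incomplete rows is the baseline that every imputation strategy competes with. A quick sketch on the toy EHR table from the previous section shows why it is rarely attractive once several columns have gaps: no single column is more than 40% missing, yet only one of the five rows is complete.

```python
import numpy as np
import pandas as pd

# Same toy EHR table as in the previous section
df = pd.DataFrame({
    "age": [45, 67, 32, np.nan, 55],
    "serum_creatinine": [1.2, np.nan, 0.8, 2.1, 1.5],
    "hba1c": [7.2, 8.1, np.nan, np.nan, 6.9],
    "num_medications": [5, 12, 3, 8, np.nan],
    "readmitted": [0, 1, 0, 1, 0],
})

per_column = df.isnull().mean()  # fraction missing per column
complete = df.dropna()           # listwise deletion: keep fully observed rows

print(f"max fraction missing in any column: {per_column.max():.0%}")
print(f"rows kept after dropna(): {len(complete)} of {len(df)}")
```

Even when the mechanism is MCAR and deletion is unbiased, the loss of sample size compounds across columns, which is the efficiency argument for imputing instead.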
Simple Imputation
from sklearn.impute import SimpleImputer
import numpy as np

X_train = np.array([
    [45, 1.2, 7.2, 5],
    [67, np.nan, 8.1, 12],
    [32, 0.8, np.nan, 3],
    [np.nan, 2.1, 8.8, 8],
])

# Mean imputation (for symmetric distributions)
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X_train)

# Median imputation (for skewed distributions, outliers)
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X_train)

# Most frequent (for categorical or ordinal features)
mode_imputer = SimpleImputer(strategy="most_frequent")

# Constant (e.g., -1 as "not measured" sentinel)
const_imputer = SimpleImputer(strategy="constant", fill_value=-1)
X_const = const_imputer.fit_transform(X_train)

print("After median imputation:")
print(X_median.round(2))

KNN Imputation
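What KNN imputation computes can be reproduced by hand for a single missing entry: find the rows that have the feature observed, rank them by NaN-aware Euclidean distance to the incomplete row, and average the feature over the k closest. The sketch below does this on a hypothetical toy matrix (not the article's data) and compares the result with KNNImputer under its default uniform weights.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical toy matrix: row 3 is missing its last feature
X = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 30.0],
    [0.9, 1.9, np.nan],
    [5.2, 6.1, 32.0],
])
row, col, k = 3, 2, 2

# Donors: rows where the target feature is observed
donors = np.flatnonzero(~np.isnan(X[:, col]))

# NaN-aware Euclidean distance from the incomplete row to each donor
dist = nan_euclidean_distances(X[[row]], X[donors])[0]

# Average the feature over the k closest donors (uniform weights)
nearest = donors[np.argsort(dist)[:k]]
manual = X[nearest, col].mean()

library = KNNImputer(n_neighbors=k).fit_transform(X)[row, col]
print(f"manual={manual:.2f}, KNNImputer={library:.2f}")
```

Rows 0 and 1 are the two closest donors here, so both values come out as the mean of 10 and 12.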
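from sklearn.impute import KNNImputer

# Reuses X_train from the Simple Imputation example
# Impute based on the k nearest neighbors (using observed features)
# Better for MCAR/MAR when patient similarity is informative
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X_train)
print("After KNN imputation:")
print(X_knn.round(2))
# Each missing value is replaced by the mean of that feature over the nearest
# neighbors that have it observed (uniform weights by default;
# weights="distance" gives a distance-weighted average).
# This 4-row toy matrix has fewer than 5 donors, so all available donors are averaged.

Iterative Imputation (MICE)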
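Conceptually, iterative imputation starts by filling each gap with its column mean, then repeatedly regresses each incomplete feature on the others and overwrites the originally missing entries with the predictions. The following is a deliberately simplified sketch of that loop. The function name chained_impute and the synthetic data are hypothetical, and scikit-learn's IterativeImputer differs in details such as feature ordering, stopping criteria, and default estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_impute(X: np.ndarray, n_rounds: int = 5) -> np.ndarray:
    """Simplified round-robin regression imputation (illustrative only)."""
    X = X.astype(float)
    missing = np.isnan(X)
    # Step 1: initialise every gap with its column mean
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            observed = ~missing[:, j]
            # Step 2: regress feature j on the others, using rows where j is observed
            model = LinearRegression().fit(others[observed], X[observed, j])
            # Step 3: overwrite only the originally missing entries
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Synthetic correlated features (assumption for illustration)
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X_full = np.column_stack([
    x0,
    2 * x0 + rng.normal(0, 0.1, size=200),
    -x0 + rng.normal(0, 0.1, size=200),
])

# Knock out about 20% of the entries in columns 1 and 2
holes = np.zeros(X_full.shape, dtype=bool)
holes[:, 1] = rng.random(200) < 0.2
holes[:, 2] = rng.random(200) < 0.2
X_missing = np.where(holes, np.nan, X_full)

X_filled = chained_impute(X_missing)
rmse = np.sqrt(np.mean((X_filled[holes] - X_full[holes]) ** 2))
print(f"RMSE on the knocked-out entries: {rmse:.3f}")
```

Because the features are strongly correlated in this synthetic setup, the regression-based fill should land far closer to the hidden values than a column mean would.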
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Reuses X_train from the Simple Imputation example
# Models each feature as a function of all others, in round-robin fashion
# (returns a single completed dataset by default)
# Best for MAR: uses correlations between features to impute
mice_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
)
X_mice = mice_imputer.fit_transform(X_train)
print("After MICE imputation:")
print(X_mice.round(2))

The MNAR Fix: Missingness Indicators
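scikit-learn's imputers accept add_indicator=True, which appends one binary column for each feature that had missing values at fit time. A minimal sketch on the small matrix used in the imputation examples; the pandas helper that follows does the same job by hand and keeps readable column names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([
    [45, 1.2, 7.2, 5],
    [67, np.nan, 8.1, 12],
    [32, 0.8, np.nan, 3],
    [np.nan, 2.1, 8.8, 8],
])

# Median-impute and append a 0/1 flag per feature that contained NaNs
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_flagged = imputer.fit_transform(X_train)

print(X_flagged.shape)    # 4 original features + 3 indicator columns
print(X_flagged[:, 4:])   # flags for features 0, 1 and 2 (feature 3 has no NaNs)
```

Since the flags are produced by the fitted imputer, they are computed consistently for training and test data when the imputer sits inside a Pipeline.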
import pandas as pd
import numpy as np

def add_missingness_indicators(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """
    For MNAR features: add a binary flag 'col_missing' before imputing.
    The flag records that the value was missing, which is a signal in itself.
    """
    df = df.copy()
    for col in cols:
        if df[col].isnull().any():
            df[f"{col}_missing"] = df[col].isnull().astype(int)
    return df

# Clinical example: extreme creatinine values may not be drawn (patient too sick)
# → high creatinine and missingness are correlated
# → add a creatinine_missing indicator before imputing
# (df is the toy EHR DataFrame from "Checking for Missing Values")
df_with_flags = add_missingness_indicators(
    df,
    cols=["serum_creatinine", "hba1c"]
)
print(df_with_flags.columns.tolist())
# → [..., 'serum_creatinine_missing', 'hba1c_missing']

Imputation in a Pipeline (No Leakage)
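The rule behind this section can be seen with a plain train/test split: the imputer's learned medians (its statistics_ attribute) should come from the training rows alone, and the test rows should be filled with those same values. A minimal sketch, using synthetic data from make_classification with randomly injected gaps as a stand-in for a real dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries at random

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Correct: the medians are learned from the training rows only
imputer = SimpleImputer(strategy="median").fit(X_tr)
X_tr_imputed = imputer.transform(X_tr)
X_te_imputed = imputer.transform(X_te)  # test rows reuse the training medians

train_medians = imputer.statistics_
full_medians = np.nanmedian(X, axis=0)  # what a leaky fit on all rows would learn

print("train-only medians:", train_medians.round(3))
print("full-data medians: ", full_medians.round(3))
```

The two sets of medians generally differ slightly; that difference is information from the test rows that a leaky fit would pass into training.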
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X (feature matrix containing NaNs) and y (binary labels) are assumed to be defined

# The imputer must be fitted on training data only
# A Pipeline handles this correctly in cross-validation
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# WRONG: fit imputer before the train/test split
# → uses validation/test medians in training → data leakage
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)  # DO NOT DO THIS before splitting

Imputation Comparison
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# X (feature matrix containing NaNs) and y (binary labels) are assumed to be defined

imputers = {
    "Mean": SimpleImputer(strategy="mean"),
    "Median": SimpleImputer(strategy="median"),
    "Constant": SimpleImputer(strategy="constant", fill_value=0),
    "KNN-5": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(max_iter=10, random_state=42),
}

# Same scaler and model for every imputer, so only the imputation step varies
for name, imputer in imputers.items():
    pipe = Pipeline([
        ("imputer", imputer),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    cv = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name:<10}: {cv.mean():.3f} ± {cv.std():.3f}")

Choosing an Imputation Strategy
| Situation | Strategy |
|---|---|
| Feature is roughly symmetric (age, weight) | Mean |
| Feature is skewed (creatinine, CRP) | Median |
| Categorical or ordinal | Most frequent |
| Missingness carries signal (MNAR) | Constant (-1 sentinel) + missingness indicator |
| Features are correlated; MAR mechanism | KNN or MICE |
| Limited time; tree-based model | Constant (-1); trees handle it naturally |
| Neural network | KNN or MICE (clean imputation preferred) |
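Because the right strategy differs by feature, the table's choices can be applied per column in one preprocessing step. The sketch below uses scikit-learn's ColumnTransformer for this. The column names and values are hypothetical, and treating troponin as MNAR is an assumption made only to exercise the sentinel-plus-indicator row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical columns, one per table row being illustrated
df = pd.DataFrame({
    "age": [40.0, 50.0, np.nan, 60.0],        # roughly symmetric → mean
    "creatinine": [0.8, 1.0, 5.0, np.nan],    # skewed → median
    "asa_class": [2.0, 2.0, np.nan, 3.0],     # ordinal → most frequent
    "troponin": [np.nan, 0.01, np.nan, 0.5],  # assumed MNAR → sentinel + flag
})

preprocess = ColumnTransformer([
    ("mean", SimpleImputer(strategy="mean"), ["age"]),
    ("median", SimpleImputer(strategy="median"), ["creatinine"]),
    ("mode", SimpleImputer(strategy="most_frequent"), ["asa_class"]),
    ("sentinel",
     SimpleImputer(strategy="constant", fill_value=-1, add_indicator=True),
     ["troponin"]),
])

# Output columns: age, creatinine, asa_class, troponin, troponin-missing flag
X_out = preprocess.fit_transform(df)
print(X_out)
```

Placed as the first step of a Pipeline, this transformer is refit on each training fold, so the per-column statistics stay free of leakage.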
Interview Answer Template
Q: How do you handle missing values in a dataset?
The first step is to understand why values are missing: MCAR (completely random, so imputation is fine), MAR (missing conditional on observed data, so use correlated features to impute), or MNAR (missing because of the value itself, which makes missingness a signal rather than random noise). For MNAR features, I add a binary missingness indicator before imputing, because the fact of missingness is predictive. For MCAR/MAR features, I choose the imputation method based on the feature distribution: median for skewed features (outlier-robust), KNN or MICE when features are correlated and the dataset is large enough. The critical rule: fit the imputer on training data only and apply it to the test set; using a sklearn Pipeline guarantees this in cross-validation. I never drop rows or features without first checking whether missingness is MNAR, because removing those rows could bias the dataset toward less-sick patients.