Interview: Feature Engineering Scenario
Interview walk-through: engineer features from raw EHR data for a 30-day readmission model — covering extraction, transformation, interactions, handling missing values, and validating feature quality.
The Scenario
You're given a raw EHR dataset with admission records for diabetic patients. You need to build a 30-day readmission prediction model. The dataset has: patient demographics, admission/discharge timestamps, a comma-separated medications field, a list of ICD-10 diagnosis codes, and the most recent HbA1c value. What features would you engineer?
Step 1: Understand the Raw Data
import pandas as pd
import numpy as np
# Raw record — what you get from the EHR
raw = pd.DataFrame({
    "patient_id": [1001],
    "admission_date": ["2026-03-15"],
    "discharge_date": ["2026-03-22"],
    "age": [67],
    "gender": ["M"],
    "weight_kg": [94],
    "height_cm": [175],
    "medications": ["metformin 1000mg, lisinopril 10mg, aspirin 81mg, atorvastatin 40mg, insulin glargine"],
    "diagnoses": ["E11.9, I10, N18.3, E78.5"],
    "hba1c": [8.7],
    "serum_creatinine": [1.5],
    "prior_admissions_1yr": [2],
    "readmitted_30d": [1],  # target
})
# Immediately ask:
# 1. What are the data types? (timestamps as strings, lists as strings)
# 2. What's missing? (HbA1c null for many patients)
# 3. What domain knowledge applies? (HbA1c > 8% suggests poor control; creatinine > 1.2 mg/dL can
#    indicate impaired kidney function, though it is not a CKD diagnosis by itself)
print(raw.dtypes)
print(raw.isnull().sum())
Step 2: Extract Features from Timestamps
def extract_time_features(df: pd.DataFrame) -> pd.DataFrame:
    admission = pd.to_datetime(df["admission_date"])
    discharge = pd.to_datetime(df["discharge_date"])
    return df.assign(
        length_of_stay=(discharge - admission).dt.days,
        admission_month=admission.dt.month,
        admission_dayofweek=admission.dt.dayofweek,  # 0=Mon, 6=Sun
        admitted_on_weekend=(admission.dt.dayofweek >= 5).astype(int),
    )
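One caveat worth raising in the interview: `.dt.days` floors to whole days, so if the EHR export carries time-of-day, a short overnight stay comes out as 0. A minimal sketch, assuming hypothetical timestamps with hours and minutes (the `stays` frame and the `los_hours` column are illustrative names, not part of the dataset above):

```python
import pandas as pd

# Hypothetical records with time-of-day, unlike the date-only sample record
stays = pd.DataFrame({
    "admission_date": ["2026-03-15 23:30", "2026-03-15 08:00"],
    "discharge_date": ["2026-03-16 09:30", "2026-03-22 20:00"],
})
admission = pd.to_datetime(stays["admission_date"])
discharge = pd.to_datetime(stays["discharge_date"])
delta = discharge - admission

stays["los_whole_days"] = delta.dt.days               # floors to whole days
stays["los_hours"] = delta.dt.total_seconds() / 3600  # keeps sub-day resolution
stays["admitted_on_weekend"] = (admission.dt.dayofweek >= 5).astype(int)

print(stays[["los_whole_days", "los_hours", "admitted_on_weekend"]])
```

Whether hours or whole days is the better unit depends on how the timestamps are recorded, so it is worth checking a few raw values first.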
# Why length_of_stay? Longer stays → more complex illness → higher readmission risk
# Why admission month? Seasonal infections (flu season) affect readmission
Step 3: Parse and Engineer from Medication List
ANTICOAGULANTS = {"warfarin", "apixaban", "rivaroxaban", "heparin", "enoxaparin"}
ANTIDIABETICS = {"metformin", "insulin", "glargine", "glipizide", "sitagliptin"}
HIGH_RISK_MEDS = {"warfarin", "insulin", "digoxin", "lithium", "methotrexate"}
def parse_medication_features(medications_str: str) -> dict:
    # Treat a missing (non-string) value as "no medications recorded"
    entries = medications_str.split(",") if isinstance(medications_str, str) else []
    # First word of each entry is the drug name; skip empty entries (e.g. a trailing comma)
    meds = [e.split()[0].lower() for e in entries if e.strip()]
    high_risk_count = sum(1 for m in meds if any(hrm in m for hrm in HIGH_RISK_MEDS))
    return {
        "num_medications": len(meds),
        "on_anticoagulant": int(any(m in ANTICOAGULANTS for m in meds)),
        "on_insulin": int(any("insulin" in m for m in meds)),
        "on_metformin": int("metformin" in meds),
        "high_risk_med_count": high_risk_count,
        "med_complexity": len(meds) * (1 + high_risk_count),
    }
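Row-by-row parsing is fine for an interview, but on a large table a column-wise version may be faster. A rough sketch using pandas string methods, assuming a toy `meds_col` Series (hypothetical data) and reusing the same high-risk drug names as above:

```python
import pandas as pd

HIGH_RISK = {"warfarin", "insulin", "digoxin", "lithium", "methotrexate"}

# Toy column (hypothetical); the last row has no medications recorded
meds_col = pd.Series(
    ["metformin 1000mg, insulin glargine", "warfarin 5mg", None],
    name="medications",
)

# One row per medication entry, keeping the original row index
names = (
    meds_col.dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .str.split()
    .str[0]        # first word = drug name; NaN for empty entries
    .str.lower()
    .dropna()
)

per_row = pd.DataFrame({
    "num_medications": names.groupby(level=0).size(),
    "high_risk_med_count": names.isin(HIGH_RISK).groupby(level=0).sum(),
}).reindex(meds_col.index, fill_value=0)  # rows without medications get 0

print(per_row)
```

Note that this sketch matches drug names exactly, whereas the function above uses substring matching for high-risk drugs, so the two can disagree on unusual spellings.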
# High medication count → polypharmacy → higher readmission risk
# High-risk medications require closer monitoring → different readmission dynamics
Step 4: Parse and Engineer from Diagnosis Codes
def parse_diagnosis_features(diagnoses_str: str) -> dict:
    # Normalize case and skip empty entries so an empty string does not count as one diagnosis
    codes = [d.strip().upper() for d in diagnoses_str.split(",") if d.strip()]
    # ICD-10 chapter flags (first character)
    has_endocrine = any(c.startswith("E") for c in codes)    # diabetes, thyroid
    has_circulatory = any(c.startswith("I") for c in codes)  # hypertension, heart
    has_renal = any(c.startswith("N") for c in codes)        # CKD, renal failure
    has_respiratory = any(c.startswith("J") for c in codes)  # COPD, pneumonia
    # Specific high-risk conditions
    ckd = any(c.startswith("N18") for c in codes)
    heart_failure = any(c.startswith("I50") for c in codes)
    copd = any(c.startswith("J44") for c in codes)
    return {
        "num_diagnoses": len(codes),
        "has_ckd": int(ckd),
        "has_heart_failure": int(heart_failure),
        "has_copd": int(copd),
        "num_high_risk_comorbidities": int(ckd) + int(heart_failure) + int(copd),
        "multi_system_disease": int(sum([has_endocrine, has_circulatory, has_renal, has_respiratory]) >= 2),
        # Ad hoc weighted count loosely inspired by the Charlson index, not the validated score
        "charlson_proxy": int(ckd) * 2 + int(heart_failure) * 2 + int(has_circulatory),
    }
Step 5: Transform and Combine Clinical Labs
def engineer_lab_features(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()
    # HbA1c: glycemic control categories (a missing value compares False, so both flags are 0)
    features["hba1c_controlled"] = (features["hba1c"] < 7.0).astype(int)
    features["hba1c_poor_control"] = (features["hba1c"] >= 9.0).astype(int)
    features["hba1c_missing"] = features["hba1c"].isnull().astype(int)
    # Leave hba1c itself as NaN: the SimpleImputer in the Step 7 Pipeline fills it per training fold.
    # Filling with the full-dataset median here would leak validation-fold information.
    # Serum creatinine: crude marker of kidney function (a real eGFR also needs age and sex)
    features["elevated_creatinine"] = (features["serum_creatinine"] > 1.2).astype(int)
    features["creatinine_log"] = np.log1p(features["serum_creatinine"])
    # BMI
    height_m = features["height_cm"] / 100
    features["bmi"] = features["weight_kg"] / height_m ** 2
    features["obese"] = (features["bmi"] >= 30).astype(int)
    return features
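The flag-then-impute pattern can also be expressed with scikit-learn's `SimpleImputer(add_indicator=True)`, which learns the median only from the rows it is fitted on and appends a missingness column. A minimal sketch on made-up HbA1c values (the `train` and `test` frames are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values: the median is learned from `train` only
train = pd.DataFrame({"hba1c": [6.5, np.nan, 9.1, 7.4]})
test = pd.DataFrame({"hba1c": [np.nan, 8.0]})

imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(train)

# Columns of `out`: imputed hba1c, then a 0/1 missingness indicator
out = imputer.transform(test)
print(out)
```

Placed inside a `Pipeline`, the imputer is refitted on each training fold during cross-validation, so the fill value never sees validation rows.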
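# Why log creatinine? Creatinine is right-skewed; a log transform compresses the tail (useful for linear models, less so for tree ensembles)
# Why hba1c_missing flag? Missingness is likely informative (possibly MNAR): whether HbA1c was drawn can depend on how well controlled the patient appeared
Step 6: Create Interaction Features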
def engineer_interactions(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()
    # Age × comorbidity burden (elderly patients with many conditions = higher risk)
    features["age_x_diagnoses"] = features["age"] * features["num_diagnoses"]
    # Polypharmacy + CKD (drug clearance reduced in CKD = higher toxicity risk)
    features["meds_x_ckd"] = features["num_medications"] * features["has_ckd"]
    # Composite readmission risk score (weights are hand-picked heuristics, not fitted)
    features["risk_score"] = (
        features["prior_admissions_1yr"] * 3
        + features["num_diagnoses"]
        + features["has_heart_failure"] * 3
        + features["has_ckd"] * 2
        + features["length_of_stay"]
    )
    # Complex patient flag (multiple high-risk features)
    features["complex_patient"] = (
        (features["num_medications"] >= 10)
        & (features["num_diagnoses"] >= 4)
        & (features["prior_admissions_1yr"] >= 2)
    ).astype(int)
    return features
Step 7: Full Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
    df = extract_time_features(df)
    # Parse string columns into one feature row per admission
    med_features = pd.DataFrame(df["medications"].map(parse_medication_features).tolist(), index=df.index)
    diag_features = pd.DataFrame(df["diagnoses"].map(parse_diagnosis_features).tolist(), index=df.index)
    df = pd.concat([df, med_features, diag_features], axis=1)
    df = engineer_lab_features(df)
    df = engineer_interactions(df)
    # Drop raw columns the model can't use
    drop_cols = ["patient_id", "admission_date", "discharge_date", "medications", "diagnoses"]
    return df.drop(columns=drop_cols, errors="ignore")
# raw_ehr_df is assumed to be the full admissions table, with the same columns as the one-row `raw` example
X = build_feature_matrix(raw_ehr_df)
y = X.pop("readmitted_30d")
# Model
numeric_cols = X.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
What Interviewers Want to Hear
- Domain knowledge drives feature choice — don't just enumerate transformations; explain why each feature predicts readmission
- Handle raw formats — parse comma-separated lists, timestamps, ICD codes, not just numeric columns
- MNAR awareness — flag missingness before imputing when missingness is informative
- Interactions — age × comorbidities, medication count × CKD capture relationships no individual feature captures
- Leakage awareness — fit encoders, imputers, scalers on training data only, via Pipeline
- Validate — measure whether engineered features actually improve CV AUC, not just add features
One-line answer: "I'd extract length of stay from timestamps, parse medications to get count and high-risk flags, map ICD-10 codes to comorbidity indicators, add a missingness flag for HbA1c before imputing, and create interaction features like age × comorbidity count and medication count × CKD. All this goes inside a Pipeline to prevent leakage — then I'd measure CV AUC with and without each group of features to confirm they actually help."
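The "with and without each group of features" comparison can be sketched as a small ablation loop. This is an illustration on synthetic data where the target is deliberately driven by an interaction (an assumption made for the demo), with a linear model so the effect is visible; the column names `age_z`, `dx_z`, and `age_x_dx` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({"age_z": rng.normal(size=n), "dx_z": rng.normal(size=n)})
# Synthetic target: risk depends on the product of the two inputs (demo assumption)
logit = 2.0 * demo["age_z"] * demo["dx_z"]
demo["y"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
demo["age_x_dx"] = demo["age_z"] * demo["dx_z"]  # engineered interaction

feature_groups = {
    "base": ["age_z", "dx_z"],
    "base + interaction": ["age_z", "dx_z", "age_x_dx"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, cols in feature_groups.items():
    model = make_pipeline(StandardScaler(), LogisticRegression())
    scores = cross_val_score(model, demo[cols], demo["y"], cv=cv, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: CV AUC {scores.mean():.3f} ± {scores.std():.3f}")
```

On real EHR data the gain is usually far smaller, and a tree ensemble can pick up some interactions without help, which is exactly why the comparison should be measured instead of assumed.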