Interview: Feature Engineering Scenario

Interview walk-through: engineer features from raw EHR data for a 30-day readmission model — covering extraction, transformation, interactions, handling missing values, and validating feature quality.

Asma Hafeez Khan | May 16, 2026 | 6 min read
Tags: Machine Learning, Interview, Feature Engineering, Clinical AI, EHR

The Scenario

You're given a raw EHR dataset with admission records for diabetic patients. You need to build a 30-day readmission prediction model. The dataset has: patient demographics, admission/discharge timestamps, a comma-separated medications field, a list of ICD-10 diagnosis codes, and the most recent HbA1c value. What features would you engineer?


Step 1: Understand the Raw Data

Python
import pandas as pd
import numpy as np

# Raw record: what you get from the EHR
raw = pd.DataFrame({
    "patient_id":       [1001],
    "admission_date":   ["2026-03-15"],
    "discharge_date":   ["2026-03-22"],
    "age":              [67],
    "gender":           ["M"],
    "weight_kg":        [94],
    "height_cm":        [175],
    "medications":      ["metformin 1000mg, lisinopril 10mg, aspirin 81mg, atorvastatin 40mg, insulin glargine"],
    "diagnoses":        ["E11.9, I10, N18.3, E78.5"],
    "hba1c":            [8.7],
    "serum_creatinine": [1.5],
    "prior_admissions_1yr": [2],
    "readmitted_30d":   [1],   # target
})

# Immediately ask:
# 1. What are the data types? (timestamps as strings, lists as strings)
# 2. What's missing? (HbA1c null for many patients)
# 3. What domain knowledge applies? (e.g. higher HbA1c = poorer glycemic control;
#    creatinine above ~1.2 mg/dL can signal reduced kidney function; confirm cutoffs with clinicians)
print(raw.dtypes)
print(raw.isnull().sum())

Step 2: Extract Features from Timestamps

Python
def extract_time_features(df: pd.DataFrame) -> pd.DataFrame:
    admission = pd.to_datetime(df["admission_date"])
    discharge  = pd.to_datetime(df["discharge_date"])

    return df.assign(
        length_of_stay     = (discharge - admission).dt.days,
        admission_month    = admission.dt.month,
        admission_dayofweek = admission.dt.dayofweek,   # 0=Mon, 6=Sun
        admitted_on_weekend = (admission.dt.dayofweek >= 5).astype(int),
    )

# Why length_of_stay? Longer stays -> more complex illness -> higher readmission risk
# Why admission month? Seasonal infections (flu season) affect readmission
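
As a quick check, the timestamp features for the sample record from Step 1 can be worked out by hand. The sketch below uses only the standard library instead of pandas, and the helper name `time_features` is hypothetical; `weekday()` follows the same 0=Mon, 6=Sun convention as `dt.dayofweek`.

```python
from datetime import date

def time_features(admission_iso: str, discharge_iso: str) -> dict:
    """Hypothetical stdlib mirror of extract_time_features for a single record."""
    admission = date.fromisoformat(admission_iso)
    discharge = date.fromisoformat(discharge_iso)
    dow = admission.weekday()  # 0=Mon, 6=Sun
    return {
        "length_of_stay": (discharge - admission).days,
        "admission_month": admission.month,
        "admission_dayofweek": dow,
        "admitted_on_weekend": int(dow >= 5),
    }

sample = time_features("2026-03-15", "2026-03-22")
print(sample)
```

For the sample record this gives a 7-day stay that began on a Sunday, so the weekend flag is set.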

Step 3: Parse and Engineer from Medication List

Python
ANTICOAGULANTS   = {"warfarin", "apixaban", "rivaroxaban", "heparin", "enoxaparin"}
ANTIDIABETICS    = {"metformin", "insulin", "glargine", "glipizide", "sitagliptin"}
HIGH_RISK_MEDS   = {"warfarin", "insulin", "digoxin", "lithium", "methotrexate"}

def parse_medication_features(medications_str: str) -> dict:
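    # Keep the first token of each entry as the drug name; skip empty entries
    # (e.g. from a trailing comma), which would otherwise raise an IndexError
    meds = [m.split()[0].lower() for m in medications_str.split(",") if m.strip()]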

    return {
        "num_medications":    len(meds),
        "on_anticoagulant":   int(any(m in ANTICOAGULANTS for m in meds)),
        "on_insulin":         int(any("insulin" in m for m in meds)),
        "on_metformin":       int("metformin" in meds),
        "high_risk_med_count": sum(1 for m in meds if any(hrm in m for hrm in HIGH_RISK_MEDS)),
        "med_complexity":     len(meds) * (1 + sum(1 for m in meds if any(hrm in m for hrm in HIGH_RISK_MEDS))),
    }

# High medication count -> polypharmacy -> higher readmission risk
# High-risk medications require closer monitoring -> different readmission dynamics
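
To see what these features look like, here is a worked example on the sample medications string from Step 1. It is a self-contained sketch with a trimmed-down, hypothetical helper `summarize_medications`; it uses exact name matching instead of the substring matching above, and the high-risk set is copied from above as an illustration, not a clinical reference.

```python
HIGH_RISK_MEDS = {"warfarin", "insulin", "digoxin", "lithium", "methotrexate"}

def summarize_medications(medications_str: str) -> dict:
    """Hypothetical helper: the drug name is the first token of each comma-separated entry."""
    names = [
        entry.split()[0].lower()
        for entry in medications_str.split(",")
        if entry.strip()  # ignore empty entries such as a trailing comma
    ]
    high_risk = [name for name in names if name in HIGH_RISK_MEDS]
    return {
        "names": names,
        "num_medications": len(names),
        "high_risk_med_count": len(high_risk),
    }

sample_meds = "metformin 1000mg, lisinopril 10mg, aspirin 81mg, atorvastatin 40mg, insulin glargine"
summary = summarize_medications(sample_meds)
print(summary)
```

Keeping only the first token turns "insulin glargine" into "insulin", which is what lets the insulin and high-risk flags fire. Brand names or combination products would likely need a proper drug dictionary.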

Step 4: Parse and Engineer from Diagnosis Codes

Python
def parse_diagnosis_features(diagnoses_str: str) -> dict:
    codes = [d.strip() for d in diagnoses_str.split(",")]

    # ICD-10 chapter flags (first character)
    has_endocrine  = any(c.startswith("E") for c in codes)   # diabetes, thyroid
    has_circulatory = any(c.startswith("I") for c in codes)  # hypertension, heart
    has_renal      = any(c.startswith("N") for c in codes)   # genitourinary chapter (includes CKD, renal failure)
    has_respiratory = any(c.startswith("J") for c in codes)  # COPD, pneumonia

    # Specific high-risk conditions
    ckd            = any(c.startswith("N18") for c in codes)
    heart_failure  = any(c.startswith("I50") for c in codes)
    copd           = any(c.startswith("J44") for c in codes)

    return {
        "num_diagnoses":         len(codes),
        "has_ckd":               int(ckd),
        "has_heart_failure":     int(heart_failure),
        "has_copd":              int(copd),
        "num_high_risk_comorbidities": int(ckd) + int(heart_failure) + int(copd),
        "multi_system_disease":  int(sum([has_endocrine, has_circulatory, has_renal, has_respiratory]) >= 2),
        "charlson_proxy":        int(ckd) * 2 + int(heart_failure) * 2 + int(has_circulatory),
    }
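
The prefix checks above can also be written as a small lookup table, which makes it easier to add conditions later. This is a sketch, not a validated comorbidity index: the `CONDITION_PREFIXES` mapping and the `flag_conditions` helper are hypothetical names, and the three prefixes are the ones already used above.

```python
# Hypothetical mapping from feature name to ICD-10 code prefixes
CONDITION_PREFIXES = {
    "has_ckd": ("N18",),
    "has_heart_failure": ("I50",),
    "has_copd": ("J44",),
}

def flag_conditions(diagnoses_str: str) -> dict:
    """Return 0/1 flags per condition plus simple counts for one diagnoses string."""
    codes = [c.strip().upper() for c in diagnoses_str.split(",") if c.strip()]
    flags = {
        name: int(any(code.startswith(prefixes) for code in codes))
        for name, prefixes in CONDITION_PREFIXES.items()
    }
    flags["num_diagnoses"] = len(codes)
    flags["num_high_risk_comorbidities"] = sum(flags[name] for name in CONDITION_PREFIXES)
    return flags

sample_flags = flag_conditions("E11.9, I10, N18.3, E78.5")
print(sample_flags)
```

For the sample record only the CKD flag fires (via `N18.3`), so the high-risk comorbidity count is 1 out of 4 diagnoses.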

Step 5: Transform and Combine Clinical Labs

Python
def engineer_lab_features(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()

    # HbA1c: glycemic control categories
    features["hba1c_controlled"]    = (features["hba1c"] < 7.0).astype(int)
    features["hba1c_poor_control"]  = (features["hba1c"] >= 9.0).astype(int)
    features["hba1c_missing"]       = features["hba1c"].isnull().astype(int)
    # Leave the NaNs in place here: the Pipeline's SimpleImputer (Step 7) fills them using
    # training folds only, so the full-dataset median never leaks into validation folds

    # Serum creatinine: eGFR proxy
    features["elevated_creatinine"] = (features["serum_creatinine"] > 1.2).astype(int)
    features["creatinine_log"]      = np.log1p(features["serum_creatinine"])

    # BMI
    height_m = features["height_cm"] / 100
    features["bmi"]      = features["weight_kg"] / height_m ** 2
    features["obese"]    = (features["bmi"] >= 30).astype(int)

    return features

# Why log creatinine? Creatinine is right-skewed -> log transform compresses the long tail
# Why hba1c_missing flag? Missingness is likely informative (MNAR) -> HbA1c is not drawn for
# every patient, so whether it was ordered can itself carry signal
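
The flag-then-impute idea can also be expressed with scikit-learn's `SimpleImputer`, whose `add_indicator=True` option appends a missingness column and whose median is learned from whatever data it is fitted on. The sketch below uses made-up HbA1c values and fits on a training split only, so the test rows are filled with the training median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up HbA1c values for illustration; np.nan marks "not drawn"
train_hba1c = np.array([[6.5], [8.7], [np.nan], [9.4], [7.1]])
test_hba1c = np.array([[np.nan], [10.2]])

# add_indicator=True appends a 0/1 column marking which rows were missing
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(train_hba1c)  # median learned from training rows only
test_out = imputer.transform(test_hba1c)        # test rows reuse the training median

print("training median:", imputer.statistics_[0])
print(test_out)  # columns: imputed value, missingness indicator
```

Inside a `Pipeline`, the same fit/transform split happens automatically for each cross-validation fold.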

Step 6: Create Interaction Features

Python
def engineer_interactions(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()

    # Age × comorbidity burden (elderly patients with many conditions = higher risk)
    features["age_x_diagnoses"]   = features["age"] * features["num_diagnoses"]

    # Polypharmacy + CKD (drug clearance reduced in CKD = higher toxicity risk)
    features["meds_x_ckd"]        = features["num_medications"] * features["has_ckd"]

    # High readmission risk composite score
    features["risk_score"] = (
        features["prior_admissions_1yr"] * 3
        + features["num_diagnoses"]
        + features["has_heart_failure"] * 3
        + features["has_ckd"] * 2
        + features["length_of_stay"]
    )

    # Complex patient flag (multiple high-risk features)
    features["complex_patient"] = (
        (features["num_medications"] >= 10) &
        (features["num_diagnoses"] >= 4) &
        (features["prior_admissions_1yr"] >= 2)
    ).astype(int)

    return features

Step 7: Full Pipeline

Python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
    df = extract_time_features(df)

    # Parse string columns
    med_features  = df["medications"].apply(lambda s: pd.Series(parse_medication_features(s)))
    diag_features = df["diagnoses"].apply(lambda s: pd.Series(parse_diagnosis_features(s)))

    df = pd.concat([df, med_features, diag_features], axis=1)
    df = engineer_lab_features(df)
    df = engineer_interactions(df)

    # Drop raw columns the model can't use
    drop_cols = ["patient_id", "admission_date", "discharge_date", "medications", "diagnoses"]
    return df.drop(columns=drop_cols, errors="ignore")

# raw_ehr_df: the full EHR extract (many rows) with the same columns as `raw` in Step 1
X = build_feature_matrix(raw_ehr_df)
y = X.pop("readmitted_30d")

# Model
numeric_cols     = X.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

What Interviewers Want to Hear

  1. Domain knowledge drives feature choice — don't just enumerate transformations; explain why each feature predicts readmission
  2. Handle raw formats — parse comma-separated lists, timestamps, ICD codes, not just numeric columns
  3. MNAR awareness — flag missingness before imputing when missingness is informative
  4. Interactions — age × comorbidities, medication count × CKD capture relationships no individual feature captures
  5. Leakage awareness — fit encoders, imputers, scalers on training data only, via Pipeline
  6. Validate — measure whether engineered features actually improve CV AUC, not just add features
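
Point 6 can be checked with a simple ablation: score the model with and without a feature group under the same cross-validation splits. The sketch below uses synthetic data as a stand-in for real EHR features (the arrays `baseline` and `engineered` are made up, and the target is constructed to depend on both), so the numbers mean nothing clinically; only the comparison pattern carries over.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
baseline = rng.normal(size=(n, 3))    # stand-in for baseline features
engineered = rng.normal(size=(n, 2))  # stand-in for one engineered feature group

# Synthetic target that depends on one baseline and one engineered column
signal = baseline[:, 0] + 1.5 * engineered[:, 0] + rng.normal(scale=0.5, size=n)
y = (signal > 0).astype(int)

def cv_auc(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-fold ROC AUC using identical folds and model settings for every feature set."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

auc_base = cv_auc(baseline, y)
auc_full = cv_auc(np.hstack([baseline, engineered]), y)

print(f"baseline only:     {auc_base.mean():.3f} ± {auc_base.std():.3f}")
print(f"with feature group: {auc_full.mean():.3f} ± {auc_full.std():.3f}")
```

If a feature group does not move the cross-validated AUC beyond fold-to-fold noise, that is a reasonable argument for dropping it.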

One-line answer: "I'd extract length of stay from timestamps, parse medications to get count and high-risk flags, map ICD-10 codes to comorbidity indicators, add a missingness flag for HbA1c before imputing, and create interaction features like age × comorbidity count and medication count × CKD. All this goes inside a Pipeline to prevent leakage — then I'd measure CV AUC with and without each group of features to confirm they actually help."
