Machine Learning Foundations · Lesson 36 of 70

What is Feature Engineering?

What Feature Engineering Is

Feature engineering is the process of transforming raw data into inputs that a model can learn from effectively. It sits between raw data and model training.

Raw data → [Feature Engineering] → Feature matrix → [Model] → Predictions

A well-chosen feature can let a simple model outperform a complex model trained on raw inputs. In practice, feature engineering is often the highest-leverage step in an ML project.

Types of Feature Engineering

1. Extraction      — derive numeric signals from complex raw data
                     (count medications from a list, extract year from timestamp)

2. Transformation  — change the distribution or scale of existing features
                     (log-transform skewed values, bin age into categories)

3. Interaction     — combine features to capture relationships
                     (age × medication count, creatinine × weight)

4. Aggregation     — summarize sequences or groups into scalar features
                     (mean INR over past 6 months, admission count in past year)

5. Encoding        — convert categorical data to numeric form
                     (one-hot, ordinal, target encoding)

6. Imputation      — fill in missing values with meaningful estimates
                     (median for lab values, -1 sentinel for "not measured")
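
Encoding and imputation from the list above can be sketched in a few lines of pandas. This is a minimal illustration, not a recommended pipeline: the table, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "admission_type": ["emergency", "elective", "emergency", "urgent"],
    "ckd_stage":      ["stage 1", "stage 3", "stage 2", "stage 1"],
    "potassium":      [4.1, np.nan, 5.2, 3.9],
    "troponin":       [np.nan, 0.02, np.nan, 0.45],
})

# Encoding, one-hot: one indicator column per category (no implied order)
one_hot = pd.get_dummies(df["admission_type"], prefix="adm")

# Encoding, ordinal: map ordered categories to integers
stage_order = {"stage 1": 1, "stage 2": 2, "stage 3": 3}
df["ckd_stage_ord"] = df["ckd_stage"].map(stage_order)

# Imputation, median: fill a routinely measured lab with the column median
df["potassium_filled"] = df["potassium"].fillna(df["potassium"].median())

# Imputation, sentinel plus indicator: "not measured" can itself be a signal
df["troponin_measured"] = df["troponin"].notna().astype(int)
df["troponin_filled"] = df["troponin"].fillna(-1)

print(pd.concat([df, one_hot], axis=1))
```

In a real project the median would be computed on the training split only and then reused for validation and test data, so that no information leaks across the split.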

A Motivating Clinical Example

Python

import pandas as pd
import numpy as np

# Raw EHR data for a diabetic patient
raw_record = {
    "admission_date":   "2026-03-15",
    "discharge_date":   "2026-03-22",
    "medications":      ["metformin 1000mg", "lisinopril 10mg", "aspirin 81mg", "atorvastatin 40mg"],
    "diagnoses":        ["T2DM", "HTN", "HLD", "CKD stage 2"],
    "hba1c_history":    [8.2, 7.9, 9.1, 8.7],   # last 4 readings
    "age":              67,
    "weight_kg":        94,
    "serum_creatinine": 1.5,
    "prior_admissions": 2,
}

def engineer_features(record: dict) -> dict:
    admission = pd.to_datetime(record["admission_date"])
    discharge  = pd.to_datetime(record["discharge_date"])

    return {
        # Extraction
        "length_of_stay":      (discharge - admission).days,
        "num_medications":     len(record["medications"]),
        "num_diagnoses":       len(record["diagnoses"]),
        "on_metformin":        any("metformin" in m for m in record["medications"]),
        "on_anticoagulant":    any("warfarin" in m or "apixaban" in m for m in record["medications"]),

        # Aggregation from history
        "hba1c_mean":          np.mean(record["hba1c_history"]),
        "hba1c_max":           np.max(record["hba1c_history"]),
        "hba1c_trend":         record["hba1c_history"][-1] - record["hba1c_history"][0],
        "hba1c_unstable":      np.std(record["hba1c_history"]) > 0.5,

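        # Transformation
        # Crude inverse-creatinine proxy for kidney function, not a validated eGFR equation
        "eGFR_estimated":      140 / record["serum_creatinine"],
        # Height is not in the raw record, so 1.75 m is used as an assumed default
        "bmi":                 record["weight_kg"] / (record.get("height_m", 1.75) ** 2),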

        # Interaction
        "age_x_ckd":           record["age"] * int("CKD" in " ".join(record["diagnoses"])),
        "complexity_score":    len(record["medications"]) + len(record["diagnoses"]) * 2,

        # Raw passthrough
        "age":                 record["age"],
        "prior_admissions":    record["prior_admissions"],
    }

features = engineer_features(raw_record)
for k, v in features.items():
    print(f"  {k:<25}: {v}")
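
The HbA1c aggregations above work on one patient's list of readings. When labs arrive as a long table with one row per measurement, the same features can be computed per patient with a groupby. The sketch below assumes a hypothetical `labs` table; the column names `patient_id`, `taken_at`, and `hba1c` are illustrative, not from a real schema.

```python
import pandas as pd

# Hypothetical long-format lab table: one row per measurement.
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2, 2],
    "taken_at": pd.to_datetime([
        "2025-06-01", "2025-09-01", "2025-12-01", "2026-03-01",
        "2025-08-15", "2025-11-15", "2026-02-15",
    ]),
    "hba1c": [8.2, 7.9, 9.1, 8.7, 6.4, 6.6, 6.5],
})

# Sort so that "first" and "last" mean the oldest and newest reading
labs = labs.sort_values(["patient_id", "taken_at"])

# Named aggregation: one output column per summary statistic
hba1c_features = labs.groupby("patient_id")["hba1c"].agg(
    hba1c_mean="mean",
    hba1c_max="max",
    hba1c_first="first",
    hba1c_last="last",
)

# Trend as newest minus oldest, matching the single-record version
hba1c_features["hba1c_trend"] = (
    hba1c_features["hba1c_last"] - hba1c_features["hba1c_first"]
)

print(hba1c_features.round(2))
```

The result has one row per patient, which is the shape a feature matrix needs.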

Why Feature Engineering Often Beats Model Selection

Python

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Experiment: raw features vs engineered features, simple vs complex model

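X_raw        = ...   # 5 raw clinical features
X_engineered = ...   # 20 engineered features from same raw data
y            = ...   # binary outcome labels, one per row

# max_iter is raised so logistic regression converges on unscaled features
for model_name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("GBM", GradientBoostingClassifier()),
]: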
    raw_cv  = cross_val_score(model, X_raw,        y, cv=5, scoring="roc_auc")
    eng_cv  = cross_val_score(model, X_engineered, y, cv=5, scoring="roc_auc")
    print(f"{model_name}")
    print(f"  Raw features:        {raw_cv.mean():.3f}")
    print(f"  Engineered features: {eng_cv.mean():.3f}")

# Illustrative pattern (exact numbers depend on the dataset):
# Logistic + raw:        0.63
# GBM + raw:             0.71
# Logistic + engineered: 0.74   ← better than GBM on raw features
# GBM + engineered:      0.82   ← best (both good model and good features)
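
The experiment above leaves its data elided. A fully runnable version of the same idea can be sketched on synthetic data. Everything here is made up for illustration: a single standardized lab value whose risk is U-shaped (high when the value is far from normal in either direction), and one engineered feature, the squared value, that exposes that shape to a linear model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic standardized lab value; risk rises with distance from 0.
lab = rng.normal(size=n)
logit = 2.0 * (lab ** 2 - 1.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_raw = lab.reshape(-1, 1)                 # raw value only
X_eng = np.column_stack([lab, lab ** 2])   # raw value plus squared value

model = LogisticRegression(max_iter=1000)
auc_raw = cross_val_score(model, X_raw, y, cv=5, scoring="roc_auc").mean()
auc_eng = cross_val_score(model, X_eng, y, cv=5, scoring="roc_auc").mean()

print(f"Logistic + raw:        {auc_raw:.3f}")
print(f"Logistic + engineered: {auc_eng:.3f}")
```

On the raw value alone, logistic regression should score close to chance, because a straight line cannot represent a U-shaped risk. With the squared feature added, the same model can rank patients well. A tree ensemble could learn the shape on its own, which is why the gap between raw and engineered features tends to be largest for simple models.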

Log Transformation for Skewed Features

Python

# Many lab values are right-skewed (roughly log-normal): most patients are low,
# a few are very high (creatinine, CRP, troponin, LDH)

import numpy as np

crp_values = np.array([1.2, 1.5, 2.1, 0.8, 45.0, 3.2, 1.9, 120.0, 2.4, 1.1])

# Raw: skewed — a few high values dominate
# Log: compressed, more Gaussian-like
log_crp = np.log1p(crp_values)   # log1p = log(1+x), handles 0 safely

print("Raw CRP:    mean={:.1f}, std={:.1f}".format(crp_values.mean(), crp_values.std()))
print("Log CRP:    mean={:.2f}, std={:.2f}".format(log_crp.mean(), log_crp.std()))
# The log transform shrinks std relative to the mean, so a few extreme values
# no longer dominate the fit of a linear model
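
One way to check the "more Gaussian-like" claim is to compare sample skewness before and after the transform. The sketch below reuses the same ten CRP values and relies on pandas' `Series.skew()`; it also shows that `np.expm1` inverts `np.log1p`, so the original scale can be recovered.

```python
import numpy as np
import pandas as pd

crp_values = pd.Series([1.2, 1.5, 2.1, 0.8, 45.0, 3.2, 1.9, 120.0, 2.4, 1.1])
log_crp = np.log1p(crp_values)

# Skewness near 0 means roughly symmetric; large positive means a long right tail
raw_skew = crp_values.skew()
log_skew = log_crp.skew()
print(f"Skewness raw: {raw_skew:.2f}")
print(f"Skewness log: {log_skew:.2f}")

# expm1 undoes log1p, recovering the original values
restored = np.expm1(log_crp)
print(np.allclose(restored, crp_values))
```

With only ten values and two extreme readings, the log-transformed data remains somewhat right-skewed; the transform reduces skew rather than removing it.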

Binning / Bucketization

Python

# Convert continuous features to ordinal categories
# Useful when the relationship is non-linear and you want to capture it simply

import pandas as pd

ages = pd.Series([22, 35, 47, 61, 73, 55, 28, 80])

# Age groups used in clinical risk stratification
age_bins   = [0, 40, 55, 65, 100]
age_labels = ["young", "middle", "senior", "elderly"]

age_group = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_group.tolist())

# Illustrative creatinine bands only: real CKD staging is based on eGFR
# (and albuminuria), not on raw creatinine cut-offs
creatinine = pd.Series([0.8, 1.1, 1.8, 3.2, 5.5])
ckd_stage = pd.cut(creatinine, bins=[0, 1.2, 1.5, 2.0, 3.0, 20],
                   labels=["normal", "stage1", "stage2", "stage3", "stage4-5"])
print(ckd_stage.tolist())
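
Bin edges deserve a second look. By default `pd.cut` uses right-closed intervals and returns NaN for anything outside the outer edges, so an age of 0 or 101 silently becomes missing with the bins above. The sketch below shows that behavior on a few hypothetical edge-case ages, along with one possible alternative: left-closed intervals and an open-ended top bin.

```python
import numpy as np
import pandas as pd

edge_ages = pd.Series([0, 40, 41, 101])
labels = ["young", "middle", "senior", "elderly"]

# Default: right-closed intervals (0, 40], (40, 55], (55, 65], (65, 100]
default_cut = pd.cut(edge_ages, bins=[0, 40, 55, 65, 100], labels=labels)
print(default_cut.tolist())   # [nan, 'young', 'middle', nan]

# Left-closed intervals with an open-ended top bin: [0, 40), ..., [65, inf)
open_cut = pd.cut(
    edge_ages, bins=[0, 40, 55, 65, np.inf], labels=labels, right=False
)
print(open_cut.tolist())      # ['young', 'middle', 'middle', 'elderly']
```

Which side of a boundary a value such as 40 falls on is a modeling decision; the point is to make it deliberately rather than inherit the default.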

Feature Engineering for NLP / LLM Pipelines

Python

# For text inputs to LLMs, feature engineering means:
# 1. Prompt construction — what information to include and in what format
# 2. Metadata as structured context — appending patient demographics to clinical notes
# 3. Retrieved context — RAG: retrieved relevant documents as additional features

def build_clinical_prompt(note: str, patient_meta: dict, retrieved_docs: list[str]) -> str:
    """
    Engineers the 'features' for an LLM: the full prompt context.
    """
    meta_str = (
        f"Patient: age {patient_meta['age']}, "
        f"{patient_meta['num_medications']} medications, "
        f"{'diabetic' if patient_meta['is_diabetic'] else 'non-diabetic'}"
    )
    context = "\n".join(f"- {doc}" for doc in retrieved_docs[:3])

    return f"""## Patient Context
{meta_str}

## Relevant Guidelines
{context}

## Clinical Note
{note}

## Task
Identify potential drug interactions and flag high-risk medications."""

# The "feature matrix" here is the prompt — engineering it is the same discipline

Interview Answer Template

Q: What is feature engineering and why does it matter?

Feature engineering is the process of transforming raw data into inputs a model can learn from. Raw data (timestamps, free text, IDs, lists of medications) is rarely in a form a model can use directly. Feature engineering extracts signals: length of stay from admission and discharge dates, medication count from a list, HbA1c trend from a history of values, BMI from weight and height. It also transforms distributions (log-scaling skewed lab values), creates interactions (age times comorbidity count), aggregates sequences (mean INR over 6 months), encodes categorical values, and handles missing data. On tabular problems it often has more impact than model selection: a logistic regression with well-engineered features can outperform gradient boosting on raw features. In LLM applications, prompt construction plays the same role: deciding what context to include and how to structure it determines what the model has to work with.
