What is Feature Engineering?
Feature engineering fundamentals: transforming raw data into model-ready inputs, types of feature engineering, domain-driven feature creation, and why it often matters more than model choice.
What Feature Engineering Is
Feature engineering is the process of transforming raw data into inputs that a model can learn from effectively. It sits between raw data and model training.
Raw data → [Feature Engineering] → Feature matrix → [Model] → Predictions
A well-chosen feature can make a simple model outperform a complex model on raw inputs. Feature engineering is often where the most leverage is in a practical ML project.
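As a minimal sketch of the pipeline above, the snippet below turns a list of raw records into a feature matrix with a fixed column order. The record fields, the `to_features` helper, and the `FEATURE_NAMES` list are hypothetical and chosen only for illustration. It uses the standard library so it runs without extra dependencies.

```python
from datetime import date

# Hypothetical raw records: dates and lists, not yet usable by a model
raw_records = [
    {"admitted": "2026-01-03", "discharged": "2026-01-05",
     "medications": ["metformin"]},
    {"admitted": "2026-02-10", "discharged": "2026-02-19",
     "medications": ["warfarin", "aspirin", "lisinopril"]},
]

FEATURE_NAMES = ["length_of_stay", "num_medications"]  # fixed column order


def to_features(record: dict) -> dict:
    """Map one raw record to named numeric features."""
    admitted = date.fromisoformat(record["admitted"])
    discharged = date.fromisoformat(record["discharged"])
    return {
        "length_of_stay": (discharged - admitted).days,
        "num_medications": len(record["medications"]),
    }


# Feature matrix: one row per record, one column per feature
feature_matrix = [
    [to_features(r)[name] for name in FEATURE_NAMES] for r in raw_records
]
print(feature_matrix)
```

Keeping the column order in one place (`FEATURE_NAMES`) is a small design choice that helps training and prediction code agree on what each column means.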
Types of Feature Engineering
1. Extraction: derive numeric signals from complex raw data
   (count medications from a list, extract year from timestamp)
2. Transformation: change the distribution or scale of existing features
   (log-transform skewed values, bin age into categories)
3. Interaction: combine features to capture relationships
   (age × medication count, creatinine × weight)
4. Aggregation: summarize sequences or groups into scalar features
   (mean INR over past 6 months, admission count in past year)
5. Encoding: convert categorical data to numeric form
   (one-hot, ordinal, target encoding)
6. Imputation: fill in missing values with meaningful estimates
   (median for lab values, -1 sentinel for "not measured")

A Motivating Clinical Example
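The record-level example in this section covers extraction, aggregation, transformation, and interaction. For the remaining two types in the list, encoding and imputation, here is a minimal standard-library sketch. The column values, the category order, and the variable names are assumptions made for illustration.

```python
from statistics import median

# Hypothetical categorical column and lab column with missing values (None)
admission_type = ["emergency", "elective", "urgent", "emergency"]
hemoglobin = [13.5, None, 11.2, 12.4]

# One-hot encoding: one 0/1 column per category, no implied order
categories = sorted(set(admission_type))
one_hot = [[int(value == c) for c in categories] for value in admission_type]

# Ordinal encoding: categories with a natural order map to increasing integers
severity_order = {"mild": 0, "moderate": 1, "severe": 2}
severity = ["moderate", "mild", "severe"]
severity_encoded = [severity_order[s] for s in severity]

# Median imputation plus a missing indicator, so a model can still
# see that the value was not measured
observed = [v for v in hemoglobin if v is not None]
fill_value = median(observed)
hemoglobin_imputed = [fill_value if v is None else v for v in hemoglobin]
hemoglobin_missing = [int(v is None) for v in hemoglobin]

print(categories, one_hot)
print(severity_encoded)
print(hemoglobin_imputed, hemoglobin_missing)
```

Target encoding is left out of the sketch because it uses the label and has to be computed on training folds only, otherwise the label leaks into the feature.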
import pandas as pd
import numpy as np

# Raw EHR data for a diabetic patient
raw_record = {
    "admission_date": "2026-03-15",
    "discharge_date": "2026-03-22",
    "medications": ["metformin 1000mg", "lisinopril 10mg", "aspirin 81mg", "atorvastatin 40mg"],
    "diagnoses": ["T2DM", "HTN", "HLD", "CKD stage 2"],
    "hba1c_history": [8.2, 7.9, 9.1, 8.7],  # last 4 readings
    "age": 67,
    "weight_kg": 94,
    "serum_creatinine": 1.5,
    "prior_admissions": 2,
}

def engineer_features(record: dict) -> dict:
    admission = pd.to_datetime(record["admission_date"])
    discharge = pd.to_datetime(record["discharge_date"])
    return {
        # Extraction
        "length_of_stay": (discharge - admission).days,
        "num_medications": len(record["medications"]),
        "num_diagnoses": len(record["diagnoses"]),
        "on_metformin": any("metformin" in m for m in record["medications"]),
        "on_anticoagulant": any("warfarin" in m or "apixaban" in m for m in record["medications"]),
        # Aggregation from history
        "hba1c_mean": np.mean(record["hba1c_history"]),
        "hba1c_max": np.max(record["hba1c_history"]),
        # Last reading minus first (assumes readings are in chronological order)
        "hba1c_trend": record["hba1c_history"][-1] - record["hba1c_history"][0],
        "hba1c_unstable": np.std(record["hba1c_history"]) > 0.5,
        # Transformation
        "eGFR_estimated": 140 / record["serum_creatinine"],  # simplified placeholder, not a clinical formula
        "bmi": record["weight_kg"] / (1.75 ** 2),  # height hard-coded to 1.75 m: the record has no height field
        # Interaction
        "age_x_ckd": record["age"] * int("CKD" in " ".join(record["diagnoses"])),
        "complexity_score": len(record["medications"]) + len(record["diagnoses"]) * 2,
        # Raw passthrough
        "age": record["age"],
        "prior_admissions": record["prior_admissions"],
    }

features = engineer_features(raw_record)
for k, v in features.items():
    print(f" {k:<25}: {v}")

Why Feature Engineering Often Beats Model Selection
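To make this section's claim concrete before the cross-validated comparison, here is a toy sketch using only the standard library. The six (weight, height) pairs are invented, the label is defined as BMI of at least 30, and `best_stump_accuracy` is a hypothetical helper that finds the best single-threshold rule on one column. It reports training accuracy only and is not a substitute for the experiment that follows.

```python
# Invented (weight_kg, height_m) pairs; label = 1 when BMI >= 30
patients = [(70, 1.50), (95, 1.90), (100, 1.75), (60, 1.60), (85, 1.65), (80, 1.85)]

weights = [w for w, _ in patients]
heights = [h for _, h in patients]
bmi = [w / h ** 2 for w, h in patients]  # engineered feature
labels = [int(b >= 30) for b in bmi]


def best_stump_accuracy(xs, ys):
    """Best training accuracy of a one-feature threshold rule, either direction."""
    best = 0.0
    for t in xs:
        # Rule: predict 1 when x >= t
        hits = sum(int(x >= t) == y for x, y in zip(xs, ys))
        acc = hits / len(ys)
        # 1 - acc is the accuracy of the flipped rule (predict 1 when x < t)
        best = max(best, acc, 1 - acc)
    return best


for name, xs in [("weight", weights), ("height", heights), ("bmi", bmi)]:
    print(f"{name:<7} best threshold accuracy: {best_stump_accuracy(xs, labels):.2f}")
```

On this toy data neither raw column can be split perfectly by a single threshold, while the BMI column can. Because the label is defined from BMI, the sketch shows only that the signal becomes accessible to a simple rule after the transformation, not how large the gain would be on real data.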
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Experiment: raw features vs engineered features, simple vs complex model
X_raw = ...         # 5 raw clinical features
X_engineered = ...  # 20 engineered features from same raw data
y = ...             # binary outcome labels (elided, like the feature matrices)

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),  # higher max_iter to help convergence
    ("GBM", GradientBoostingClassifier()),
]
for model_name, model in models:
    raw_cv = cross_val_score(model, X_raw, y, cv=5, scoring="roc_auc")
    eng_cv = cross_val_score(model, X_engineered, y, cv=5, scoring="roc_auc")
    print(f"{model_name}")
    print(f" Raw features: {raw_cv.mean():.3f}")
    print(f" Engineered features: {eng_cv.mean():.3f}")

# Typical pattern (illustrative numbers):
# Logistic + raw: 0.63
# GBM + raw: 0.71
# Logistic + engineered: 0.74 (better than GBM on raw features)
# GBM + engineered: 0.82 (best: both a good model and good features)

Log Transformation for Skewed Features
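Two properties of `log1p` are worth checking alongside the NumPy example in this section: it is defined at zero, and it can be inverted with `expm1` when a value has to be mapped back to the original units. A standard-library sketch with made-up CRP values:

```python
import math

# Hypothetical CRP values in mg/L, including a zero
crp = [0.0, 1.2, 3.2, 45.0, 120.0]

log_crp = [math.log1p(v) for v in crp]       # log(1 + x), defined at x = 0
restored = [math.expm1(v) for v in log_crp]  # expm1 inverts log1p

raw_range = max(crp) - min(crp)
log_range = max(log_crp) - min(log_crp)
print(f"raw range: {raw_range:.1f}, log range: {log_range:.2f}")
print("round trip ok:", all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(crp, restored)))
```

The same pairing exists in NumPy as `np.log1p` and `np.expm1`.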
# Clinical values often follow log-normal distributions: most patients are low,
# a few are very high (creatinine, CRP, troponin, LDH)
import numpy as np
import pandas as pd

crp_values = np.array([1.2, 1.5, 2.1, 0.8, 45.0, 3.2, 1.9, 120.0, 2.4, 1.1])

# Raw: skewed, a few high values dominate
# Log: compressed, more Gaussian-like
log_crp = np.log1p(crp_values)  # log1p = log(1+x), handles 0 safely

print("Raw CRP: mean={:.1f}, std={:.1f}".format(crp_values.mean(), crp_values.std()))
print("Log CRP: mean={:.2f}, std={:.2f}".format(log_crp.mean(), log_crp.std()))
# Log transform reduces std relative to mean, which suits linear models better

Binning / Bucketization
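One detail to keep in mind when reading the `pd.cut` calls in this section: by default (`right=True`) each bin includes its right edge and excludes its left edge, so age 55 lands in (40, 55], and a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The same rule can be sketched with the standard library. `cut_right_inclusive` is a hypothetical helper, not a pandas function.

```python
from bisect import bisect_left


def cut_right_inclusive(value, edges, labels):
    """Assign value to the interval (edges[i], edges[i+1]], or None if outside."""
    if not (edges[0] < value <= edges[-1]):
        return None
    # bisect_left returns the index of the first edge >= value,
    # which is the right edge of the interval containing it
    return labels[bisect_left(edges, value) - 1]


age_bins = [0, 40, 55, 65, 100]
age_labels = ["young", "middle", "senior", "elderly"]

for age in (40, 41, 55, 65, 66, 0, 101):
    print(age, cut_right_inclusive(age, age_bins, age_labels))
```

Boundary values are where binning bugs usually hide, so it is worth testing them explicitly whichever implementation is used.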
# Convert continuous features to ordinal categories
# Useful when the relationship is non-linear and you want to capture it simply
import pandas as pd

ages = pd.Series([22, 35, 47, 61, 73, 55, 28, 80])

# Age groups used in clinical risk stratification
age_bins = [0, 40, 55, 65, 100]
age_labels = ["young", "middle", "senior", "elderly"]
age_group = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_group.tolist())

# CKD staging from creatinine (simplified)
creatinine = pd.Series([0.8, 1.1, 1.8, 3.2, 5.5])
ckd_stage = pd.cut(creatinine, bins=[0, 1.2, 1.5, 2.0, 3.0, 20],
                   labels=["normal", "stage1", "stage2", "stage3", "stage4-5"])
print(ckd_stage.tolist())

Feature Engineering for NLP / LLM Pipelines
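The prompt builder in this section takes a `patient_meta` dictionary with `age`, `num_medications`, and `is_diabetic` keys. Here is a sketch of deriving those fields from a raw record shaped like the one in the clinical example. `build_patient_meta` and the set of diabetes codes are assumptions made for illustration.

```python
def build_patient_meta(record: dict) -> dict:
    """Derive the structured fields a prompt builder might need from a raw record."""
    diagnoses = [d.upper() for d in record["diagnoses"]]
    return {
        "age": record["age"],
        "num_medications": len(record["medications"]),
        # Assumption: diabetes is recorded with one of these codes
        "is_diabetic": any(d in {"T1DM", "T2DM"} for d in diagnoses),
    }


record = {
    "age": 67,
    "medications": ["metformin 1000mg", "lisinopril 10mg", "aspirin 81mg", "atorvastatin 40mg"],
    "diagnoses": ["T2DM", "HTN", "HLD", "CKD stage 2"],
}
patient_meta = build_patient_meta(record)
print(patient_meta)
```

Computing these fields once, in code, keeps the prompt text consistent across patients instead of relying on the model to infer them from the note.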
# For text inputs to LLMs, feature engineering means:
# 1. Prompt construction: what information to include and in what format
# 2. Metadata as structured context: adding patient demographics alongside clinical notes
# 3. Retrieved context (RAG): relevant retrieved documents as additional features

def build_clinical_prompt(note: str, patient_meta: dict, retrieved_docs: list[str]) -> str:
    """
    Engineers the 'features' for an LLM: the full prompt context.
    """
    meta_str = (
        f"Patient: age {patient_meta['age']}, "
        f"{patient_meta['num_medications']} medications, "
        f"{'diabetic' if patient_meta['is_diabetic'] else 'non-diabetic'}"
    )
    # Keep only the top 3 retrieved documents to bound prompt length
    context = "\n".join(f"- {doc}" for doc in retrieved_docs[:3])
    return f"""## Patient Context
{meta_str}
## Relevant Guidelines
{context}
## Clinical Note
{note}
## Task
Identify potential drug interactions and flag high-risk medications."""

# The "feature matrix" here is the prompt; engineering it is the same discipline

Interview Answer Template
Q: What is feature engineering and why does it matter?
Feature engineering is the process of transforming raw data into inputs that are meaningful and learnable by a model. Raw data (timestamps, free text, IDs, lists of medications) is almost never in a form models can use directly. Feature engineering extracts signals: length of stay from admission and discharge dates, medication count from a list, HbA1c trend from a history of values, BMI from weight and height. It also transforms distributions (log-scaling skewed lab values), creates interactions (age times comorbidity count), aggregates sequences (mean INR over 6 months), and handles missing data. Feature engineering often has more impact than model selection: a logistic regression with well-engineered features frequently outperforms gradient boosting on raw features. In LLM applications, prompt construction is the equivalent: deciding what context to include and how to structure it is feature engineering for the model's attention.
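The time-windowed aggregation mentioned in the answer (mean INR over 6 months) can be sketched as follows. The measurements, the `windowed_mean` helper, and the choice of 182 days as "6 months" are assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical INR measurements as (date, value) pairs
inr_history = [
    (date(2025, 6, 1), 3.4),
    (date(2025, 11, 20), 2.1),
    (date(2026, 1, 15), 2.6),
    (date(2026, 3, 1), 2.8),
]


def windowed_mean(history, as_of, window_days=182):
    """Mean of values measured in the window_days up to and including as_of."""
    start = as_of - timedelta(days=window_days)
    values = [v for d, v in history if start <= d <= as_of]
    return mean(values) if values else None  # None when nothing was measured


as_of = date(2026, 3, 15)  # prediction time: later measurements must not be used
print(windowed_mean(inr_history, as_of))
```

Anchoring the window to an explicit `as_of` date keeps measurements taken after the prediction time out of the feature, which is a common source of leakage in clinical data.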