What is Feature Engineering?

Feature engineering fundamentals: transforming raw data into model-ready inputs, types of feature engineering, domain-driven feature creation, and why it often matters more than model choice.

Asma Hafeez Khan | May 16, 2026 | 5 min read
Machine Learning, Feature Engineering, Preprocessing, Data Science, Interview

What Feature Engineering Is

Feature engineering is the process of transforming raw data into inputs that a model can learn from effectively. It sits between raw data and model training.

Raw data → [Feature Engineering] → Feature matrix → [Model] → Predictions

A well-chosen feature can let a simple model outperform a complex model trained on raw inputs. In practical ML projects, feature engineering is often where the most leverage lies.


Types of Feature Engineering

1. Extraction      — derive numeric signals from complex raw data
                     (count medications from a list, extract year from timestamp)

2. Transformation  — change the distribution or scale of existing features
                     (log-transform skewed values, bin age into categories)

3. Interaction     — combine features to capture relationships
                     (age × medication count, creatinine × weight)

4. Aggregation     — summarize sequences or groups into scalar features
                     (mean INR over past 6 months, admission count in past year)

5. Encoding        — convert categorical data to numeric form
                     (one-hot, ordinal, target encoding)

6. Imputation      — fill in missing values with meaningful estimates
                     (median for lab values, -1 sentinel for "not measured")
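
Encoding and imputation do not appear in the clinical example that follows, so here is a minimal pandas sketch of both. The table, its column names (`ward`, `severity`, `lactate`) and its values are invented for illustration, and the median fill is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; columns and values are invented for illustration.
df = pd.DataFrame({
    "ward":     ["cardiology", "renal", "cardiology", "icu"],
    "severity": ["mild", "severe", "moderate", "mild"],
    "lactate":  [1.1, np.nan, 2.4, 0.9],   # NaN = lab not measured
})

# Encoding: one-hot for an unordered category
ward_onehot = pd.get_dummies(df["ward"], prefix="ward")

# Encoding: ordinal mapping for a category with a natural order
severity_order = {"mild": 0, "moderate": 1, "severe": 2}
df["severity_ord"] = df["severity"].map(severity_order)

# Imputation: keep a "was missing" flag, then fill with the median
df["lactate_missing"] = df["lactate"].isna().astype(int)
df["lactate_filled"] = df["lactate"].fillna(df["lactate"].median())

features = pd.concat(
    [ward_onehot, df[["severity_ord", "lactate_missing", "lactate_filled"]]],
    axis=1,
)
print(features)
```

In a real pipeline the median would normally be computed on the training split only and reused at inference time, so that no information leaks from the test data.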

A Motivating Clinical Example

Python
import pandas as pd
import numpy as np

# Raw EHR data for a diabetic patient
raw_record = {
    "admission_date":   "2026-03-15",
    "discharge_date":   "2026-03-22",
    "medications":      ["metformin 1000mg", "lisinopril 10mg", "aspirin 81mg", "atorvastatin 40mg"],
    "diagnoses":        ["T2DM", "HTN", "HLD", "CKD stage 2"],
    "hba1c_history":    [8.2, 7.9, 9.1, 8.7],   # last 4 readings
    "age":              67,
    "weight_kg":        94,
    "serum_creatinine": 1.5,
    "prior_admissions": 2,
}

def engineer_features(record: dict) -> dict:
    admission = pd.to_datetime(record["admission_date"])
    discharge  = pd.to_datetime(record["discharge_date"])

    return {
        # Extraction
        "length_of_stay":      (discharge - admission).days,
        "num_medications":     len(record["medications"]),
        "num_diagnoses":       len(record["diagnoses"]),
        "on_metformin":        any("metformin" in m for m in record["medications"]),
        "on_anticoagulant":    any("warfarin" in m or "apixaban" in m for m in record["medications"]),

        # Aggregation from history
        "hba1c_mean":          np.mean(record["hba1c_history"]),
        "hba1c_max":           np.max(record["hba1c_history"]),
        "hba1c_trend":         record["hba1c_history"][-1] - record["hba1c_history"][0],
        "hba1c_unstable":      np.std(record["hba1c_history"]) > 0.5,

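        # Transformation
        # Crude placeholder for kidney function; not a validated eGFR equation
        "eGFR_estimated":      140 / record["serum_creatinine"],
        # The record has no height field, so height is fixed at 1.75 m here
        "bmi":                 record["weight_kg"] / (1.75 ** 2),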

        # Interaction
        "age_x_ckd":           record["age"] * int("CKD" in " ".join(record["diagnoses"])),
        "complexity_score":    len(record["medications"]) + len(record["diagnoses"]) * 2,

        # Raw passthrough
        "age":                 record["age"],
        "prior_admissions":    record["prior_admissions"],
    }

features = engineer_features(raw_record)
for k, v in features.items():
    print(f"  {k:<25}: {v}")
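
To feed a model, per-record dictionaries like the one above are usually stacked into a table with one row per record. The sketch below assumes two hand-written feature dicts with hypothetical values, standing in for the output of `engineer_features`, and shows one way to turn them into a numeric feature matrix.

```python
import pandas as pd

# Hypothetical engineered rows, shaped like the dict returned above.
rows = [
    {"length_of_stay": 7, "num_medications": 4, "on_metformin": True,  "hba1c_mean": 8.475},
    {"length_of_stay": 2, "num_medications": 1, "on_metformin": False, "hba1c_mean": 6.1},
]

X = pd.DataFrame(rows)

# Most estimators expect numeric input, so cast boolean flags to 0/1.
bool_cols = X.select_dtypes(include="bool").columns
X = X.astype({col: int for col in bool_cols})

print(X.dtypes)
print(X)
```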

Why Feature Engineering Often Beats Model Selection

Python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

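# Experiment: raw features vs engineered features, simple vs complex model

X_raw        = ...  # 5 raw clinical features
X_engineered = ...  # 20 engineered features from same raw data
y            = ...  # binary outcome labels aligned with the rows above

# max_iter is raised to give the logistic regression solver more room to converge
for model_name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)), ("GBM", GradientBoostingClassifier())]: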
    raw_cv  = cross_val_score(model, X_raw,        y, cv=5, scoring="roc_auc")
    eng_cv  = cross_val_score(model, X_engineered, y, cv=5, scoring="roc_auc")
    print(f"{model_name}")
    print(f"  Raw features:        {raw_cv.mean():.3f}")
    print(f"  Engineered features: {eng_cv.mean():.3f}")

# Illustrative result (actual numbers depend on the dataset):
# Logistic + raw:        0.63
# GBM + raw:             0.71
# Logistic + engineered: 0.74   ← better than GBM on raw features
# GBM + engineered:      0.82   ← best (both good model and good features)
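
The comparison above uses placeholders for the data, so it cannot be run as written. A small synthetic version of the same idea is sketched below. The data is randomly generated, not clinical, and is deliberately constructed so that the label depends on an interaction between two columns. On data built this way, a linear model on the raw columns should sit near chance, while adding the product feature should lift it substantially.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (not clinical): the label depends on the sign of x1 * x2,
# a pure interaction that neither raw column reveals on its own.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (x1 * x2 + 0.1 * rng.normal(size=n) > 0).astype(int)

X_raw = np.column_stack([x1, x2])
X_eng = np.column_stack([x1, x2, x1 * x2])   # add the interaction feature

model = LogisticRegression(max_iter=1000)
raw_auc = cross_val_score(model, X_raw, y, cv=5, scoring="roc_auc").mean()
eng_auc = cross_val_score(model, X_eng, y, cv=5, scoring="roc_auc").mean()

print(f"Logistic, raw features:       {raw_auc:.3f}")
print(f"Logistic, + interaction term: {eng_auc:.3f}")
```

The example is rigged in favor of the engineered feature, so it illustrates the mechanism rather than the size of the gain to expect on real data.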

Log Transformation for Skewed Features

Python
# Clinical values often follow log-normal distributions: most patients are low,
# a few are very high (creatinine, CRP, troponin, LDH)

import numpy as np
import pandas as pd

crp_values = np.array([1.2, 1.5, 2.1, 0.8, 45.0, 3.2, 1.9, 120.0, 2.4, 1.1])

# Raw: skewed — a few high values dominate
# Log: compressed, more Gaussian-like
log_crp = np.log1p(crp_values)   # log1p = log(1+x), handles 0 safely

print("Raw CRP:    mean={:.1f}, std={:.1f}".format(crp_values.mean(), crp_values.std()))
print("Log CRP:    mean={:.2f}, std={:.2f}".format(log_crp.mean(), log_crp.std()))
# Log transform reduces std relative to mean — better for linear models
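
One way to check whether the transform helped is to compare skewness before and after. The sketch below reuses the same CRP values and pandas' built-in `Series.skew()`, and also shows that `np.expm1` inverts `np.log1p`, which matters if predictions or summaries need to be reported in the original units.

```python
import numpy as np
import pandas as pd

crp = pd.Series([1.2, 1.5, 2.1, 0.8, 45.0, 3.2, 1.9, 120.0, 2.4, 1.1])
log_crp = np.log1p(crp)

# Sample skewness: 0 for a symmetric distribution, positive for a long right tail.
raw_skew = crp.skew()
log_skew = log_crp.skew()
print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")

# log1p is inverted by expm1, so values can be mapped back to the original units.
recovered = np.expm1(log_crp)
print(np.allclose(recovered, crp))
```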

Binning / Bucketization

Python
# Convert continuous features to ordinal categories
# Useful when the relationship is non-linear and you want to capture it simply

import pandas as pd

ages = pd.Series([22, 35, 47, 61, 73, 55, 28, 80])

# Age groups used in clinical risk stratification
age_bins   = [0, 40, 55, 65, 100]
age_labels = ["young", "middle", "senior", "elderly"]

age_group = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_group.tolist())

# Creatinine bands, simplified for illustration only.
# Clinical CKD staging is based on eGFR, not on raw creatinine.
creatinine = pd.Series([0.8, 1.1, 1.8, 3.2, 5.5])
ckd_stage = pd.cut(creatinine, bins=[0, 1.2, 1.5, 2.0, 3.0, 20],
                   labels=["normal", "stage1", "stage2", "stage3", "stage4-5"])
print(ckd_stage.tolist())
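
Bin edges deserve a second look, because `pd.cut` closes intervals on the right by default. The sketch below reuses the age bins from above with a few hand-picked edge values (chosen here for illustration) to show how `right=False` shifts boundary cases into the next group, and what happens to a value outside every bin.

```python
import pandas as pd

age_bins   = [0, 40, 55, 65, 100]
age_labels = ["young", "middle", "senior", "elderly"]
edge_ages  = pd.Series([40, 55, 65, 105])

# Default right=True: intervals are (0, 40], (40, 55], (55, 65], (65, 100]
right_closed = pd.cut(edge_ages, bins=age_bins, labels=age_labels)

# right=False: intervals are [0, 40), [40, 55), [55, 65), [65, 100)
left_closed = pd.cut(edge_ages, bins=age_bins, labels=age_labels, right=False)

# 105 lies outside every bin, so it becomes NaN in both results
print(right_closed.tolist())
print(left_closed.tolist())
```

Which convention is right depends on how the thresholds are defined in the source guideline, so it is worth making the choice explicit instead of relying on the default.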

Feature Engineering for NLP / LLM Pipelines

Python
# For text inputs to LLMs, feature engineering means:
# 1. Prompt construction — what information to include and in what format
# 2. Metadata as structured context — appending patient demographics to clinical notes
# 3. Retrieved context — RAG: retrieved relevant documents as additional features

def build_clinical_prompt(note: str, patient_meta: dict, retrieved_docs: list[str]) -> str:
    """
    Engineers the 'features' for an LLM: the full prompt context.
    """
    meta_str = (
        f"Patient: age {patient_meta['age']}, "
        f"{patient_meta['num_medications']} medications, "
        f"{'diabetic' if patient_meta['is_diabetic'] else 'non-diabetic'}"
    )
    context = "\n".join(f"- {doc}" for doc in retrieved_docs[:3])

    return f"""## Patient Context
{meta_str}

## Relevant Guidelines
{context}

## Clinical Note
{note}

## Task
Identify potential drug interactions and flag high-risk medications."""

# The "feature matrix" here is the prompt — engineering it is the same discipline
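
The `patient_meta` argument is itself an engineered feature set. As a rough sketch, it could be derived from a raw record like the one in the clinical example. `build_patient_meta` is a hypothetical helper, and the assumption that diabetes appears as `T1DM` or `T2DM` in the diagnosis list is mine, not something the record format guarantees.

```python
def build_patient_meta(record: dict) -> dict:
    """Derive prompt metadata from a raw record (hypothetical helper)."""
    diagnoses = [d.upper() for d in record["diagnoses"]]
    return {
        "age":             record["age"],
        "num_medications": len(record["medications"]),
        # Assumption: diabetes is coded as "T1DM" or "T2DM" in the diagnosis list.
        "is_diabetic":     any(d in ("T1DM", "T2DM") for d in diagnoses),
    }

record = {
    "age": 67,
    "medications": ["metformin 1000mg", "lisinopril 10mg", "aspirin 81mg", "atorvastatin 40mg"],
    "diagnoses": ["T2DM", "HTN", "HLD", "CKD stage 2"],
}
patient_meta = build_patient_meta(record)
print(patient_meta)
```

The resulting dict has the three keys that `build_clinical_prompt` reads, so it can be passed in as `patient_meta` directly.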

Interview Answer Template

Q: What is feature engineering and why does it matter?

Feature engineering is the process of transforming raw data into inputs that are meaningful and learnable by a model. Raw data — timestamps, free text, IDs, lists of medications — is almost never in a form models can use directly. Feature engineering extracts signals: length of stay from admission and discharge dates, medication count from a list, HbA1c trend from a history of values, BMI from weight and height. It also transforms distributions (log-scaling skewed lab values), creates interactions (age times comorbidity count), aggregates sequences (mean INR over 6 months), and handles missing data. Feature engineering often has more impact than model selection — a logistic regression with well-engineered features frequently outperforms gradient boosting on raw features. In LLM applications, prompt construction is the equivalent: deciding what context to include and how to structure it is feature engineering for the model's attention.
