Machine Learning Foundations · Lesson 38 of 70

Encoding Categorical Variables: One-Hot, Label, Target

Why Categorical Features Need Encoding

Most ML models operate on numbers, so categories (discharge location, drug class, diagnosis code, insurance type) must be converted to numeric form. The choice of encoding affects model performance and interpretability.

One-Hot Encoding

Create one binary column per category. This is the most common choice for nominal categories, which have no inherent order.

Python

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Discharge destination (nominal — no inherent order)
df = pd.DataFrame({
    "discharge_to": ["home", "SNF", "rehab", "home", "SNF", "home_with_help"],
})

# pandas get_dummies — quick and readable
dummies = pd.get_dummies(df["discharge_to"], prefix="discharge")
print(dummies.astype(int))

#    discharge_SNF  discharge_home  discharge_home_with_help  discharge_rehab
# 0              0               1                         0                0
# 1              1               0                         0                0
# ...

# sklearn OneHotEncoder — preferred for pipelines (handles unseen categories)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_ohe = ohe.fit_transform(df[["discharge_to"]])
print(ohe.categories_)

The Dummy Variable Trap

Python

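# With k categories, k-1 binary columns carry all the information
# The k-th column equals 1 minus the sum of the others, so with an intercept the columns are perfectly collinear
# This matters for unregularized linear models (coefficients are not uniquely determined); tree-based models are unaffected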

ohe_drop = OneHotEncoder(drop="first", sparse_output=False)
X_no_trap = ohe_drop.fit_transform(df[["discharge_to"]])
# One fewer column (3 instead of 4), so linear regression no longer has the collinearity issue

Ordinal Encoding

Map categories to integers, preserving their order. Use only when a natural order exists.

Python

from sklearn.preprocessing import OrdinalEncoder

# CKD stage has a meaningful order
ckd_stages = pd.DataFrame({
    "ckd_stage": ["stage1", "stage3", "stage2", "stage4", "stage1", "stage5"]
})

oe = OrdinalEncoder(categories=[["stage1", "stage2", "stage3", "stage4", "stage5"]])
ckd_encoded = oe.fit_transform(ckd_stages)
print(ckd_encoded.flatten())   # [0. 2. 1. 3. 0. 4.]  (OrdinalEncoder returns floats by default)

# Other ordinal examples:
# HbA1c control: "well_controlled" < "moderate" < "poor"
# Disease severity: "mild" < "moderate" < "severe"
# Education level: "none" < "high_school" < "college" < "graduate"
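Passing `categories=` explicitly matters because `OrdinalEncoder` otherwise assigns codes in sorted (alphabetical) order. A small sketch using the HbA1c control example from the comment above; the four sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows; clinical order is well_controlled < moderate < poor
control = pd.DataFrame({
    "hba1c_control": ["well_controlled", "poor", "moderate", "well_controlled"],
})

# Default categories="auto": codes follow sorted order, not clinical order
auto = OrdinalEncoder()
auto_codes = auto.fit_transform(control).ravel()

# Explicit order: codes follow the clinical ordering
explicit = OrdinalEncoder(categories=[["well_controlled", "moderate", "poor"]])
explicit_codes = explicit.fit_transform(control).ravel()

print(auto.categories_[0])  # ['moderate' 'poor' 'well_controlled']
print(auto_codes)           # [2. 1. 0. 2.]
print(explicit_codes)       # [0. 2. 1. 0.]
```

With the default setting, "well_controlled" receives the highest code, which inverts the intended ordering without raising any error.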

Target Encoding

Replace each category with the mean of the target for that group. Useful for high-cardinality categoricals.

Python

import pandas as pd
import numpy as np

def target_encode(df: pd.DataFrame, col: str, target: str, smooth: float = 1.0) -> pd.Series:
    """
    Smoothed target encoding: blend category mean with global mean.
    smooth=0 → pure category mean (overfits rare categories)
    smooth=large → pulls toward global mean (safer for rare categories)
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Smoothed: (count * category_mean + smooth * global_mean) / (count + smooth)
    smoothed = (stats["count"] * stats["mean"] + smooth * global_mean) / (stats["count"] + smooth)
    return df[col].map(smoothed).fillna(global_mean)

# Example: drug class → 30-day readmission rate
df_drugs = pd.DataFrame({
    "drug_class":  ["anticoagulant", "antidiabetic", "antihypertensive", "anticoagulant",
                    "antibiotic", "antidiabetic", "antihypertensive", "anticoagulant"],
    "readmitted":  [1, 0, 0, 1, 0, 1, 0, 1],
})

df_drugs["drug_class_encoded"] = target_encode(df_drugs, "drug_class", "readmitted", smooth=1.0)
print(df_drugs[["drug_class", "drug_class_encoded"]].drop_duplicates())
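To see what the `smooth` parameter does, the sketch below recomputes the per-category values for the same drug data at three smoothing levels. The helper `smoothed_means` is a hypothetical name; it uses the same formula as `target_encode` but returns one value per category instead of one per row.

```python
import pandas as pd

df_drugs = pd.DataFrame({
    "drug_class": ["anticoagulant", "antidiabetic", "antihypertensive", "anticoagulant",
                   "antibiotic", "antidiabetic", "antihypertensive", "anticoagulant"],
    "readmitted": [1, 0, 0, 1, 0, 1, 0, 1],
})

def smoothed_means(df: pd.DataFrame, col: str, target: str, smooth: float) -> pd.Series:
    """Per-category smoothed target mean (hypothetical helper, same formula as above)."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + smooth * global_mean) / (stats["count"] + smooth)

comparison = pd.DataFrame({
    "count": df_drugs.groupby("drug_class")["readmitted"].count(),
    "smooth=0": smoothed_means(df_drugs, "drug_class", "readmitted", 0.0),
    "smooth=1": smoothed_means(df_drugs, "drug_class", "readmitted", 1.0),
    "smooth=10": smoothed_means(df_drugs, "drug_class", "readmitted", 10.0),
})
print(comparison.round(3))
```

The global readmission rate here is 0.5. "antibiotic" appears once with no readmission, so its encoding moves from 0.0 (no smoothing) to 0.25 (`smooth=1`) and about 0.455 (`smooth=10`), while "anticoagulant", with three rows, moves from 1.0 to 0.875 to about 0.615.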

Target Encoding Leakage Warning

Python

# Target encoding MUST be computed on training data only
# Fitting on all data leaks validation/test targets into training features

from sklearn.model_selection import train_test_split

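# Split only the raw feature column: df_drugs also holds the target and the
# "drug_class_encoded" column computed on all rows above, both of which would leak
X_train, X_test, y_train, y_test = train_test_split(
    df_drugs[["drug_class"]], df_drugs["readmitted"], test_size=0.3, random_state=42
)
X_test = X_test.copy()   # work on a copy so the assignment below is unambiguous

# Compute encoding from training data only
train_means = y_train.groupby(X_train["drug_class"]).mean()
global_mean = y_train.mean()

# Apply to test using training statistics; unseen categories fall back to the training global mean
X_test["drug_class_encoded"] = X_test["drug_class"].map(train_means).fillna(global_mean)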
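The same leakage concern applies within the training set: if a training row is encoded with a mean that includes its own label, the feature partly memorizes the target. One common remedy is out-of-fold encoding, sketched below. The function name `oof_target_encode`, the fold count, and the seed are illustrative assumptions, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cats: pd.Series, y: pd.Series, n_splits: int = 4,
                      smooth: float = 1.0, seed: int = 0) -> pd.Series:
    """Encode each row using target statistics from the other folds only (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=cats.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cats):
        fit_cats, fit_y = cats.iloc[fit_idx], y.iloc[fit_idx]
        global_mean = fit_y.mean()
        stats = fit_y.groupby(fit_cats).agg(["mean", "count"])
        smoothed = (stats["count"] * stats["mean"] + smooth * global_mean) / (stats["count"] + smooth)
        # Categories absent from the fitting folds fall back to their global mean
        values = cats.iloc[enc_idx].map(smoothed).fillna(global_mean)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

train = pd.DataFrame({
    "drug_class": ["anticoagulant", "antidiabetic", "antihypertensive", "anticoagulant",
                   "antibiotic", "antidiabetic", "antihypertensive", "anticoagulant"],
    "readmitted": [1, 0, 0, 1, 0, 1, 0, 1],
})
train["drug_class_oof"] = oof_target_encode(train["drug_class"], train["readmitted"])
print(train)
```

Each row's encoding is computed without its own label, so the model sees training features that behave more like the ones it will see at test time. With only eight rows the numbers are noisy; the point is the mechanics, not the values.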

High-Cardinality: ICD-10 Diagnosis Codes

Python

# ICD-10 codes: thousands of unique values
# One-hot encoding → thousands of columns (too sparse)
# Target encoding → manageable with smoothing
# Hierarchical rollup → use the first 3 characters (category level)

diagnoses = pd.Series(["E11.9", "I10", "E11.641", "N18.3", "E11.9", "Z99.2"])

# Option 1: Roll up to the 3-character ICD-10 category
diagnoses_category = diagnoses.str[:3]
print(diagnoses_category.unique())   # ['E11', 'I10', 'N18', 'Z99']
# Now only 4 unique values instead of 5

# Option 2: Binary flags for clinically important categories
icd_flags = pd.DataFrame({
    "diabetes":       diagnoses.str.startswith("E11").astype(int),
    "hypertension":   diagnoses.str.startswith("I10").astype(int),
    "ckd":            diagnoses.str.startswith("N18").astype(int),
})
print(icd_flags)

Choosing the Right Encoding

| Feature Type | Cardinality | Recommended Encoding |
|---|---|---|
| Nominal (no order) | Low (2–10) | One-hot |
| Nominal | High (10–1000+) | Target encoding or frequency encoding |
| Ordinal (has order) | Any | Ordinal encoding |
| Binary (yes/no) | 2 | Binary (0/1) directly |
| Cyclical (months, hours) | Fixed | Sin/cos transform |
| Text / free text | N/A | TF-IDF, embeddings |
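The table lists frequency encoding and the sin/cos transform, which this lesson does not otherwise show. The sketch below is one plausible way to do each in pandas and NumPy. The sample codes, the fallback of 0.0 for unseen categories, and the 12-month period are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Frequency encoding: replace each category with its share of the training rows
train_codes = pd.Series(["E11", "I10", "E11", "N18", "E11", "Z99"])
test_codes = pd.Series(["E11", "J45"])  # "J45" is not in the training data

freq = train_codes.value_counts(normalize=True)
train_freq = train_codes.map(freq)
test_freq = test_codes.map(freq).fillna(0.0)  # unseen category: assumed frequency 0
print(test_freq.tolist())  # [0.5, 0.0]

# Cyclical encoding: place months on a circle so December sits next to January
months = pd.Series([1, 6, 12])
angle = 2 * np.pi * (months - 1) / 12
cyc = pd.DataFrame({
    "month": months,
    "month_sin": np.sin(angle),
    "month_cos": np.cos(angle),
})

pts = cyc[["month_sin", "month_cos"]].to_numpy()
dec_to_jan = np.linalg.norm(pts[2] - pts[0])
jun_to_jan = np.linalg.norm(pts[1] - pts[0])
print(round(dec_to_jan, 3), round(jun_to_jan, 3))  # December is much closer to January than June is
```

As with target encoding, the frequencies are computed from training data and then applied to new data. The cyclical transform needs both the sine and cosine columns, since either one alone maps two different months to the same value.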

Encoding in a Pipeline

Python

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

numeric_cols     = ["age", "serum_creatinine", "num_medications"]
onehot_cols      = ["discharge_to", "insurance_type"]
ordinal_cols     = ["ckd_stage"]
ordinal_order    = [["stage1", "stage2", "stage3", "stage4", "stage5"]]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler",  StandardScaler()),
    ]), numeric_cols),

    ("ohe", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), onehot_cols),

    ("ord", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(categories=ordinal_order)),
    ]), ordinal_cols),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

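from sklearn.model_selection import cross_val_score

# X: DataFrame containing the columns listed above; y: binary target (both assumed to be defined elsewhere)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")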
print(f"Full pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

Interview Answer Template

Q: How do you encode categorical variables in ML?

The choice depends on whether the category has a meaningful order and how many unique values it has. For nominal categories with few values (discharge location, insurance type), I use one-hot encoding: a binary column per category. For ordinal categories with a natural order (CKD stage 1–5, disease severity mild/moderate/severe), I use ordinal encoding to preserve the ordering. For high-cardinality categoricals like ICD-10 codes (thousands of values), one-hot encoding produces sparse, unmanageable matrices, so I either roll up to a coarser level (the 3-character ICD-10 category), use target encoding (replace each category with its readmission rate computed from training data), or extract binary flags for clinically important groups. Target encoding requires care: compute it only from training data to avoid leaking validation/test labels. Everything goes inside a sklearn Pipeline so cross-validation refits the encoders on each training fold.

Next Lesson

Feature Selection: Filter, Wrapper, Embedded