Machine Learning Foundations · Lesson 38 of 70
Encoding Categorical Variables: One-Hot, Label, Target
Why Categorical Features Need Encoding
ML models operate on numbers. Categories — discharge location, drug class, diagnosis code, insurance type — must be converted to numeric form. The choice of encoding affects model performance and interpretability.
One-Hot Encoding
Create one binary column per category. This is the most common choice for nominal categories, which have no inherent order.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Discharge destination (nominal — no inherent order)
df = pd.DataFrame({
    "discharge_to": ["home", "SNF", "rehab", "home", "SNF", "home_with_help"],
})
# pandas get_dummies — quick and readable
dummies = pd.get_dummies(df["discharge_to"], prefix="discharge")
print(dummies.astype(int))
# discharge_SNF discharge_home discharge_home_with_help discharge_rehab
# 0 0 1 0 0
# 1 1 0 0 0
# ...
# sklearn OneHotEncoder — preferred for pipelines (handles unseen categories)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_ohe = ohe.fit_transform(df[["discharge_to"]])
print(ohe.categories_)
The Dummy Variable Trap
# With k categories, you only need k-1 binary columns
# The k-th is perfectly predicted by the others (collinearity)
# Important for linear models; harmless for trees
ohe_drop = OneHotEncoder(drop="first", sparse_output=False)
X_no_trap = ohe_drop.fit_transform(df[["discharge_to"]])
# One fewer column, so linear regression won't have collinearity issues
Ordinal Encoding
Map categories to integers, preserving their order. Use only when a natural order exists.
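To see why the order has to be stated explicitly instead of being inferred from the labels, here is a minimal sketch using pandas `Categorical` (a swapped-in alternative to scikit-learn's encoder, chosen only to keep the example small). The `risk` values are invented for illustration.

```python
import pandas as pd

# Hypothetical risk levels, made up for this example
risk = pd.Series(["low", "high", "medium", "low"])

# Explicit order: low < medium < high
explicit_codes = pd.Categorical(
    risk, categories=["low", "medium", "high"], ordered=True
).codes

# No order given: pandas sorts the labels alphabetically (high, low, medium)
alphabetical_codes = pd.Categorical(risk).codes

print(list(explicit_codes))      # [0, 2, 1, 0]
print(list(alphabetical_codes))  # [1, 0, 2, 1]
```

With alphabetical ordering, "high" receives the smallest code, so the integers no longer reflect the clinical ordering. Passing the category list explicitly avoids this.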
from sklearn.preprocessing import OrdinalEncoder
# CKD stage has a meaningful order
ckd_stages = pd.DataFrame({
    "ckd_stage": ["stage1", "stage3", "stage2", "stage4", "stage1", "stage5"]
})
oe = OrdinalEncoder(categories=[["stage1", "stage2", "stage3", "stage4", "stage5"]])
ckd_encoded = oe.fit_transform(ckd_stages)
print(ckd_encoded.flatten())  # [0. 2. 1. 3. 0. 4.] (OrdinalEncoder returns floats)
# Other ordinal examples:
# HbA1c control: "well_controlled" < "moderate" < "poor"
# Disease severity: "mild" < "moderate" < "severe"
# Education level: "none" < "high_school" < "college" < "graduate"
Target Encoding
Replace each category with the mean of the target within that category. This is useful for high-cardinality categoricals, where one-hot encoding would produce too many columns.
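A raw per-category mean is unreliable when a category has only a handful of rows, which is why the mean is usually blended with the global mean. The worked calculation below is a sketch with invented counts. The helper name `smoothed_target_mean` and the rate of 0.20 are assumptions for illustration only.

```python
def smoothed_target_mean(positives, count, global_mean, smooth):
    """Blend a category's observed rate with the global rate."""
    return (positives + smooth * global_mean) / (count + smooth)

global_rate = 0.20  # hypothetical overall readmission rate

# Rare category: 1 patient, 1 readmission, so the raw mean is 1.0
rare = smoothed_target_mean(positives=1, count=1, global_mean=global_rate, smooth=10)

# Common category: 100 patients, 50 readmissions, so the raw mean is 0.5
common = smoothed_target_mean(positives=50, count=100, global_mean=global_rate, smooth=10)

print(round(rare, 3), round(common, 3))  # 0.273 0.473
```

The rare category is pulled most of the way back to the global rate, while the common category stays close to its own observed mean.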
import pandas as pd
import numpy as np
def target_encode(df: pd.DataFrame, col: str, target: str, smooth: float = 1.0) -> pd.Series:
    """
    Smoothed target encoding: blend category mean with global mean.
    smooth=0 → pure category mean (overfits rare categories)
    smooth=large → pulls toward global mean (safer for rare categories)
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Smoothed: (count * category_mean + smooth * global_mean) / (count + smooth)
    smoothed = (stats["count"] * stats["mean"] + smooth * global_mean) / (stats["count"] + smooth)
    return df[col].map(smoothed).fillna(global_mean)
# Example: drug class → 30-day readmission rate
df_drugs = pd.DataFrame({
    "drug_class": ["anticoagulant", "antidiabetic", "antihypertensive", "anticoagulant",
                   "antibiotic", "antidiabetic", "antihypertensive", "anticoagulant"],
    "readmitted": [1, 0, 0, 1, 0, 1, 0, 1],
})
df_drugs["drug_class_encoded"] = target_encode(df_drugs, "drug_class", "readmitted", smooth=1.0)
print(df_drugs[["drug_class", "drug_class_encoded"]].drop_duplicates())
Target Encoding Leakage Warning
# Target encoding MUST be computed on training data only
# Fitting on all data leaks validation/test targets into training features
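from sklearn.model_selection import train_test_split
# Fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    df_drugs, df_drugs["readmitted"], test_size=0.3, random_state=42
)
# Copy the splits so that adding a column does not write into a slice of df_drugs
X_train, X_test = X_train.copy(), X_test.copy()
# Compute encoding from training data
train_means = y_train.groupby(X_train["drug_class"]).mean()
global_mean = y_train.mean()
# Apply to test using training statistics; unseen categories fall back to the global mean
X_test["drug_class_encoded"] = X_test["drug_class"].map(train_means).fillna(global_mean)
High-Cardinality: ICD-10 Diagnosis Codes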
# ICD-10 codes: thousands of unique values
# One-hot encoding → thousands of columns (too sparse)
# Target encoding → manageable with smoothing
# Hierarchical rollup → use the first 3 characters (category level)
diagnoses = pd.Series(["E11.9", "I10", "E11.641", "N18.3", "E11.9", "Z99.2"])
# Option 1: Roll up to the 3-character ICD-10 category
diagnoses_category = diagnoses.str[:3]
print(diagnoses_category.unique()) # ['E11', 'I10', 'N18', 'Z99']
# Now only 4 unique values instead of 5
# Option 2: Binary flags for clinically important categories
icd_flags = pd.DataFrame({
    "diabetes": diagnoses.str.startswith("E11").astype(int),
    "hypertension": diagnoses.str.startswith("I10").astype(int),
    "ckd": diagnoses.str.startswith("N18").astype(int),
})
print(icd_flags)
Choosing the Right Encoding
| Feature Type | Cardinality | Recommended Encoding |
|---|---|---|
| Nominal (no order) | Low (2–10) | One-hot |
| Nominal | High (10–1000+) | Target encoding or frequency encoding |
| Ordinal (has order) | Any | Ordinal encoding |
| Binary (yes/no) | 2 | Binary (0/1) directly |
| Cyclical (months, hours) | Fixed | Sin/cos transform |
| Text / free text | N/A | TF-IDF, embeddings |
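Two rows of the table, frequency encoding and the sin/cos transform, can be sketched in a few lines. This is a minimal illustration, assuming pandas and NumPy. The `admissions` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical admissions data, invented for illustration
admissions = pd.DataFrame({
    "admit_hour": [0, 6, 12, 18, 23],
    "icd_category": ["E11", "I10", "E11", "N18", "E11"],
})

# Cyclical: place the hour on a circle so 23:00 and 00:00 end up close together
angle = 2 * np.pi * admissions["admit_hour"] / 24
admissions["hour_sin"] = np.sin(angle)
admissions["hour_cos"] = np.cos(angle)

# Frequency: replace each category with its share of rows
freq = admissions["icd_category"].value_counts(normalize=True)
admissions["icd_freq"] = admissions["icd_category"].map(freq)

print(admissions.round(3))
```

Both sine and cosine are needed because either one alone maps two different hours to the same value. Frequency encoding can also collide: here `I10` and `N18` receive the same value, so the model cannot tell them apart from this feature alone. As with target encoding, the frequencies would normally be computed on the training split and then applied to validation and test data.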
Encoding in a Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
numeric_cols = ["age", "serum_creatinine", "num_medications"]
onehot_cols = ["discharge_to", "insurance_type"]
ordinal_cols = ["ckd_stage"]
ordinal_order = [["stage1", "stage2", "stage3", "stage4", "stage5"]]
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("ohe", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), onehot_cols),
    ("ord", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(categories=ordinal_order)),
    ]), ordinal_cols),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
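from sklearn.model_selection import cross_val_score
# X (a DataFrame containing the columns listed above) and y (a binary target) are assumed to be loaded already
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Full pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
Interview Answer Template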
Q: How do you encode categorical variables in ML?
The choice depends on whether the category has a meaningful order and how many unique values it has. For nominal categories with few values (discharge location, insurance type), I use one-hot encoding: a binary column per category. For ordinal categories with a natural order (CKD stage 1–5, disease severity mild/moderate/severe), I use ordinal encoding to preserve the ordering. For high-cardinality categoricals like ICD-10 codes (thousands of values), one-hot encoding produces sparse, unmanageable matrices, so I either roll up to a coarser level (the 3-character ICD-10 category), use target encoding (replace each code with its per-category readmission rate from training data), or extract binary flags for clinically important groups. Target encoding requires care: compute it only from training data to avoid leaking validation/test labels. The encoders go inside a sklearn Pipeline so that cross-validation refits them on each training fold.
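The warning about computing target encoding only from training data applies inside cross-validation as well. One common way to handle it is out-of-fold encoding, where each row is encoded from the target statistics of the other folds. The sketch below assumes pandas and NumPy. The function name `out_of_fold_target_encode` and its random fold assignment are illustrative and not part of any library. The data reuses the drug-class example.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(df, col, target, n_splits=4, smooth=1.0, seed=0):
    """Encode each row from target statistics of the other folds only (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Random, equal-sized fold labels; a hypothetical stand-in for a proper CV splitter
    fold_ids = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        fit_part = df[~held_out]
        global_mean = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["mean", "count"])
        smoothed = (stats["count"] * stats["mean"] + smooth * global_mean) / (stats["count"] + smooth)
        # Categories absent from the fitting folds fall back to the global mean
        values = df.loc[held_out, col].map(smoothed).fillna(global_mean)
        encoded.loc[held_out] = values.to_numpy()
    return encoded

df_drugs = pd.DataFrame({
    "drug_class": ["anticoagulant", "antidiabetic", "antihypertensive", "anticoagulant",
                   "antibiotic", "antidiabetic", "antihypertensive", "anticoagulant"],
    "readmitted": [1, 0, 0, 1, 0, 1, 0, 1],
})
df_drugs["drug_class_oof"] = out_of_fold_target_encode(df_drugs, "drug_class", "readmitted")
print(df_drugs)
```

No row's own label contributes to its encoded value, so the encoded feature cannot memorize the target. Recent scikit-learn releases also include a `TargetEncoder` in `sklearn.preprocessing` that applies a similar cross-fitting scheme during `fit_transform`. Check that your installed version provides it before relying on it.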