Machine Learning Foundations · Lesson 39 of 70

Feature Selection: Filter, Wrapper, Embedded

Why Feature Selection Matters

More features are not always better:

Irrelevant features add noise → higher variance → worse generalization
Correlated features make coefficients unstable and hard to interpret in coefficient-based models
Many features increase computational cost
Fewer features → simpler model → easier to interpret and explain

Filter Methods: No Model Needed

Correlation with the Target

Python

import pandas as pd
import numpy as np

# Select features based on Pearson correlation with target (for continuous targets)
def select_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list[str]:
    correlations = X.corrwith(y).abs()
    return correlations[correlations >= threshold].sort_values(ascending=False).index.tolist()

# Example: predicting INR from patient features
selected = select_by_correlation(X_df, y_inr, threshold=0.1)
print("Selected by correlation:", selected)
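
The call above assumes `X_df` and `y_inr` already exist. As a minimal, self-contained sketch, the same function can be exercised on synthetic data. The column names (`dose_mg`, `age_years`, `random_noise`) and the coefficients used to build the target are invented for illustration and do not come from any real INR dataset.

```python
import numpy as np
import pandas as pd


def select_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list[str]:
    """Return columns whose |Pearson r| with y is at least `threshold`, strongest first."""
    correlations = X.corrwith(y).abs()
    return correlations[correlations >= threshold].sort_values(ascending=False).index.tolist()


rng = np.random.default_rng(0)
n = 5000

# Hypothetical features: two drive the target, one is pure noise
X_toy = pd.DataFrame({
    "dose_mg": rng.normal(size=n),
    "age_years": rng.normal(size=n),
    "random_noise": rng.normal(size=n),
})

# Hypothetical continuous target: strong effect of dose, weaker effect of age
y_toy = 0.8 * X_toy["dose_mg"] + 0.3 * X_toy["age_years"] + rng.normal(scale=0.5, size=n)

selected_toy = select_by_correlation(X_toy, y_toy, threshold=0.1)
print("Selected by correlation:", selected_toy)
```

With this many rows, the noise column's sample correlation stays well below the 0.1 threshold, so only the two informative columns are returned. On small samples a pure-noise feature can clear a low threshold by chance, which is one reason to treat this filter as a first pass only.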

Mutual Information

Python

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information captures non-linear relationships (unlike Pearson correlation),
# which helps on clinical data where many relationships are non-linear.

# For classification (for a continuous target, use mutual_info_regression instead)
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
feature_mi = sorted(zip(feature_names, mi_scores), key=lambda x: x[1], reverse=True)

print("Features ranked by mutual information:")
for name, score in feature_mi:
    print(f"  {name:<25}: {score:.4f}")

# Select top k by mutual information
selector_mi = SelectKBest(mutual_info_classif, k=15)
X_selected = selector_mi.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("Selected features:", selected_features)
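
To make the contrast with correlation concrete, here is a small sketch on synthetic data where the target depends on the feature through a symmetric, non-linear relationship (`y = x²` plus a little noise). The data and the noise level are assumptions chosen for illustration, and `mutual_info_regression` is used because this toy target is continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# One feature, symmetric around zero, with a purely quadratic effect on the target
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=2000)

# Pearson correlation only measures the linear part of the relationship
pearson_r = np.corrcoef(x, y)[0, 1]

# Mutual information is estimated with nearest neighbours and is not limited to linear effects
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r:          {pearson_r:+.3f}")
print(f"Mutual information: {mi:.3f}")
```

The correlation comes out close to zero even though `y` is almost fully determined by `x`, while the mutual information estimate is clearly positive. A correlation filter would discard this feature; a mutual information filter would keep it.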

Variance Threshold

Python

from sklearn.feature_selection import VarianceThreshold

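# Remove features with near-zero variance (they carry almost no signal).
# Fit on the unscaled features: after standardization every non-constant
# column has variance 1, so the threshold would never remove anything.
# Variance depends on units, so choose the threshold with the raw scales in mind.
vt = VarianceThreshold(threshold=0.01)   # Keep only features with variance above 0.01
X_var_filtered = vt.fit_transform(X_train)
n_removed = X_train.shape[1] - X_var_filtered.shape[1]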
print(f"Removed {n_removed} low-variance features")
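
One thing to watch with this filter is the order of operations relative to scaling. The sketch below uses a tiny hand-made array (an assumption for illustration) to show that the same threshold drops a near-constant column on raw data but keeps it once the data has been standardized.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Column 0 is almost constant; columns 1 and 2 vary meaningfully
X_toy = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 3.0, 30.0],
    [0.1, 1.0, 15.0],
])

# On raw data, column 0 (variance 0.001875) falls below the threshold
vt_raw = VarianceThreshold(threshold=0.01).fit(X_toy)
kept_raw = vt_raw.get_support().tolist()
print("Kept on raw data:   ", kept_raw)      # [False, True, True]

# After standardization every non-constant column has variance 1,
# so the same threshold no longer removes anything
X_toy_scaled = StandardScaler().fit_transform(X_toy)
vt_scaled = VarianceThreshold(threshold=0.01).fit(X_toy_scaled)
kept_scaled = vt_scaled.get_support().tolist()
print("Kept on scaled data:", kept_scaled)   # [True, True, True]
```

In a pipeline this usually means placing the variance filter before `StandardScaler`, not after it.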

Wrapper Methods: Use a Model to Select

Recursive Feature Elimination (RFE)

Python

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# RFE: train model, remove weakest feature, repeat until k features remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    step=1,
)
rfe.fit(X_train_scaled, y_train)
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("RFE selected features:", selected)

# RFECV: uses cross-validation to find the optimal number of features automatically
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=cv,
    scoring="roc_auc",
    min_features_to_select=5,
)
rfecv.fit(X_train_scaled, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
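
The "train, remove weakest, repeat" loop behind RFE can be sketched by hand. This is a simplified illustration for binary classification that ranks features by absolute coefficient size, not scikit-learn's actual implementation; the helper name `manual_rfe` and the synthetic dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def manual_rfe(X, y, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until n_features_to_select remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_select:
        # Refit on the surviving columns only
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # Position (within `remaining`) of the least important feature
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]
    return remaining


# Synthetic binary problem: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
# Standardize so coefficient magnitudes are comparable across features
X_demo = StandardScaler().fit_transform(X_demo)

kept = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("Kept column indices:", kept)
```

The loop refits the model once per removed feature, which is why wrapper methods become expensive as the number of features grows, and why RFECV (which repeats this inside every CV fold) is slower still.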

Embedded Methods: Selection During Training

L1 Regularization (Lasso)

Python

from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.feature_selection import SelectFromModel

# L1 penalty drives some coefficients to exactly zero
# Remaining non-zero features = selected features

lr_l1 = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", max_iter=1000)
lr_l1.fit(X_train_scaled, y_train)

# Which features survived L1?
nonzero = [(name, coef) for name, coef in zip(feature_names, lr_l1.coef_[0]) if coef != 0]
print("Non-zero L1 features:")
for name, coef in sorted(nonzero, key=lambda x: abs(x[1]), reverse=True):
    print(f"  {name:<25}: {coef:+.4f}")

# Use SelectFromModel to get a transformer
selector_l1 = SelectFromModel(lr_l1, prefit=True)
X_l1_selected = selector_l1.transform(X_train_scaled)
print(f"Features after L1 selection: {X_l1_selected.shape[1]}")
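
How many features survive depends heavily on `C`, the inverse regularization strength. The following sketch counts non-zero coefficients for a few values of `C` on a synthetic dataset; the dataset and the specific `C` grid are assumptions for illustration, and in practice `C` would be tuned by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: 20 features, 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
# L1 penalizes coefficient size, so features should share a common scale
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(
        C=C, penalty="l1", solver="liblinear", random_state=0, max_iter=1000
    )
    model.fit(X_demo, y_demo)
    # Smaller C means a stronger penalty and more coefficients at exactly zero
    nonzero_counts[C] = int(np.count_nonzero(model.coef_[0]))
    print(f"C={C:<5}: {nonzero_counts[C]} non-zero coefficients")
```

A very small `C` can zero out every coefficient, including those of genuinely useful features, so the number of selected features is best read alongside validation performance.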

Tree Feature Importances

Python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=feature_names)
importances_sorted = importances.sort_values(ascending=False)

print("Top 10 features by Random Forest importance:")
for name, imp in importances_sorted.head(10).items():
    bar = "█" * int(imp * 200)
    print(f"  {name:<25}: {imp:.4f}  {bar}")

# Select features above a threshold
threshold = importances.mean()
selected = importances[importances >= threshold].index.tolist()
print(f"\nSelected {len(selected)} features above mean importance")
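
The mean-importance cutoff above can also be expressed as a transformer, which makes it usable inside a `Pipeline`. The sketch below uses `SelectFromModel` with `threshold="mean"` on a synthetic dataset (an assumption for illustration) and checks that it keeps the same columns as the manual rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic binary problem: 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# threshold="mean" keeps features whose importance is at least the mean importance
selector_rf = SelectFromModel(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    threshold="mean",
)
selector_rf.fit(X_demo, y_demo)

# Reproduce the manual rule using the forest fitted inside the selector
importances_demo = selector_rf.estimator_.feature_importances_
manual_mask = importances_demo >= importances_demo.mean()
selector_mask = selector_rf.get_support()

X_reduced = selector_rf.transform(X_demo)
print("Same columns kept:", bool(np.array_equal(manual_mask, selector_mask)))
print("Shape after selection:", X_reduced.shape)
```

Because the selector is an ordinary scikit-learn transformer, it is refit on the training portion of each fold when placed inside a cross-validated pipeline.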

Validating Feature Selection

Python

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# WRONG: select features on all data, then cross-validate
# (leaks test fold information into feature selection)
selector = SelectKBest(mutual_info_classif, k=15)
X_selected = selector.fit_transform(X, y)   # Uses all data including test folds
scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)  # BIASED

# CORRECT: feature selection inside the pipeline
pipeline = Pipeline([
    ("selector", SelectKBest(mutual_info_classif, k=15)),
    ("scaler",   StandardScaler()),
    ("model",    LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Correct CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
# In each fold, selector fits only on training portion of that fold
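
The size of this bias is easiest to see on data with no signal at all. The sketch below generates pure-noise features and random labels, then compares the two approaches. It uses `f_classif` as the scoring function to keep the run fast, and the sample size, feature count, and `k` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# 100 samples, 2000 pure-noise features, balanced labels unrelated to the features
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.permutation(np.repeat([0, 1], 50))

# Leaky: pick the 20 "best" features using all rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5)

# Honest: selection is refit on the training portion of every fold
honest_pipeline = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_scores = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5)

print(f"Leaky accuracy:  {leaky_scores.mean():.3f}")
print(f"Honest accuracy: {honest_scores.mean():.3f}")
```

There is nothing to learn here, so an unbiased estimate should sit near 0.5. The leaky version typically reports accuracy well above chance because the selected features were chosen with knowledge of the held-out labels.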

Comparing Selection Methods

Python

from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

k = 15  # Number of features to select

methods = {
    "All features":    Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "F-score top 15":  Pipeline([("sel", SelectKBest(f_classif, k=k)),   ("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "MI top 15":       Pipeline([("sel", SelectKBest(mutual_info_classif, k=k)), ("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "RFE 15":          Pipeline([("scaler", StandardScaler()), ("sel", RFE(LogisticRegression(max_iter=500), n_features_to_select=k)), ("model", LogisticRegression(max_iter=1000))]),
    "L1 selection":    Pipeline([("scaler", StandardScaler()), ("sel", SelectFromModel(LogisticRegression(C=0.1, penalty="l1", solver="liblinear"))), ("model", LogisticRegression(max_iter=1000))]),
}

for name, pipe in methods.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name:<20}: {scores.mean():.3f} ± {scores.std():.3f}")

Feature Selection Method Comparison

| Method | Type | Considers Model | Handles Non-Linear | Speed |
|---|---|---|---|---|
| Variance Threshold | Filter | No | No | Fast |
| Correlation | Filter | No | No (linear only) | Fast |
| Mutual Information | Filter | No | Yes | Fast |
| RFE | Wrapper | Yes | Depends on model | Slow |
| RFECV | Wrapper | Yes | Depends on model | Very slow |
| L1 / Lasso | Embedded | Yes | No (linear) | Moderate |
| Tree Importance | Embedded | Yes | Yes | Moderate |

Interview Answer Template

Q: How do you approach feature selection?

Feature selection is important because irrelevant features add noise, increase variance, and slow down training without improving generalization. I use three categories of methods depending on the situation. Filter methods (mutual information, correlation) are fast and model-agnostic — a good first pass. Wrapper methods like RFE use a model iteratively to select features — more accurate but computationally expensive. Embedded methods like L1 regularization and tree feature importance select features during training — a practical middle ground. The critical mistake to avoid: selecting features before cross-validation, which leaks test fold information into the training process and inflates performance estimates. I always put feature selection inside a sklearn Pipeline so it fits only on training data within each fold. For clinical ML, I also check selected features for clinical plausibility — a feature that's statistically informative but clinically nonsensical is likely spurious.
