Machine Learning Foundations · Lesson 39 of 70
Feature Selection: Filter, Wrapper, Embedded
Why Feature Selection Matters
More features are not always better:
- Irrelevant features add noise → higher variance → worse generalization
- Correlated features make coefficients unstable and hard to interpret in coefficient-based models
- Many features increase computational cost
- Fewer features → simpler model → easier to interpret and explain
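The first point can be illustrated with a small experiment. This is a minimal sketch on synthetic data, not part of the lesson's dataset: the names `X_info`, `X_noisy`, and `y_demo` are hypothetical, and a k-nearest-neighbours classifier is used here only because distance-based models make the effect of irrelevant columns easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 5 informative features only
X_info, y_demo = make_classification(
    n_samples=150, n_features=5, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Append 1000 pure-noise columns that carry no information about y_demo
rng = np.random.default_rng(0)
X_noisy = np.hstack([X_info, rng.normal(size=(150, 1000))])

# Same model, same folds; only the feature set differs
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
acc_clean = cross_val_score(model, X_info, y_demo, cv=5).mean()
acc_noisy = cross_val_score(model, X_noisy, y_demo, cv=5).mean()

print(f"5 informative features only: {acc_clean:.3f}")
print(f"plus 1000 noise features:    {acc_noisy:.3f}")
```

The exact numbers depend on the seed and the model, but the noisy feature set should score clearly lower because the noise columns dominate the distance computation.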
Filter Methods: No Model Needed
Correlation with the Target
import pandas as pd
import numpy as np

# Select features based on Pearson correlation with target (for continuous targets)
def select_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list[str]:
    correlations = X.corrwith(y).abs()
    return correlations[correlations >= threshold].sort_values(ascending=False).index.tolist()

# Example: predicting INR from patient features
selected = select_by_correlation(X_df, y_inr, threshold=0.1)
print("Selected by correlation:", selected)

Mutual Information
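Pearson correlation only measures linear association, which is the main reason to reach for mutual information. The following is a minimal sketch on synthetic data (the column names and the target formula are assumptions made up for illustration) in which one feature drives the target through a squared term.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000

# Three hypothetical features, all uniform on [-1, 1]
X_demo = pd.DataFrame({
    "linear": rng.uniform(-1, 1, n),
    "quadratic": rng.uniform(-1, 1, n),
    "noise": rng.uniform(-1, 1, n),
})

# Target depends linearly on one feature and quadratically on another
y_demo = X_demo["linear"] + 3 * X_demo["quadratic"] ** 2 + rng.normal(0, 0.1, n)

corr_demo = X_demo.corrwith(y_demo).abs()
mi_demo = pd.Series(
    mutual_info_regression(X_demo, y_demo, random_state=0),
    index=X_demo.columns,
)

print(pd.DataFrame({"abs_pearson": corr_demo, "mutual_info": mi_demo}).round(3))
```

In this setup the `quadratic` column should show a near-zero correlation but a clearly positive mutual information score, so a correlation threshold would discard it while a mutual information ranking would keep it.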
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

# Mutual information: captures non-linear relationships (unlike correlation)
# Better for clinical data where relationships are non-linear

# For classification
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
feature_mi = sorted(zip(feature_names, mi_scores), key=lambda x: x[1], reverse=True)
print("Features ranked by mutual information:")
for name, score in feature_mi:
    print(f" {name:<25}: {score:.4f}")

# Select top k by mutual information
selector_mi = SelectKBest(mutual_info_classif, k=15)
X_selected = selector_mi.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("Selected features:", selected_features)

Variance Threshold
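A variance threshold is sensitive to the scale of each feature, so it matters whether it is applied before or after standardization. The sketch below uses three made-up columns (an assumption for illustration) to show that `StandardScaler` gives every non-constant column unit variance, which leaves the threshold able to catch only truly constant columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical columns: wide spread, near-constant, exactly constant
X_raw = np.column_stack([
    rng.normal(0, 5.0, n),    # variance around 25
    rng.normal(0, 0.05, n),   # variance around 0.0025
    np.full(n, 3.0),          # variance 0
])

# On raw data the near-constant and constant columns are both dropped
kept_raw = VarianceThreshold(threshold=0.01).fit(X_raw).get_support()

# After standardization the near-constant column has variance 1 and survives
X_std = StandardScaler().fit_transform(X_raw)
kept_std = VarianceThreshold(threshold=0.01).fit(X_std).get_support()

print("Kept on raw data:         ", kept_raw.tolist())
print("Kept on standardized data:", kept_std.tolist())
```

If features are on very different natural scales, one option is to apply the threshold to data rescaled to a common range (for example with min-max scaling) rather than to standardized data.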
from sklearn.feature_selection import VarianceThreshold

# Remove features with near-zero variance (they carry almost no signal).
# Fit on the unstandardized features: if the data has been passed through
# StandardScaler, every non-constant column already has variance 1.
vt = VarianceThreshold(threshold=0.01)  # Drop features whose variance does not exceed 0.01
X_var_filtered = vt.fit_transform(X_train)
n_removed = X_train.shape[1] - X_var_filtered.shape[1]
print(f"Removed {n_removed} low-variance features")

Wrapper Methods: Use a Model to Select
Recursive Feature Elimination (RFE)
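Recursive feature elimination fits a model, drops the weakest feature, and repeats until the requested number of features remains. The loop can be written by hand to make that explicit. This is a minimal sketch on synthetic data, assuming "weakest" means the smallest absolute logistic regression coefficient; `manual_rfe` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)


def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop one feature per round until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # Position (within `remaining`) of the smallest absolute coefficient
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]
    return remaining


manual = manual_rfe(X_demo, y_demo, n_keep=4)

# Same procedure via scikit-learn, for comparison
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
rfe_demo.fit(X_demo, y_demo)
sklearn_rfe = np.flatnonzero(rfe_demo.support_).tolist()

print("Manual loop keeps:", manual)
print("sklearn RFE keeps:", sklearn_rfe)
```

Each round refits the model on the surviving columns, which is why the cost grows with the number of features and why coefficient-based ranking needs the features on a comparable scale.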
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# RFE: train model, remove weakest feature, repeat until k features remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    step=1,
)
rfe.fit(X_train_scaled, y_train)
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("RFE selected features:", selected)

# RFECV: uses cross-validation to find the optimal number of features automatically
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=cv,
    scoring="roc_auc",
    min_features_to_select=5,
)
rfecv.fit(X_train_scaled, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")

Embedded Methods: Selection During Training
L1 Regularization (Lasso)
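How many features an L1 penalty keeps depends on the regularization strength. In scikit-learn's `LogisticRegression`, `C` is the inverse of that strength, so smaller values of `C` zero out more coefficients. The sketch below counts surviving coefficients for a few values of `C` on synthetic data; the dataset and the grid of `C` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data: 5 informative features out of 30
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, penalty="l1", solver="liblinear", max_iter=1000)
    clf.fit(X_demo, y_demo)
    # Count coefficients the penalty did not shrink to exactly zero
    nonzero_counts[C] = int(np.sum(clf.coef_[0] != 0))
    print(f"C={C:<5}: {nonzero_counts[C]} non-zero coefficients")
```

Because the size of the selected set is controlled by `C`, it is usually worth tuning `C` with cross-validation rather than fixing it by hand.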
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.feature_selection import SelectFromModel

# L1 penalty drives some coefficients to exactly zero
# Remaining non-zero features = selected features
lr_l1 = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", max_iter=1000)
lr_l1.fit(X_train_scaled, y_train)

# Which features survived L1?
nonzero = [(name, coef) for name, coef in zip(feature_names, lr_l1.coef_[0]) if coef != 0]
print("Non-zero L1 features:")
for name, coef in sorted(nonzero, key=lambda x: abs(x[1]), reverse=True):
    print(f" {name:<25}: {coef:+.4f}")

# Use SelectFromModel to get a transformer
selector_l1 = SelectFromModel(lr_l1, prefit=True)
X_l1_selected = selector_l1.transform(X_train_scaled)
print(f"Features after L1 selection: {X_l1_selected.shape[1]}")

Tree Feature Importances
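Importance-based selection can also be expressed as a transformer, which makes it usable inside a `Pipeline`. A minimal sketch, assuming synthetic data and a mean-importance cutoff: `SelectFromModel` accepts `threshold="mean"` and keeps the features whose importance is at least the average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption for illustration): 5 informative features out of 20
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# The selector fits its own copy of the forest and thresholds its importances
selector_rf = SelectFromModel(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    threshold="mean",
)
X_rf_selected = selector_rf.fit_transform(X_demo, y_demo)

importances_demo = selector_rf.estimator_.feature_importances_
mask = selector_rf.get_support()

print(f"Mean importance: {importances_demo.mean():.4f}")
print(f"Kept {mask.sum()} of {mask.size} features:", np.flatnonzero(mask).tolist())
```

The fitted forest is available as `selector_rf.estimator_`, so the importances can still be inspected after selection.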
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import numpy as np
import pandas as pd

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances_sorted = importances.sort_values(ascending=False)
print("Top 10 features by Random Forest importance:")
for name, imp in importances_sorted.head(10).items():
    bar = "█" * int(imp * 200)
    print(f" {name:<25}: {imp:.4f} {bar}")

# Select features above a threshold
threshold = importances.mean()
selected = importances[importances >= threshold].index.tolist()
print(f"\nSelected {len(selected)} features above mean importance")

Validating Feature Selection
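The cost of selecting features outside cross-validation is easiest to see on data with no signal at all. This is a minimal sketch under stated assumptions: the features and labels are pure random noise, and `f_classif` is swapped in for mutual information only to keep the example fast. Any honest estimate should sit near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: 100 samples, 2000 features, labels unrelated to the features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: the selector sees every label, including those of future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector is refit on the training portion of each fold
honest_pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection outside CV: {leaky_acc:.3f}")
print(f"Selection inside CV:  {honest_acc:.3f}")
```

The leaky estimate should look far better than chance even though there is nothing to learn, because the 20 features were picked for their chance agreement with the very labels later used for testing.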
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# WRONG: select features on all data, then cross-validate
# (leaks test fold information into feature selection)
selector = SelectKBest(mutual_info_classif, k=15)
X_selected = selector.fit_transform(X, y)  # Uses all data including test folds
scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)  # BIASED

# CORRECT: feature selection inside the pipeline
pipeline = Pipeline([
    ("selector", SelectKBest(mutual_info_classif, k=15)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Correct CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
# In each fold, selector fits only on training portion of that fold

Comparing Selection Methods
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

k = 15  # Number of features to select
methods = {
    "All features": Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "F-score top 15": Pipeline([("sel", SelectKBest(f_classif, k=k)), ("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "MI top 15": Pipeline([("sel", SelectKBest(mutual_info_classif, k=k)), ("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
    "RFE 15": Pipeline([("scaler", StandardScaler()), ("sel", RFE(LogisticRegression(max_iter=500), n_features_to_select=k)), ("model", LogisticRegression(max_iter=1000))]),
    "L1 selection": Pipeline([("scaler", StandardScaler()), ("sel", SelectFromModel(LogisticRegression(C=0.1, penalty="l1", solver="liblinear"))), ("model", LogisticRegression(max_iter=1000))]),
}
for name, pipe in methods.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name:<20}: {scores.mean():.3f} ± {scores.std():.3f}")

Feature Selection Method Comparison
| Method | Type | Considers Model | Handles Non-Linear | Speed |
|---|---|---|---|---|
| Variance Threshold | Filter | No | No | Fast |
| Correlation | Filter | No | No (linear only) | Fast |
| Mutual Information | Filter | No | Yes | Fast |
| RFE | Wrapper | Yes | Depends on model | Slow |
| RFECV | Wrapper | Yes | Depends on model | Very slow |
| L1 / Lasso | Embedded | Yes | No (linear) | Moderate |
| Tree Importance | Embedded | Yes | Yes | Moderate |
Interview Answer Template
Q: How do you approach feature selection?
Feature selection is important because irrelevant features add noise, increase variance, and slow down training without improving generalization. I use three categories of methods depending on the situation. Filter methods (mutual information, correlation) are fast and model-agnostic — a good first pass. Wrapper methods like RFE use a model iteratively to select features — more accurate but computationally expensive. Embedded methods like L1 regularization and tree feature importance select features during training — a practical middle ground. The critical mistake to avoid: selecting features before cross-validation, which leaks test fold information into the training process and inflates performance estimates. I always put feature selection inside a sklearn Pipeline so it fits only on training data within each fold. For clinical ML, I also check selected features for clinical plausibility — a feature that's statistically informative but clinically nonsensical is likely spurious.
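Keeping selection inside a Pipeline has a further practical benefit: the number of selected features can be tuned like any other hyperparameter without leaking test-fold information. A minimal sketch, assuming synthetic data and an arbitrary grid of `k` values; the step name `selector` and the grid are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 6 informative features out of 40
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=6, n_redundant=0, random_state=0
)

pipe = Pipeline([
    ("selector", SelectKBest(f_classif)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "selector__k" addresses the k parameter of the step named "selector"
search = GridSearchCV(
    pipe,
    param_grid={"selector__k": [5, 10, 20, 40]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["selector__k"]
print(f"Best k: {best_k} (CV AUC {search.best_score_:.3f})")
```

The reported score is still optimistic as a final estimate because `k` was chosen on the same folds. A held-out test set or nested cross-validation gives a cleaner number.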