Feature Selection
This guide covers feature selection methods: filter methods (correlation, mutual information), wrapper methods (RFE), embedded methods (L1 regularization, tree importance), and how to choose and validate a feature selection approach.
Why Feature Selection Matters
More features are not always better:
- Irrelevant features add noise → higher variance → worse generalization
- Correlated features make the coefficients of linear models unstable and hard to interpret
- Many features increase computational cost
- Fewer features → simpler model → easier to interpret and explain
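The first point can be illustrated with a small synthetic experiment. This is a minimal sketch, not a benchmark: the data, the names `X_signal`, `X_noisy`, `y_demo`, and the choice of 500 noise columns are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical binary target and 5 informative features (class means differ by 1.0)
y_demo = np.repeat([0, 1], n_samples // 2)
X_signal = rng.normal(size=(n_samples, 5)) + y_demo[:, None] * 1.0

# Append 500 columns of pure noise that carry no information about y_demo
X_noisy = np.hstack([X_signal, rng.normal(size=(n_samples, 500))])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc_signal = cross_val_score(model, X_signal, y_demo, cv=5, scoring="roc_auc").mean()
auc_noisy = cross_val_score(model, X_noisy, y_demo, cv=5, scoring="roc_auc").mean()

print(f"AUC, informative features only: {auc_signal:.3f}")
print(f"AUC, with 500 noise features:   {auc_noisy:.3f}")
```

With far more columns than samples, the model can fit the noise in each training fold, so the cross-validated AUC with the noise columns should come out lower than with the informative features alone.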
Filter Methods: No Model Needed
Correlation with the Target
import pandas as pd
import numpy as np
# Select features based on Pearson correlation with target (for continuous targets)
def select_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list[str]:
    correlations = X.corrwith(y).abs()
    return correlations[correlations >= threshold].sort_values(ascending=False).index.tolist()
# Example: predicting INR from patient features
selected = select_by_correlation(X_df, y_inr, threshold=0.1)
print("Selected by correlation:", selected)
Mutual Information
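Pearson correlation only measures linear association, which is the main reason to consider mutual information. The sketch below uses a made-up quadratic relationship (the names `x_quadratic`, `x_noise`, and `y_toy` are illustrative assumptions) to show a feature that correlation would discard but mutual information would keep.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 2000

# Hypothetical features: one with a symmetric, non-linear effect and one pure noise
X_toy = pd.DataFrame({
    "x_quadratic": rng.uniform(-1, 1, size=n_samples),
    "x_noise": rng.normal(size=n_samples),
})
# Target depends on x_quadratic only through its square
y_toy = pd.Series(X_toy["x_quadratic"] ** 2 + rng.normal(scale=0.05, size=n_samples))

pearson = X_toy.corrwith(y_toy).abs()
mi = pd.Series(
    mutual_info_regression(X_toy, y_toy, random_state=0),
    index=X_toy.columns,
)

print(pd.DataFrame({"abs_pearson": pearson, "mutual_info": mi}))
```

Because the relationship is symmetric around zero, the Pearson correlation of `x_quadratic` with the target is close to zero, while its mutual information score is clearly higher than that of the noise column.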
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np
# Mutual information: captures non-linear relationships (unlike correlation)
# Better for clinical data where relationships are non-linear
# For classification
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
feature_mi = sorted(zip(feature_names, mi_scores), key=lambda x: x[1], reverse=True)
print("Features ranked by mutual information:")
for name, score in feature_mi:
    print(f"  {name:<25}: {score:.4f}")
# Select top k by mutual information
selector_mi = SelectKBest(mutual_info_classif, k=15)
X_selected = selector_mi.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("Selected features:", selected_features)
Variance Threshold
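A variance threshold depends on the units of each feature, so it matters whether it is applied before or after scaling. The following sketch uses made-up columns (`X_toy` is an assumption for illustration) to compare the two cases.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: wide spread, tiny spread, and a constant column
X_toy = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=500),  # variance around 100
    rng.normal(loc=1.0, scale=0.05, size=500),   # variance around 0.0025
    np.full(500, 3.0),                           # variance exactly 0
])

# On raw values, both the tiny-variance and the constant column fall below 0.01
raw_mask = VarianceThreshold(threshold=0.01).fit(X_toy).get_support()

# After standardization every non-constant column has variance 1,
# so only the constant column is removed
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_mask = VarianceThreshold(threshold=0.01).fit(X_toy_scaled).get_support()

print("Kept on raw data:   ", raw_mask)
print("Kept on scaled data:", scaled_mask)
```

In practice this suggests choosing the threshold with the feature units in mind, or applying it before standardization.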
from sklearn.feature_selection import VarianceThreshold
# Remove features with near-zero variance (they carry almost no signal)
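# Fit on features in their original units: standardized features all have
# variance 1, so the threshold would only remove constant columns
vt = VarianceThreshold(threshold=0.01)  # Remove if variance <= 0.01
X_var_filtered = vt.fit_transform(X_train)
n_removed = X_train.shape[1] - X_var_filtered.shape[1]
print(f"Removed {n_removed} low-variance features")
Wrapper Methods: Use a Model to Select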
Recursive Feature Elimination (RFE)
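To make the procedure concrete, here is a hand-written version of the elimination loop, checked against scikit-learn's `RFE`. It is a sketch under assumptions: the helper `manual_rfe` and the synthetic data are hypothetical, and it only covers the binary, one-feature-per-step case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)


def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the smallest-|coefficient| feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # Position (within `remaining`) of the weakest feature in this round
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]
    return remaining


kept = manual_rfe(X_toy, y_toy, n_keep=3)

# Cross-check against scikit-learn's implementation with the same settings
rfe_check = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
sklearn_kept = np.flatnonzero(rfe_check.fit(X_toy, y_toy).support_).tolist()

print("Manual loop kept: ", kept)
print("sklearn RFE kept: ", sklearn_kept)
```

The loop refits the model after every removal, which is why wrapper methods become expensive as the number of features grows.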
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
# RFE: train model, remove weakest feature, repeat until k features remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    step=1,
)
rfe.fit(X_train_scaled, y_train)
# rfe.support_ is a boolean mask over the input features
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("RFE selected features:", selected)
# RFECV: uses cross-validation to find the optimal number of features automatically
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=cv,
    scoring="roc_auc",
    min_features_to_select=5,
)
rfecv.fit(X_train_scaled, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
Embedded Methods: Selection During Training
L1 Regularization (Lasso)
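How many features survive an L1 penalty depends on the regularization strength. The sketch below, on synthetic data chosen purely for illustration, counts non-zero coefficients for a few values of `C` (smaller `C` means a stronger penalty).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 5 informative features out of 30
X_toy, y_toy = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

nonzero_counts = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, penalty="l1", solver="liblinear", max_iter=1000)
    model.fit(X_toy, y_toy)
    # Count coefficients the penalty did not shrink to exactly zero
    nonzero_counts[C] = int(np.sum(model.coef_[0] != 0))

for C, count in nonzero_counts.items():
    print(f"C={C:<6}: {count} non-zero coefficients")
```

Since `C` controls how aggressive the selection is, it is usually tuned by cross-validation rather than fixed in advance.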
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.feature_selection import SelectFromModel
# L1 penalty drives some coefficients to exactly zero
# Remaining non-zero features = selected features
lr_l1 = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", max_iter=1000)
lr_l1.fit(X_train_scaled, y_train)
# Which features survived L1?
nonzero = [(name, coef) for name, coef in zip(feature_names, lr_l1.coef_[0]) if coef != 0]
print("Non-zero L1 features:")
for name, coef in sorted(nonzero, key=lambda x: abs(x[1]), reverse=True):
    print(f"  {name:<25}: {coef:+.4f}")
# Use SelectFromModel to get a transformer
selector_l1 = SelectFromModel(lr_l1, prefit=True)
X_l1_selected = selector_l1.transform(X_train_scaled)
print(f"Features after L1 selection: {X_l1_selected.shape[1]}")
Tree Feature Importances
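Tree ensembles expose one importance score per feature, and a common cutoff is the mean importance. One way to make that cutoff reusable inside a pipeline is `SelectFromModel` with `threshold="mean"`. The sketch below uses synthetic data and illustrative names (`selector_rf`, `manual_mask`) and checks the transformer against a manual comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data for illustration only
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# threshold="mean" keeps features whose importance is at least the mean importance
selector_rf = SelectFromModel(forest, threshold="mean")
X_rf_selected = selector_rf.fit_transform(X_toy, y_toy)

# Manual equivalent, using the forest that SelectFromModel fitted internally
importances_toy = selector_rf.estimator_.feature_importances_
manual_mask = importances_toy >= importances_toy.mean()

print(f"Kept {X_rf_selected.shape[1]} of {X_toy.shape[1]} features")
```

Because the selector is a transformer, it can be placed in a `Pipeline` so the forest is refit on each training fold.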
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import pandas as pd
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances_sorted = importances.sort_values(ascending=False)
print("Top 10 features by Random Forest importance:")
for name, imp in importances_sorted.head(10).items():
    bar = "█" * int(imp * 200)
    print(f"  {name:<25}: {imp:.4f} {bar}")
# Select features above a threshold
threshold = importances.mean()
selected = importances[importances >= threshold].index.tolist()
print(f"\nSelected {len(selected)} features above mean importance")
Validating Feature Selection
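The size of the bias from selecting features before cross-validation is easiest to see on data with no signal at all. In this sketch the features and labels are pure noise, so an honest estimate should be near 0.5 AUC. `f_classif` is swapped in for mutual information only to keep the example fast; all names and sizes here are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Pure noise: 100 samples, 2000 features, labels unrelated to the features
X_noise = rng.normal(size=(100, 2000))
y_noise = np.repeat([0, 1], 50)

# Leaky: pick the 20 "best" features using all rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5, scoring="roc_auc"
).mean()

# Honest: selection is refit inside each training fold
honest_pipeline = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(
    honest_pipeline, X_noise, y_noise, cv=5, scoring="roc_auc"
).mean()

print(f"Leaky AUC:  {leaky_auc:.3f}")
print(f"Honest AUC: {honest_auc:.3f}")
```

The leaky estimate should look far better than chance even though nothing is predictable, because the selected columns were chosen for their chance correlation with the held-out labels too.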
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# WRONG: select features on all data, then cross-validate
# (leaks test fold information into feature selection)
selector = SelectKBest(mutual_info_classif, k=15)
X_selected = selector.fit_transform(X, y) # Uses all data including test folds
scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5) # BIASED
# CORRECT: feature selection inside the pipeline
pipeline = Pipeline([
    ("selector", SelectKBest(mutual_info_classif, k=15)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Correct CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
# In each fold, selector fits only on training portion of that fold
Comparing Selection Methods
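When comparing selectors, the number of retained features `k` is a tuning choice as well. One option, sketched here on synthetic data with an arbitrary grid of values, is to treat `k` as a pipeline hyperparameter so it is chosen by the same leakage-free cross-validation. The step name `selector` and the grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_toy, y_toy = make_classification(
    n_samples=300, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

pipe = Pipeline([
    ("selector", SelectKBest(f_classif)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "selector__k" addresses the k parameter of the pipeline step named "selector"
search = GridSearchCV(
    pipe,
    param_grid={"selector__k": [5, 10, 20, 30]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

best_k = search.best_params_["selector__k"]
print(f"Best k: {best_k} (CV AUC {search.best_score_:.3f})")
```

The reported best score is itself optimistically biased by the search, so a final held-out test set or nested cross-validation is still needed for an unbiased estimate.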
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
k = 15 # Number of features to select
methods = {
    "All features": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "F-score top 15": Pipeline([
        ("sel", SelectKBest(f_classif, k=k)),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "MI top 15": Pipeline([
        ("sel", SelectKBest(mutual_info_classif, k=k)),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "RFE 15": Pipeline([
        ("scaler", StandardScaler()),
        ("sel", RFE(LogisticRegression(max_iter=500), n_features_to_select=k)),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "L1 selection": Pipeline([
        ("scaler", StandardScaler()),
        ("sel", SelectFromModel(LogisticRegression(C=0.1, penalty="l1", solver="liblinear"))),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
}
for name, pipe in methods.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name:<20}: {scores.mean():.3f} ± {scores.std():.3f}")
Feature Selection Method Comparison
| Method | Type | Considers Model | Handles Non-Linear | Speed |
|---|---|---|---|---|
| Variance Threshold | Filter | No | No | Fast |
| Correlation | Filter | No | No (linear only) | Fast |
| Mutual Information | Filter | No | Yes | Fast |
| RFE | Wrapper | Yes | Depends on model | Slow |
| RFECV | Wrapper | Yes | Depends on model | Very slow |
| L1 / Lasso | Embedded | Yes | No (linear) | Moderate |
| Tree Importance | Embedded | Yes | Yes | Moderate |
Interview Answer Template
Q: How do you approach feature selection?
Feature selection is important because irrelevant features add noise, increase variance, and slow down training without improving generalization. I use three categories of methods depending on the situation. Filter methods (mutual information, correlation) are fast and model-agnostic, which makes them a good first pass. Wrapper methods like RFE use a model iteratively to select features; they are more accurate but computationally expensive. Embedded methods like L1 regularization and tree feature importance select features during training and are a practical middle ground.
The critical mistake to avoid is selecting features before cross-validation, which leaks test fold information into the training process and inflates performance estimates. I always put feature selection inside a sklearn Pipeline so it fits only on training data within each fold.
For clinical ML, I also check selected features for clinical plausibility: a feature that's statistically informative but clinically nonsensical is likely spurious.