Which Algorithms Need Feature Scaling?
A definitive guide to which ML algorithms require feature scaling, which don't, and why, with code demonstrating the impact, scaling recommendations per algorithm, and a quick reference table.
The Core Rule
Feature scaling matters when an algorithm computes distances between points or uses gradient descent for optimization. It is irrelevant for algorithms that make decisions based on threshold comparisons.
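To make the distance half of this rule concrete, here is a minimal sketch using synthetic values. The feature names and ranges (age 20 to 80, creatinine 0.5 to 10) are illustrative assumptions, and `squared_distance_share` is a hypothetical helper, not a library function. It measures how much of the total squared Euclidean distance between all pairs of points comes from each feature, before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features drawn uniformly over the assumed ranges
age = rng.uniform(20, 80, size=n)
creatinine = rng.uniform(0.5, 10, size=n)
X = np.column_stack([age, creatinine])


def squared_distance_share(X):
    """Fraction of the total squared pairwise distance contributed by each feature."""
    # diffs[i, j, k] = X[i, k] - X[j, k]
    diffs = X[:, None, :] - X[None, :, :]
    per_feature = (diffs ** 2).sum(axis=(0, 1))
    return per_feature / per_feature.sum()


share_raw = squared_distance_share(X)

# Standardize each feature to zero mean and unit variance (the StandardScaler transform)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
share_std = squared_distance_share(X_std)

print("raw share [age, creatinine]:", share_raw.round(3))
print("std share [age, creatinine]:", share_std.round(3))
```

On the raw data, age accounts for the overwhelming majority of the squared distance simply because its range is wider. After standardization the two features contribute equally.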
Algorithms That REQUIRE Scaling
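The snippets in this guide refer to `X_raw`, `X`, and `y` without defining them. To run them end to end, one option is a synthetic stand-in like the sketch below. Everything in it is an assumption made for illustration: the two features, their ranges, and the rule that generates the labels do not come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 500

# Hypothetical features on very different scales
age = rng.uniform(20, 80, size=n_samples)          # years
creatinine = rng.uniform(0.5, 10, size=n_samples)  # mg/dL

# Hypothetical label driven mostly by the small-range feature (creatinine), plus noise
logit = 0.8 * (creatinine - 5.0) + 0.01 * (age - 50.0) + rng.normal(0, 1, size=n_samples)
y = (logit > 0).astype(int)

X_raw = np.column_stack([age, creatinine])
X = X_raw  # the snippets use both names for the unscaled feature matrix
```

Because the label depends mainly on the feature with the smaller range, this setup tends to make the effect of scaling visible for distance-based models.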
k-Nearest Neighbors (k-NN)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
# k-NN: distance between patients
# X_raw (unscaled feature matrix) and y (binary labels) are assumed to be defined already
# age: 20-80, creatinine: 0.5-10 (very different ranges)
# Without scaling, "nearest" neighbors are determined almost entirely by age
knn = KNeighborsClassifier(n_neighbors=5)
scores_raw = cross_val_score(knn, X_raw, y, cv=5, scoring="roc_auc")
knn_scaled = Pipeline([("std", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
scores_scaled = cross_val_score(knn_scaled, X_raw, y, cv=5, scoring="roc_auc")
print(f"k-NN raw: {scores_raw.mean():.3f}")
print(f"k-NN scaled: {scores_scaled.mean():.3f}")
# The gain depends on the data; it can be large when feature ranges differ this much
SVM and SVC
from sklearn.svm import SVC
# SVM maximizes margin in feature space, and margin width depends on feature scale
# Large-scale features dominate the margin
svm_raw = SVC(kernel="rbf")
svm_scaled = Pipeline([("std", StandardScaler()), ("svm", SVC(kernel="rbf"))])
scores_raw = cross_val_score(svm_raw, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(svm_scaled, X, y, cv=5, scoring="accuracy")
print(f"SVM raw: {scores_raw.mean():.3f}")
print(f"SVM scaled: {scores_scaled.mean():.3f}")
Logistic Regression and Linear Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
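# Gradient-based solvers take steps dominated by features with large values,
# so convergence is slow if features are not scaled
# (the unscaled model may emit a ConvergenceWarning at max_iter=100)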
lr_raw = LogisticRegression(max_iter=100)
lr_scaled = Pipeline([("std", StandardScaler()), ("lr", LogisticRegression(max_iter=100))])
scores_raw = cross_val_score(lr_raw, X, y, cv=5)
scores_scaled = cross_val_score(lr_scaled, X, y, cv=5)
print(f"LR raw (100 iter): {scores_raw.mean():.3f}")
print(f"LR scaled (100 iter): {scores_scaled.mean():.3f}")
# Scaled version typically converges in far fewer iterations
Neural Networks
import torch
import torch.nn as nn
# Large input values -> large activations -> large gradients -> unstable training
# Scaled inputs keep activations in the working range of ReLU/sigmoid/tanh
class DrugClassifier(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# Always scale before feeding to a neural network
# StandardScaler or MinMaxScaler: both are commonly used
PCA (Principal Component Analysis)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# PCA finds directions of maximum variance
# Without scaling: high-variance features (large scale) dominate first components
# After scaling: every feature has unit variance, so components reflect correlations rather than units
pca_raw = PCA(n_components=3).fit(X_raw)
pca_scaled = PCA(n_components=3).fit(StandardScaler().fit_transform(X_raw))
print("Explained variance (raw):", pca_raw.explained_variance_ratio_.round(3))
print("Explained variance (std):", pca_scaled.explained_variance_ratio_.round(3))
# Raw: first component captures most of the variance (e.g. ~95%) because age dominates variance
# Std: variance spread more evenly across components
Algorithms That Do NOT Require Scaling
Decision Trees and Random Forests
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Decision trees split on thresholds: "age > 55"
# Standardizing age rescales the threshold value, but the same samples fall on each side of the split
# Splitting criterion (Gini, entropy) is computed per feature independently
# Demonstration: identical results with and without scaling
dt_raw = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled = Pipeline([("std", StandardScaler()), ("dt", DecisionTreeClassifier(max_depth=5, random_state=42))])
scores_raw = cross_val_score(dt_raw, X, y, cv=5, scoring="roc_auc")
scores_scaled = cross_val_score(dt_scaled, X, y, cv=5, scoring="roc_auc")
print(f"Decision Tree raw: {scores_raw.mean():.3f}")
print(f"Decision Tree scaled: {scores_scaled.mean():.3f}")
# Should be identical (or differ only by floating-point noise)
Gradient Boosting (XGBoost, LightGBM, sklearn GBM)
from sklearn.ensemble import GradientBoostingClassifier
# Gradient boosting is built on decision trees: no distance, no gradient over inputs
# Feature scale is irrelevant
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
scores = cross_val_score(gbm, X_raw, y, cv=5, scoring="roc_auc")
print(f"GBM (no scaling needed): {scores.mean():.3f}")
# Scale the features -> identical performance
Naive Bayes
from sklearn.naive_bayes import GaussianNB
# Naive Bayes models each feature's distribution independently
# Scaling changes the fitted parameters but leaves predictions essentially unchanged
# (GaussianNB fits μ and σ per class per feature, so the scale is absorbed)
Summary Table
| Algorithm | Needs Scaling | Reason | Recommended Scaler |
|---|---|---|---|
| k-NN | Yes | Distance-based | StandardScaler |
| k-Means | Yes | Distance-based | StandardScaler |
| SVM (linear, RBF) | Yes | Distance/margin | StandardScaler |
| Logistic Regression | Yes | Gradient descent | StandardScaler |
| Linear Regression | Yes, when fit by gradient descent | Gradient descent (closed-form OLS predictions are unaffected) | StandardScaler |
| Ridge / Lasso | Yes | Penalty on weights | StandardScaler |
| Neural Networks | Yes | Gradient magnitude | StandardScaler or MinMaxScaler |
| PCA | Yes | Variance-based | StandardScaler |
| Decision Tree | No | Threshold splits | None |
| Random Forest | No | Threshold splits | None |
| Gradient Boosting | No | Threshold splits | None |
| XGBoost / LightGBM | No | Threshold splits | None |
| Naive Bayes | No | Per-feature distribution | None |
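The "Penalty on weights" reason for Ridge and Lasso is the one entry in the table without a demonstration, so here is a hedged sketch of it. The data is synthetic and the dose feature is a made-up example. The same quantity is recorded once in milligrams and once in grams. An unscaled Ridge model treats these very differently, because the gram version needs a coefficient 1000 times larger and the L2 penalty shrinks it far more.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical dose in milligrams, and a target that depends linearly on it
dose_mg = rng.uniform(1, 10, size=n)
y = 3.0 * dose_mg + rng.normal(0, 1, size=n)

X_mg = dose_mg.reshape(-1, 1)
X_g = X_mg / 1000.0  # identical information, different units

# Unscaled: the penalty shrinks the large coefficient needed for grams much harder
pred_mg = Ridge(alpha=1.0).fit(X_mg, y).predict(X_mg)
pred_g = Ridge(alpha=1.0).fit(X_g, y).predict(X_g)

# Scaled inside a pipeline: both unit choices standardize to the same values
pred_mg_std = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_mg, y).predict(X_mg)
pred_g_std = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_g, y).predict(X_g)

print("max |difference| unscaled:", np.abs(pred_mg - pred_g).max())
print("max |difference| scaled:  ", np.abs(pred_mg_std - pred_g_std).max())
```

With standardization the choice of units stops mattering, which is the practical argument for scaling before any penalized linear model.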
Practical Rule for Interviews
Distance or gradient? -> Scale (StandardScaler as default).
Tree-based splits? -> No scaling needed.
Unsure? -> Scale anyway: it never hurts tree-based models and helps everything else.
One Common Mistake: Scaling Categorical Encoded Features
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# WRONG: applying StandardScaler to one-hot encoded features
# StandardScaler on binary {0,1} features: creates fractional values that lose meaning
# CORRECT: scale only numeric features, leave encoded categoricals alone
numeric_features = ["age", "weight_kg", "serum_creatinine"]
categorical_features = ["gender", "discharge_to"]
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(), categorical_features),
])
# This scales numeric features and one-hot encodes categoricals, with no overlap
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
Interview Answer Template
Q: Which ML algorithms need feature scaling and which don't?
Algorithms that compute distances (like k-NN and k-Means) or that use gradient descent (like logistic regression, SVMs, and neural networks) are sensitive to feature scale. Without scaling, features with large magnitudes dominate distance calculations or receive disproportionately large gradient updates, making other features nearly irrelevant. PCA also requires scaling because it finds directions of maximum variance; large-scale features would otherwise capture the first component entirely. Tree-based algorithms (decision trees, random forests, gradient boosting) are scale-invariant because they split on threshold comparisons per feature independently. Adding a StandardScaler to a random forest pipeline is harmless but unnecessary. My default: use StandardScaler inside a Pipeline for any distance-based or gradient-based model, and skip it for tree-based models.
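The default described in this answer could be sketched as a small helper. `build_pipeline` and the `TREE_BASED` tuple are hypothetical names introduced here for illustration, and the list of tree-based estimators is an assumption that would need extending for other libraries such as XGBoost or LightGBM.

```python
from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical allow-list of scale-invariant estimators; extend as needed
TREE_BASED = (DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier)


def build_pipeline(model: BaseEstimator) -> Pipeline:
    """Wrap `model` in a Pipeline, adding StandardScaler unless it is tree-based."""
    steps = []
    if not isinstance(model, TREE_BASED):
        steps.append(("std", StandardScaler()))
    steps.append(("model", model))
    return Pipeline(steps)


knn_pipe = build_pipeline(KNeighborsClassifier(n_neighbors=5))
rf_pipe = build_pipeline(RandomForestClassifier(random_state=42))

print([name for name, _ in knn_pipe.steps])  # ['std', 'model']
print([name for name, _ in rf_pipe.steps])   # ['model']
```

Keeping the scaler inside the Pipeline means it is refit on the training portion of each cross-validation fold, so statistics from the held-out fold do not leak into preprocessing.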