Which Algorithms Need Feature Scaling?

The Core Rule

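Feature scaling matters when an algorithm computes distances between points, penalizes the size of its coefficients, or is fit by gradient-based optimization. It is irrelevant for algorithms that decide by per-feature threshold comparisons, because rescaling a feature preserves the ordering of its values.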

Algorithms That REQUIRE Scaling

k-Nearest Neighbors (k-NN)

Python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

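# k-NN classifies by the distance between patients.
# age: 20–80, creatinine: 0.5–10, so the ranges differ by almost an order of magnitude.
# Without scaling, the "nearest" neighbors are determined almost entirely by age.
# X_raw (unscaled feature matrix) and y (binary labels) are assumed to be loaded already.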

knn = KNeighborsClassifier(n_neighbors=5)

scores_raw = cross_val_score(knn, X_raw, y, cv=5, scoring="roc_auc")

knn_scaled = Pipeline([("std", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
scores_scaled = cross_val_score(knn_scaled, X_raw, y, cv=5, scoring="roc_auc")

print(f"k-NN raw:    {scores_raw.mean():.3f}")
print(f"k-NN scaled: {scores_scaled.mean():.3f}")
# Scaling often improves AUC noticeably here; the size of the gain depends on the dataset
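
To make the "age dominates" claim concrete, here is a minimal sketch with three hypothetical patients. It uses min-max rescaling based on the ranges quoted above rather than StandardScaler, purely to keep the arithmetic easy to follow; the patient values are invented for illustration.

```python
import math

# Hypothetical patients as (age in years, creatinine in mg/dL).
query = (50.0, 1.0)
patient_a = (52.0, 9.0)   # similar age, very different creatinine
patient_b = (60.0, 1.1)   # older, almost identical creatinine

# Raw Euclidean distances: the age axis contributes most of the distance.
raw_a = math.dist(query, patient_a)
raw_b = math.dist(query, patient_b)

def rescale(patient):
    """Min-max rescale using the assumed ranges: age 20-80, creatinine 0.5-10."""
    age, creatinine = patient
    return ((age - 20.0) / 60.0, (creatinine - 0.5) / 9.5)

# Distances after both features are mapped to [0, 1].
scaled_a = math.dist(rescale(query), rescale(patient_a))
scaled_b = math.dist(rescale(query), rescale(patient_b))

print(f"raw:    A={raw_a:.2f}  B={raw_b:.2f}")
print(f"scaled: A={scaled_a:.2f}  B={scaled_b:.2f}")
```

On raw values patient A is the nearer neighbor despite an 8 mg/dL creatinine gap; once both features share a range, patient B becomes the nearer one.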

SVM and SVC

Python

from sklearn.svm import SVC

# An SVM maximizes the margin in feature space, and the RBF kernel depends on
# squared distances, so both are sensitive to feature scale.
# Unscaled, large-range features dominate the margin and the kernel.
# Pipeline, StandardScaler and cross_val_score are imported in the k-NN example above.

svm_raw    = SVC(kernel="rbf")
svm_scaled = Pipeline([("std", StandardScaler()), ("svm", SVC(kernel="rbf"))])

scores_raw    = cross_val_score(svm_raw,    X_raw, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(svm_scaled, X_raw, y, cv=5, scoring="accuracy")

print(f"SVM raw:    {scores_raw.mean():.3f}")
print(f"SVM scaled: {scores_scaled.mean():.3f}")
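
The RBF kernel is exp(-gamma * squared distance), so the same scale problem shows up inside the kernel itself. The sketch below is illustrative only: the `gamma` value, the patient values, and the standard deviations used for rescaling are all hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

gamma = 0.5  # hypothetical kernel width

# Patients as (age in years, creatinine in mg/dL).
query = (50.0, 1.0)
creatinine_gap = (50.0, 3.0)  # same age, creatinine differs by 2 mg/dL
age_gap = (55.0, 1.0)         # same creatinine, age differs by 5 years

# Raw units: the 5-year age gap drives the kernel to almost zero.
k_raw_creatinine = rbf_kernel(query, creatinine_gap, gamma)
k_raw_age = rbf_kernel(query, age_gap, gamma)

def standardize(patient, stds=(15.0, 2.0)):
    """Divide by assumed standard deviations (centering cancels in differences)."""
    return tuple(value / std for value, std in zip(patient, stds))

# Standardized units: the creatinine gap is now the larger difference.
k_std_creatinine = rbf_kernel(standardize(query), standardize(creatinine_gap), gamma)
k_std_age = rbf_kernel(standardize(query), standardize(age_gap), gamma)

print(f"raw: creatinine gap {k_raw_creatinine:.3g}, age gap {k_raw_age:.3g}")
print(f"std: creatinine gap {k_std_creatinine:.3g}, age gap {k_std_age:.3g}")
```

Which patient the kernel treats as "similar" flips once each feature is measured in units of its own spread.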

Logistic Regression and Linear Regression

Python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

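# Iterative solvers (scikit-learn's default is lbfgs) converge slowly when features
# have very different scales, because the loss surface is badly conditioned.
# The default L2 penalty also shrinks all coefficients alike, which is only fair
# when features are on comparable scales.
# With max_iter=100 the unscaled model may stop early with a ConvergenceWarning.

lr_raw    = LogisticRegression(max_iter=100)
lr_scaled = Pipeline([("std", StandardScaler()), ("lr", LogisticRegression(max_iter=100))])

scores_raw    = cross_val_score(lr_raw,    X_raw, y, cv=5)
scores_scaled = cross_val_score(lr_scaled, X_raw, y, cv=5)

print(f"LR raw    (100 iter): {scores_raw.mean():.3f}")
print(f"LR scaled (100 iter): {scores_scaled.mean():.3f}")
# Scaled version typically converges in far fewer iterations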
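
One way to see why convergence slows down is the condition number of the Hessian. For squared loss the Hessian is proportional to the feature covariance matrix; for logistic loss it is a reweighted version of it, so the same intuition applies approximately. The sketch below uses synthetic, independent features drawn from the ranges mentioned earlier; the data and the helper name are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features with the ranges used in this article.
age = rng.uniform(20, 80, size=n)
creatinine = rng.uniform(0.5, 10, size=n)
X = np.column_stack([age, creatinine])

def hessian_condition_number(features):
    """Condition number of the covariance matrix (squared-loss Hessian up to a constant)."""
    centered = features - features.mean(axis=0)
    hessian = centered.T @ centered / len(centered)
    return np.linalg.cond(hessian)

# Standardize: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition_number(X)
cond_std = hessian_condition_number(X_std)

print(f"condition number raw:          {cond_raw:.1f}")
print(f"condition number standardized: {cond_std:.2f}")
```

A condition number close to 1 means the loss surface is nearly round, so gradient steps point almost directly at the minimum; a large one means a long narrow valley and many small zig-zag steps.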

Neural Networks

Python

import torch
import torch.nn as nn

# Large input values → large activations → large gradients → unstable training
# Scaled inputs keep activations in the working range of ReLU/sigmoid/tanh

class DrugClassifier(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

# Always scale before feeding to a neural network
# StandardScaler or MinMaxScaler — both are commonly used
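
The "working range" comment can be checked with a single sigmoid unit. This is a rough sketch, not a training run: the weight, the age value, and the mean and standard deviation used for scaling are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, which is what backpropagation multiplies by."""
    s = sigmoid(z)
    return s * (1.0 - s)

weight = 0.5                           # hypothetical initial weight
raw_age = 70.0                         # unscaled input
scaled_age = (70.0 - 50.0) / 15.0      # assumes mean 50 and std 15

# Pre-activation of 35 vs roughly 0.67.
grad_raw = sigmoid_grad(weight * raw_age)
grad_scaled = sigmoid_grad(weight * scaled_age)

print(f"gradient with raw input:    {grad_raw:.3g}")
print(f"gradient with scaled input: {grad_scaled:.3g}")
```

With the raw input the unit is saturated and passes back almost no gradient, so the weight barely updates; with the scaled input the gradient is close to the sigmoid's maximum of 0.25.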

PCA (Principal Component Analysis)

Python

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA finds directions of maximum variance.
# Without scaling, features measured on large scales dominate the first components.
# After standardizing, every feature has unit variance, so the components reflect
# correlations between features rather than their units of measurement.

pca_raw    = PCA(n_components=3).fit(X_raw)
pca_scaled = PCA(n_components=3).fit(StandardScaler().fit_transform(X_raw))

print("Explained variance (raw):", pca_raw.explained_variance_ratio_.round(3))
print("Explained variance (std):", pca_scaled.explained_variance_ratio_.round(3))
# Raw: the first component typically captures most of the variance when one
#      feature, such as age, has a much larger scale than the rest
# Std: variance is usually spread more evenly across components
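
The same effect can be reproduced without any dataset at hand. The sketch below computes explained-variance ratios directly from the eigenvalues of the covariance matrix, using two synthetic, independent features; the distributions are assumptions chosen only to give very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two independent synthetic features on very different scales.
age = rng.normal(50, 15, size=n)
creatinine = rng.normal(1.2, 0.4, size=n)
X = np.column_stack([age, creatinine])

def explained_variance_ratio(features):
    """Eigenvalues of the covariance matrix, largest first, normalized to sum to 1."""
    cov = np.cov(features, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

ratio_raw = explained_variance_ratio(X)
ratio_std = explained_variance_ratio(X_std)

print("raw:", ratio_raw.round(3))
print("std:", ratio_std.round(3))
```

Because the two features are independent, neither one deserves a dominant component. The raw result gives nearly everything to age anyway, while the standardized result splits the variance roughly in half.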

Algorithms That Do NOT Require Scaling

Decision Trees and Random Forests

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Decision trees split on thresholds such as "age > 55".
# Scaling moves the threshold to the scaled equivalent of 55, but the same samples
# fall on each side, so the tree structure and its predictions do not change.
# Candidate splits (Gini, entropy) are evaluated one feature at a time,
# so the relative scale of different features never enters the comparison.

# Check: identical results with and without scaling
dt_raw    = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled = Pipeline([("std", StandardScaler()), ("dt", DecisionTreeClassifier(max_depth=5, random_state=42))])

scores_raw    = cross_val_score(dt_raw,    X_raw, y, cv=5, scoring="roc_auc")
scores_scaled = cross_val_score(dt_scaled, X_raw, y, cv=5, scoring="roc_auc")

print(f"Decision Tree raw:    {scores_raw.mean():.3f}")
print(f"Decision Tree scaled: {scores_scaled.mean():.3f}")
# Should be identical (or differ only by floating-point noise)
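
The invariance can also be seen on a single split. Below is a small hand-written stump that picks the threshold minimizing weighted Gini impurity; it is a simplified stand-in for what a tree library does, and the ages and labels are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, indices sent left) for the lowest weighted Gini impurity."""
    ordered = sorted(set(values))
    best = None
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2.0  # midpoint between adjacent values
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]

# Hypothetical ages and outcomes.
ages = [25.0, 34.0, 47.0, 52.0, 58.0, 63.0, 71.0, 79.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Standardize the ages by hand.
mean = sum(ages) / len(ages)
std = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5
ages_scaled = [(a - mean) / std for a in ages]

thr_raw, left_raw = best_split(ages, labels)
thr_scaled, left_scaled = best_split(ages_scaled, labels)

print(f"raw threshold:    {thr_raw}")
print(f"scaled threshold: {thr_scaled:.3f}")
print("same partition:", left_raw == left_scaled)
```

The threshold value changes with the units, but the samples on each side of it do not, which is why the fitted tree makes the same predictions either way.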

Gradient Boosting (XGBoost, LightGBM, sklearn GBM)

Python

from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting is built on decision trees: the gradients are taken with respect
# to the predictions, not the input features, and no distances are computed.
# Feature scale is therefore irrelevant

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
scores = cross_val_score(gbm, X_raw, y, cv=5, scoring="roc_auc")
print(f"GBM (no scaling needed): {scores.mean():.3f}")
# Standardizing the features first should give the same score, up to floating-point noise

Naive Bayes

Python

from sklearn.naive_bayes import GaussianNB

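# Naive Bayes models each feature's distribution independently.
# Scaling shifts and stretches those distributions, but GaussianNB fits μ and σ per class
# per feature, so the rescaling is absorbed and the predicted classes stay the same
# (apart from the small var_smoothing term, which is tied to the largest feature variance).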
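
A short numeric check of the "scale is absorbed" argument: rescaling a feature adds the same constant to every class's Gaussian log-density, so the log-likelihood ratio between classes is unchanged. This sketch fits the Gaussians by hand on made-up ages and deliberately ignores GaussianNB's `var_smoothing` adjustment.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return mu, sigma

def gaussian_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_likelihood_ratio(class0, class1, x):
    """log p(x | class 1) - log p(x | class 0) for a single feature."""
    return gaussian_log_pdf(x, *fit_gaussian(class1)) - gaussian_log_pdf(x, *fit_gaussian(class0))

# Hypothetical ages per class, and a query patient.
class0 = [30.0, 35.0, 42.0, 48.0]
class1 = [55.0, 61.0, 68.0, 74.0]
query = 50.0

# Standardize everything with the pooled mean and standard deviation.
pooled_mu, pooled_sigma = fit_gaussian(class0 + class1)

def scale(value):
    return (value - pooled_mu) / pooled_sigma

llr_raw = log_likelihood_ratio(class0, class1, query)
llr_scaled = log_likelihood_ratio(
    [scale(v) for v in class0], [scale(v) for v in class1], scale(query)
)

print(f"log-likelihood ratio raw:    {llr_raw:.6f}")
print(f"log-likelihood ratio scaled: {llr_scaled:.6f}")
```

Since the class priors are also unaffected by scaling, an identical ratio means an identical posterior and therefore the same predicted class.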

Summary Table

| Algorithm | Needs Scaling | Reason | Recommended Scaler |
|---|---|---|---|
| k-NN | Yes | Distance-based | StandardScaler |
| k-Means | Yes | Distance-based | StandardScaler |
| SVM (linear, RBF) | Yes | Distance/margin | StandardScaler |
| Logistic Regression | Yes | Gradient-based optimization | StandardScaler |
| Linear Regression | Yes, when fit by gradient descent | Gradient descent (closed-form least-squares predictions do not depend on scale) | StandardScaler |
| Ridge / Lasso | Yes | Penalty on weights | StandardScaler |
| Neural Networks | Yes | Gradient magnitude | StandardScaler or MinMax |
| PCA | Yes | Variance-based | StandardScaler |
| Decision Tree | No | Threshold splits | None |
| Random Forest | No | Threshold splits | None |
| Gradient Boosting | No | Threshold splits | None |
| XGBoost / LightGBM | No | Threshold splits | None |
| Naive Bayes | No | Per-feature distribution | None |

Practical Rule for Interviews

Distance or gradient? → Scale (StandardScaler as default).
Tree-based splits?    → No scaling needed.
Unsure?               → Scale anyway: it does not hurt tree-based models
                         and helps most others.

One Common Mistake: Scaling Categorical Encoded Features

Python

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# WRONG: applying StandardScaler to one-hot encoded features.
# The {0,1} indicators become fractional values that are harder to interpret,
# and centering would force a sparse one-hot matrix to become dense.

# CORRECT: scale only numeric features, leave encoded categoricals alone
numeric_features     = ["age", "weight_kg", "serum_creatinine"]
categorical_features = ["gender", "discharge_to"]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    # handle_unknown="ignore" avoids errors on categories not seen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Numeric columns are scaled and categorical columns are one-hot encoded, with no overlap
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
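
To see what "fractional values" means in practice, the sketch below standardizes a single made-up indicator column by hand. The column name and values are hypothetical.

```python
import numpy as np

# Hypothetical one-hot column: 2 of 8 patients have gender == "male".
gender_male = np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=float)

# What standardization does to it: subtract the mean, divide by the std.
standardized = (gender_male - gender_male.mean()) / gender_male.std()

unique_values = np.unique(standardized.round(3))
print("before:", np.unique(gender_male))
print("after: ", unique_values)
```

The column still takes only two values, so no information is lost, but they now depend on how common the category is and no longer read as "present" or "absent". A column that was mostly zeros also has no zeros left, which is what breaks sparse storage.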

Interview Answer Template

Q: Which ML algorithms need feature scaling and which don't?

Algorithms that compute distances, such as k-NN, k-Means, and kernel SVMs, and models trained with gradient-based optimization, such as logistic regression and neural networks, are sensitive to feature scale. Without scaling, features with large magnitudes dominate distance calculations or leave the loss surface badly conditioned, so optimization converges slowly and the remaining features are nearly ignored. PCA also requires scaling because it finds directions of maximum variance, and large-scale features would otherwise dominate the first components. Tree-based algorithms (decision trees, random forests, gradient boosting) are scale-invariant because they split on threshold comparisons, one feature at a time. Adding a StandardScaler to a random forest pipeline is harmless but unnecessary. My default: use StandardScaler inside a Pipeline for any distance-based or gradient-based model, and skip it for tree-based models.
