Which Algorithms Need Feature Scaling?

A definitive guide to which ML algorithms require feature scaling, which don't, and why — with code demonstrating the impact, scaling recommendations per algorithm, and a quick reference table.

Asma Hafeez Khan | May 16, 2026 | 6 min read
Tags: Machine Learning, Feature Scaling, Algorithm Selection, Preprocessing, Interview

The Core Rule

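Feature scaling matters when an algorithm computes distances between points, is optimized with gradient descent, or penalizes the size of its coefficients (as Ridge and Lasso do). It is irrelevant for algorithms that split each feature on a threshold, because rescaling a feature preserves the ordering of its values.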
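To make the distance half of the rule concrete, here is a minimal sketch using NumPy only. The cohort, the query patient, and the two candidates are hypothetical values chosen for illustration. The ranges match the age and creatinine ranges used in the k-NN example that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cohort, used only to estimate each feature's mean and spread
cohort = np.column_stack([
    rng.uniform(20, 80, size=1000),    # age in years
    rng.uniform(0.5, 10, size=1000),   # serum creatinine
])
mean, std = cohort.mean(axis=0), cohort.std(axis=0)

query = np.array([50.0, 1.0])
candidates = np.array([
    [52.0, 6.0],   # A: similar age, very different creatinine
    [60.0, 1.1],   # B: older, nearly identical creatinine
])

# Euclidean distance on the raw features
raw_dist = np.linalg.norm(candidates - query, axis=1)

# Same distance after standardizing with the cohort statistics
scaled_dist = np.linalg.norm((candidates - mean) / std - (query - mean) / std, axis=1)

print("raw distances (A, B):   ", raw_dist.round(2))
print("scaled distances (A, B):", scaled_dist.round(2))
```

On the raw features, candidate A is the nearest neighbor because a 2-year age gap outweighs a creatinine gap that spans about half of that feature's range. After standardization the ranking flips and B becomes the nearest neighbor.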


Algorithms That REQUIRE Scaling

k-Nearest Neighbors (k-NN)

Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

# k-NN: distance between patients
# age: 20–80, creatinine: 0.5–10, so the ranges are very different
# Without scaling, "nearest" neighbors are determined almost entirely by age
# X_raw and y are assumed to be your unscaled feature matrix and labels

knn = KNeighborsClassifier(n_neighbors=5)

scores_raw = cross_val_score(knn, X_raw, y, cv=5, scoring="roc_auc")

knn_scaled = Pipeline([("std", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
scores_scaled = cross_val_score(knn_scaled, X_raw, y, cv=5, scoring="roc_auc")

print(f"k-NN raw:    {scores_raw.mean():.3f}")
print(f"k-NN scaled: {scores_scaled.mean():.3f}")
# Scaling often improves AUC noticeably when ranges differ this much; the size of the gain depends on the dataset

SVM and SVC

Python
from sklearn.svm import SVC

# SVM maximizes margin in feature space — margin width depends on feature scale
# Large-scale features dominate the margin

svm_raw    = SVC(kernel="rbf")
svm_scaled = Pipeline([("std", StandardScaler()), ("svm", SVC(kernel="rbf"))])

scores_raw    = cross_val_score(svm_raw,    X_raw, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(svm_scaled, X_raw, y, cv=5, scoring="accuracy")

print(f"SVM raw:    {scores_raw.mean():.3f}")
print(f"SVM scaled: {scores_scaled.mean():.3f}")

Logistic Regression and Linear Regression

Python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

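# Iterative solvers take uneven steps when features differ widely in scale
# → slow convergence if not scaled (the raw fit may emit a ConvergenceWarning at max_iter=100)
# The default L2 penalty also treats coefficients differently depending on feature scale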

lr_raw    = LogisticRegression(max_iter=100)
lr_scaled = Pipeline([("std", StandardScaler()), ("lr", LogisticRegression(max_iter=100))])

scores_raw    = cross_val_score(lr_raw,    X_raw, y, cv=5)
scores_scaled = cross_val_score(lr_scaled, X_raw, y, cv=5)

print(f"LR raw    (100 iter): {scores_raw.mean():.3f}")
print(f"LR scaled (100 iter): {scores_scaled.mean():.3f}")
# Scaled version typically converges in far fewer iterations
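The convergence claim can be checked without scikit-learn. The sketch below swaps in plain batch gradient descent on a least-squares objective, not the solver `LogisticRegression` actually uses, because the step-size reasoning is easiest to see there. The data, the coefficients, and the `gd_steps` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features on very different scales
X = np.column_stack([
    rng.uniform(20, 80, size=n),     # age
    rng.uniform(0.5, 10, size=n),    # creatinine
])
X_centered = X - X.mean(axis=0)      # centering removes the need for an intercept
y = 0.05 * X_centered[:, 0] + 1.0 * X_centered[:, 1] + rng.normal(0, 0.1, size=n)
y = y - y.mean()

X_scaled = X_centered / X_centered.std(axis=0)

def gd_steps(X, y, rel_tol=1e-6, max_steps=100_000):
    """Count gradient descent updates until the gradient norm shrinks by rel_tol."""
    n_samples = len(y)
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L is the largest eigenvalue of the Hessian X^T X / n
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n_samples).max()
    grad = X.T @ (X @ w - y) / n_samples
    stop = rel_tol * np.linalg.norm(grad)
    steps = 0
    while np.linalg.norm(grad) > stop and steps < max_steps:
        w -= lr * grad
        grad = X.T @ (X @ w - y) / n_samples
        steps += 1
    return steps

steps_raw = gd_steps(X_centered, y)
steps_scaled = gd_steps(X_scaled, y)
print(f"updates needed (raw):    {steps_raw}")
print(f"updates needed (scaled): {steps_scaled}")
```

The step size is limited by the largest-variance feature, so the small-variance direction crawls toward its optimum on raw data. After standardization both directions have similar curvature and the same tolerance is reached in far fewer updates.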

Neural Networks

Python
import torch
import torch.nn as nn

# Large input values → large activations → large gradients → unstable training
# Scaled inputs keep activations in the working range of ReLU/sigmoid/tanh

class DrugClassifier(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

# Always scale before feeding to a neural network
# StandardScaler or MinMaxScaler — both are commonly used
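The remark about the working range of the activations can be illustrated with a single sigmoid unit. This is a minimal NumPy sketch. The weight and the two input values are hypothetical and stand for one patient's age before and after standardization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its pre-activation."""
    s = sigmoid(z)
    return s * (1.0 - s)

weight = 0.5          # hypothetical initial weight
raw_age = 80.0        # unscaled input
scaled_age = 1.5      # illustrative standardized value for the same input

grad_raw = sigmoid_grad(weight * raw_age)        # pre-activation 40
grad_scaled = sigmoid_grad(weight * scaled_age)  # pre-activation 0.75

print(f"sigmoid gradient with raw input:    {grad_raw:.3e}")
print(f"sigmoid gradient with scaled input: {grad_scaled:.3f}")
```

With the raw input the unit is saturated and its gradient is effectively zero, so almost no learning signal flows back through it. With the scaled input the gradient stays near the sigmoid's maximum of 0.25.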

PCA (Principal Component Analysis)

Python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA finds directions of maximum variance
# Without scaling: high-variance features (large scale) dominate the first components
# After scaling: every feature has unit variance, so components reflect correlations rather than units

pca_raw    = PCA(n_components=3).fit(X_raw)
pca_scaled = PCA(n_components=3).fit(StandardScaler().fit_transform(X_raw))

print("Explained variance (raw):", pca_raw.explained_variance_ratio_.round(3))
print("Explained variance (std):", pca_scaled.explained_variance_ratio_.round(3))
# Raw: the first component captures most of the variance when a large-scale feature such as age dominates
# Std: variance is spread more evenly across components
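For a version that runs without a predefined `X_raw`, the following sketch computes the explained variance ratio directly from the eigenvalues of the covariance matrix. The two synthetic features are hypothetical and independent by construction, and `explained_variance_ratio` is a helper written for this example, not a scikit-learn function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical independent features: age (wide range) and creatinine (narrow range)
X = np.column_stack([
    rng.uniform(20, 80, size=500),
    rng.uniform(0.5, 10, size=500),
])

def explained_variance_ratio(data):
    """Covariance eigenvalues, largest first, normalized to sum to 1."""
    eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

ratio_raw = explained_variance_ratio(X)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
ratio_std = explained_variance_ratio(X_std)

print("explained variance ratio (raw):", ratio_raw.round(3))
print("explained variance ratio (std):", ratio_std.round(3))
```

Because the features are independent, any imbalance in the raw ratios comes purely from units. After standardization the two components share the variance almost equally.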

Algorithms That Do NOT Require Scaling

Decision Trees and Random Forests

Python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Decision trees split on thresholds: "age > 55"
# Scaling changes the threshold value (55 becomes a z-score), but the same samples
# land on each side of the split, so the learned partition is unchanged
# Splitting criterion (Gini, entropy) is computed per feature independently

# Check: results should match with and without scaling
dt_raw    = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled = Pipeline([("std", StandardScaler()), ("dt", DecisionTreeClassifier(max_depth=5, random_state=42))])

scores_raw    = cross_val_score(dt_raw,    X_raw, y, cv=5, scoring="roc_auc")
scores_scaled = cross_val_score(dt_scaled, X_raw, y, cv=5, scoring="roc_auc")

print(f"Decision Tree raw:    {scores_raw.mean():.3f}")
print(f"Decision Tree scaled: {scores_scaled.mean():.3f}")
# Should be identical (or differ only by floating-point noise)
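The reason a tree is unaffected can be shown on a single split. In this minimal sketch the ages are synthetic and the cut point of 55 is taken from the comment above. Standardization moves the threshold to a different number but leaves the left and right groups untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=1000)   # hypothetical ages

mean, std = age.mean(), age.std()
age_scaled = (age - mean) / std

raw_threshold = 55.0
scaled_threshold = (raw_threshold - mean) / std   # the same cut point in standardized units

goes_left_raw = age <= raw_threshold
goes_left_scaled = age_scaled <= scaled_threshold

print(f"threshold on raw feature:    {raw_threshold:.2f}")
print(f"threshold on scaled feature: {scaled_threshold:.2f}")
print("same partition:", np.array_equal(goes_left_raw, goes_left_scaled))
```

Impurity measures such as Gini depend only on which samples fall on each side, so a tree trained on the scaled feature can reach exactly the same splits.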

Gradient Boosting (XGBoost, LightGBM, sklearn GBM)

Python
from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting is built on decision trees — no distance, no gradient over inputs
# Feature scale is irrelevant

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
scores = cross_val_score(gbm, X_raw, y, cv=5, scoring="roc_auc")
print(f"GBM (no scaling needed): {scores.mean():.3f}")
# Scaling the features first should give essentially identical performance

Naive Bayes

Python
from sklearn.naive_bayes import GaussianNB

# Naive Bayes models each feature's distribution independently, per class
# A linear rescaling changes the fitted μ and σ but not the resulting class predictions
# (GaussianNB fits μ and σ per class per feature, so the scale is absorbed)
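The "scale is absorbed" argument can be verified with a small hand-written Gaussian Naive Bayes. This is a sketch under stated assumptions: the two-class data is synthetic, priors are equal, and `class_margin` is a hypothetical helper that omits the small variance-smoothing term scikit-learn's `GaussianNB` adds (`var_smoothing`), so library results may differ very slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class data: columns are [age, serum creatinine]
X0 = np.column_stack([rng.normal(45, 10, 200), rng.normal(1.0, 0.3, 200)])
X1 = np.column_stack([rng.normal(60, 10, 200), rng.normal(2.5, 0.8, 200)])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

def class_margin(data, labels):
    """Log-likelihood of class 1 minus class 0 under per-feature Gaussians."""
    scores = []
    for c in (0, 1):
        rows = data[labels == c]
        mu, var = rows.mean(axis=0), rows.var(axis=0)
        log_pdf = -0.5 * (np.log(2 * np.pi * var) + (data - mu) ** 2 / var)
        scores.append(log_pdf.sum(axis=1))
    return scores[1] - scores[0]

margin_raw = class_margin(X, y)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
margin_std = class_margin(X_std, y)

print("largest difference in class margin:", np.abs(margin_raw - margin_std).max())
```

Standardizing shifts every class's log-density by the same constant per feature, so the constant cancels in the comparison between classes and the decision for each sample is unchanged.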

Summary Table

| Algorithm | Needs Scaling | Reason | Recommended Scaler |
|---|---|---|---|
| k-NN | Yes | Distance-based | StandardScaler |
| k-Means | Yes | Distance-based | StandardScaler |
| SVM (linear, RBF) | Yes | Distance/margin | StandardScaler |
| Logistic Regression | Yes | Gradient descent | StandardScaler |
| Linear Regression | Yes, when fit by gradient descent (closed-form least-squares predictions are unaffected) | Gradient descent | StandardScaler |
| Ridge / Lasso | Yes | Penalty on weights | StandardScaler |
| Neural Networks | Yes | Gradient magnitude | StandardScaler or MinMax |
| PCA | Yes | Variance-based | StandardScaler |
| Decision Tree | No | Threshold splits | None |
| Random Forest | No | Threshold splits | None |
| Gradient Boosting | No | Threshold splits | None |
| XGBoost / LightGBM | No | Threshold splits | None |
| Naive Bayes | No | Per-feature distribution | None |


Practical Rule for Interviews

Distance or gradient? → Scale (StandardScaler as default).
Tree-based splits?    → No scaling needed.
Unsure?               → Scale anyway — it never hurts tree-based models
                         and helps everything else.

One Common Mistake: Scaling Categorical Encoded Features

Python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# WRONG: applying StandardScaler to one-hot encoded features
# StandardScaler turns binary {0,1} indicators into fractional values that no longer read as "absent/present"

# CORRECT: scale only numeric features, leave encoded categoricals alone
numeric_features     = ["age", "weight_kg", "serum_creatinine"]
categorical_features = ["gender", "discharge_to"]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(),   numeric_features),
    ("cat", OneHotEncoder(),    categorical_features),
])

# This scales numeric features and one-hot encodes categoricals — no overlap
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
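To see what the "WRONG" case does to an indicator column, here is a minimal NumPy sketch. The column is a hypothetical one-hot feature that is active in 20% of rows, and the standardization is written out by hand to mirror what `StandardScaler` computes by default.

```python
import numpy as np

# Hypothetical one-hot column, active (1) in 2 of 10 rows
indicator = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)

# Zero mean, unit variance, using the population standard deviation
standardized = (indicator - indicator.mean()) / indicator.std()

print("original values:    ", np.unique(indicator))
print("standardized values:", np.unique(standardized).round(3))
# The two levels become approximately -0.5 and 2.0
```

The two resulting values depend on how common the category is, so each one-hot column ends up on its own arbitrary scale. Routing only the numeric columns through the scaler, as the `ColumnTransformer` above does, avoids this.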

Interview Answer Template

Q: Which ML algorithms need feature scaling and which don't?

Algorithms that compute distances, like k-NN, k-Means, and kernel SVMs, and models trained with gradient descent, like logistic regression and neural networks, are sensitive to feature scale. Without scaling, features with large magnitudes dominate distance calculations or receive disproportionately large gradient updates, making other features nearly irrelevant. PCA also requires scaling because it finds directions of maximum variance, so large-scale features would otherwise dominate the first component.

Tree-based algorithms (decision trees, random forests, gradient boosting) are scale-invariant because they split on threshold comparisons per feature independently. Adding a StandardScaler to a random forest pipeline is harmless but unnecessary.

My default: use StandardScaler inside a Pipeline for any distance-based or gradient-based model, and skip it for tree-based models.
