Learnixo
Back to blog
AI Systemsintermediate

What is Feature Scaling?

Feature scaling explained: why raw feature magnitudes mislead distance-based and gradient-based models, what scaling does, and which algorithms require it.

Asma Hafeez KhanMay 16, 20264 min read
Machine LearningFeature ScalingPreprocessingNormalizationInterview
Share:𝕏

The Problem with Raw Features

Real datasets have features measured on wildly different scales. This breaks several classes of ML algorithms.

Python
import pandas as pd

# Patient features β€” raw scales
patient = {
    "age":            45,        # years: 0–120
    "weight_kg":      82,        # kg: 30–300
    "serum_creatinine": 1.2,     # mg/dL: 0.5–15
    "num_medications": 8,        # count: 0–50
    "systolic_bp":    138,       # mmHg: 70–250
}

# Problem 1: Distance is dominated by large-magnitude features
# Euclidean distance between two patients:
# Ξ”age=5, Ξ”weight=30, Ξ”creatinine=0.1, Ξ”meds=2, Ξ”bp=20
# Raw distance β‰ˆ sqrt(5Β² + 30Β² + 0.1Β² + 2Β² + 20Β²) β‰ˆ 36.4
# Creatinine contributes almost nothing β€” even though it's clinically important

Why Magnitude Matters to Some Algorithms

Algorithm          Why raw scale is a problem
─────────────────────────────────────────────────────────────────
k-NN               Distance is dominated by large features
k-Means            Same β€” cluster centers pulled toward large scales
SVM                Decision boundary depends on distances in feature space
Neural Networks    Large inputs β†’ large gradients β†’ unstable learning
Linear/Logistic    Gradient descent converges slowly; weights aren't comparable
PCA                Variance is dominated by large-scale features
Ridge/Lasso        Penalty is applied to weights, which vary by feature scale

Algorithm          Why raw scale is OK
─────────────────────────────────────────────────────────────────
Decision Tree      Splits on thresholds β€” scale-invariant
Random Forest      Same β€” trees use threshold comparisons
Gradient Boosting  Same β€” tree-based, threshold splits only
Naive Bayes        Works on distributions per feature independently

What Scaling Does

Scaling transforms each feature so all features are on a comparable scale β€” without changing the information content.

Python
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Before scaling
X_raw = np.array([
    [45,  82, 1.2,  8, 138],
    [67, 110, 4.5, 15, 165],
    [32,  58, 0.8,  3, 115],
])

print("Raw feature ranges:")
print(f"  age:         {X_raw[:, 0].min():.1f} – {X_raw[:, 0].max():.1f}")
print(f"  weight:      {X_raw[:, 1].min():.1f} – {X_raw[:, 1].max():.1f}")
print(f"  creatinine:  {X_raw[:, 2].min():.1f} – {X_raw[:, 2].max():.1f}")

# After MinMax scaling
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X_raw)

print("\nMinMax scaled ranges:")
print(f"  All features: {X_mm.min():.1f} – {X_mm.max():.1f}")   # 0.0 – 1.0

# After Standard scaling
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X_raw)

print("\nStandard scaled stats:")
print(f"  Means: {X_std.mean(axis=0).round(2)}")   # β‰ˆ [0, 0, 0, 0, 0]
print(f"  Stds:  {X_std.std(axis=0).round(2)}")    # β‰ˆ [1, 1, 1, 1, 1]

The Fit-Only-on-Train Rule

This is the most common preprocessing mistake.

Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# WRONG β€” scaler sees test data (data leakage)
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X)   # computed on all data
X_train_scaled = X_all_scaled[:len(X_train)]
X_test_scaled  = X_all_scaled[len(X_train):]

# CORRECT β€” scaler fitted on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform train
X_test_scaled  = scaler.transform(X_test)         # transform only (use train stats)

# Why: in production, you won't have access to future data.
# The scaler's mean and std must come from training data alone.

Scaling in a Pipeline (Recommended)

Pipelines prevent the fit-on-all-data mistake by design.

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pipeline: scaling β†’ model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LogisticRegression(max_iter=1000)),
])

# cross_val_score fits the scaler on each training fold, applies to val fold
# No leakage β€” the pipeline handles it correctly
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} Β± {scores.std():.3f}")

Scaling Matters More Than You Think

Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

# Demonstrate the impact: k-NN with and without scaling
# (features: age 0-80, serum_creatinine 0.5-15 β€” very different scales)

knn = KNeighborsClassifier(n_neighbors=5)

# Without scaling
scores_raw = cross_val_score(knn, X_raw, y, cv=5, scoring="roc_auc")
print(f"k-NN unscaled:  {scores_raw.mean():.3f}")

# With scaling
pipeline_knn = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
scores_scaled = cross_val_score(pipeline_knn, X_raw, y, cv=5, scoring="roc_auc")
print(f"k-NN scaled:    {scores_scaled.mean():.3f}")
# Typically a large improvement β€” sometimes 10-20% AUC difference

Interview Answer Template

Q: What is feature scaling and when do you need it?

Feature scaling transforms features onto a comparable numeric range. Raw features have wildly different magnitudes β€” age in years vs serum creatinine in mg/dL vs blood pressure in mmHg. Distance-based algorithms like k-NN and SVMs are dominated by high-magnitude features, making low-magnitude features irrelevant regardless of their predictive power. Gradient-based methods like logistic regression and neural networks converge much faster when features are on similar scales. The two main approaches are Min-Max scaling (rescales to [0, 1]) and standardization (zero mean, unit variance). The critical rule: fit the scaler on training data only, then apply it to val and test β€” fitting on all data leaks test statistics into training. In practice, I always use a sklearn Pipeline to enforce this automatically.

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.