What is Feature Scaling and Why It Matters?
The Problem with Raw Features
Real datasets have features measured on wildly different scales. This breaks several classes of ML algorithms.
# Patient features (raw scales)
patient = {
    "age": 45,                # years: 0–120
    "weight_kg": 82,          # kg: 30–300
    "serum_creatinine": 1.2,  # mg/dL: 0.5–15
    "num_medications": 8,     # count: 0–50
    "systolic_bp": 138,       # mmHg: 70–250
}
# Problem 1: Distance is dominated by large-magnitude features
# Euclidean distance between two patients:
# Δage=5, Δweight=30, Δcreatinine=0.1, Δmeds=2, Δbp=20
# Raw distance ≈ sqrt(5² + 30² + 0.1² + 2² + 20²) ≈ 36.5
# Creatinine contributes almost nothing, even though it's clinically important
Why Magnitude Matters to Some Algorithms
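To make the dominance effect concrete, the distance from the patient comparison can be broken down per feature. This is a minimal sketch using only the differences quoted in the example; the dictionary and variable names are illustrative, not part of any library.

```python
import math

# Per-feature differences between the two patients in the example
deltas = {
    "age": 5,
    "weight_kg": 30,
    "serum_creatinine": 0.1,
    "num_medications": 2,
    "systolic_bp": 20,
}

# Each feature's squared difference is its contribution to the squared distance
squared = {name: d ** 2 for name, d in deltas.items()}
total = sum(squared.values())
distance = math.sqrt(total)

print(f"Raw Euclidean distance: {distance:.1f}")
for name, sq in squared.items():
    share = 100 * sq / total
    print(f"  {name:<18} {share:8.4f}% of squared distance")
```

Weight alone accounts for roughly two thirds of the squared distance, while creatinine contributes less than 0.001%.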
Algorithm            Why raw scale is a problem
─────────────────────────────────────────────────────────────────
k-NN                 Distance is dominated by large features
k-Means              Same: cluster centers pulled toward large scales
SVM                  Decision boundary depends on distances in feature space
Neural Networks      Large inputs → large gradients → unstable learning
Linear/Logistic      Gradient descent converges slowly; weights aren't comparable
PCA                  Variance is dominated by large-scale features
Ridge/Lasso          Penalty is applied to weights, which vary by feature scale

Algorithm            Why raw scale is OK
─────────────────────────────────────────────────────────────────
Decision Tree        Splits on thresholds, so scale-invariant
Random Forest        Same: trees use threshold comparisons
Gradient Boosting    Same: tree-based, threshold splits only
Naive Bayes          Works on distributions per feature independently
What Scaling Does
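Scaling transforms each feature so that all features lie on a comparable scale. The common scalers only shift and rescale each feature, so the ordering and relative spacing of values within a feature are preserved and no information is lost.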
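The two scalers used in this lesson can be written out by hand. The sketch below is a plain NumPy version under stated assumptions: `X_demo` reuses the age and creatinine values from the lesson's three patient rows, and all variable names are illustrative. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Columns: age (years), serum creatinine (mg/dL)
X_demo = np.array([
    [45.0, 1.2],
    [67.0, 4.5],
    [32.0, 0.8],
])

# Min-Max scaling: (x - min) / (max - min), computed per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, computed per column
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population std (ddof=0)
X_standard = (X_demo - col_mean) / col_std

print("Min-Max scaled:\n", X_minmax.round(3))
print("Standardized:\n", X_standard.round(3))

# The rank order of patients within each column is unchanged by either transform
same_order = (
    np.array_equal(np.argsort(X_demo, axis=0), np.argsort(X_minmax, axis=0))
    and np.array_equal(np.argsort(X_demo, axis=0), np.argsort(X_standard, axis=0))
)
print("Ordering preserved:", same_order)
```

Note that a constant column would make both denominators zero, so a real implementation needs to guard against that case.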
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# Before scaling
X_raw = np.array([
    [45, 82, 1.2, 8, 138],
    [67, 110, 4.5, 15, 165],
    [32, 58, 0.8, 3, 115],
])
print("Raw feature ranges:")
print(f" age: {X_raw[:, 0].min():.1f} – {X_raw[:, 0].max():.1f}")
print(f" weight: {X_raw[:, 1].min():.1f} – {X_raw[:, 1].max():.1f}")
print(f" creatinine: {X_raw[:, 2].min():.1f} – {X_raw[:, 2].max():.1f}")
# After MinMax scaling
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X_raw)
print("\nMinMax scaled ranges:")
print(f" All features: {X_mm.min():.1f} – {X_mm.max():.1f}") # 0.0 – 1.0
# After Standard scaling
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X_raw)
print("\nStandard scaled stats:")
print(f" Means: {X_std.mean(axis=0).round(2)}") # ≈ [0, 0, 0, 0, 0]
print(f" Stds: {X_std.std(axis=0).round(2)}") # ≈ [1, 1, 1, 1, 1]
The Fit-Only-on-Train Rule
Fitting the scaler on the full dataset before splitting is one of the most common preprocessing mistakes.
from sklearn.model_selection import train_test_split
# X and y are the full feature matrix and labels, loaded elsewhere
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# WRONG: the scaler sees test data (data leakage)
scaler = StandardScaler()
scaler.fit(X)                               # mean and std computed on all rows, test included
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# CORRECT — scaler fitted on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform train
X_test_scaled = scaler.transform(X_test) # transform only (use train stats)
# Why: in production, you won't have access to future data.
# The scaler's mean and std must come from training data alone.
Scaling in a Pipeline (Recommended)
Pipelines prevent the fit-on-all-data mistake by design.
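One way to check this claim is to inspect what the scaler step has learned after the pipeline is fitted on a training split. The sketch below assumes synthetic data (`X_demo`, `y_demo`, the seed, and the feature scales are all made up for illustration) and relies on `Pipeline.named_steps` and the `mean_` attribute of a fitted `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature dataset on very different scales (assumption, for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 1.5], scale=[15.0, 0.5], size=(200, 2))
y_demo = (X_demo[:, 1] > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)  # the scaler is fitted on the training rows only

learned_mean = pipe.named_steps["scaler"].mean_
matches_train = np.allclose(learned_mean, X_tr.mean(axis=0))
matches_all = np.allclose(learned_mean, X_demo.mean(axis=0))

print("Scaler mean equals training mean: ", matches_train)
print("Scaler mean equals full-data mean:", matches_all)

# predict/score reuse the training statistics on held-out rows
print(f"Held-out accuracy: {pipe.score(X_te, y_te):.3f}")
```

Because `pipe.fit` only ever receives the training split, the test rows cannot influence the scaler's statistics.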
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Pipeline: scaling → model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# cross_val_score fits the scaler on each training fold, applies to val fold
# No leakage — the pipeline handles it correctly
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
Scaling Matters More Than You Think
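The comparison in this section assumes a labeled dataset is already loaded. As a self-contained sketch of the same experiment, the block below substitutes synthetic data from `make_classification` and inflates one uninformative column by a factor of 1000 to mimic a large-scale feature. The dataset, the inflation factor, and the names `X_syn` and `y_syn` are assumptions for illustration, not the lesson's patient data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False, columns 0-2 are informative and columns 3-4 are pure noise
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_syn[:, 4] *= 1000.0  # put one noise column on a much larger scale

knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

auc_raw = cross_val_score(knn_raw, X_syn, y_syn, cv=5, scoring="roc_auc").mean()
auc_scaled = cross_val_score(knn_scaled, X_syn, y_syn, cv=5, scoring="roc_auc").mean()

print(f"k-NN unscaled AUC: {auc_raw:.3f}")
print(f"k-NN scaled AUC:   {auc_scaled:.3f}")
```

On this setup the unscaled score should sit near chance, because the inflated noise column dominates every distance. The size of the gap on real data depends on how the informative features are scaled relative to the rest.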
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Demonstrate the impact: k-NN with and without scaling
# X, y: the full labeled dataset (features such as age 0–120 and
# serum_creatinine 0.5–15 sit on very different scales)
knn = KNeighborsClassifier(n_neighbors=5)
# Without scaling
scores_raw = cross_val_score(knn, X, y, cv=5, scoring="roc_auc")
print(f"k-NN unscaled: {scores_raw.mean():.3f}")
# With scaling (the pipeline refits the scaler inside each training fold)
pipeline_knn = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
scores_scaled = cross_val_score(pipeline_knn, X, y, cv=5, scoring="roc_auc")
print(f"k-NN scaled: {scores_scaled.mean():.3f}")
# Typically a large improvement, sometimes a 10-20% difference in AUC
Interview Answer Template
Q: What is feature scaling and when do you need it?
Feature scaling transforms features onto a comparable numeric range. Raw features often have very different magnitudes: age in years, serum creatinine in mg/dL, blood pressure in mmHg. Distance-based algorithms like k-NN and SVMs are dominated by high-magnitude features, so low-magnitude features are effectively ignored regardless of their predictive power. Gradient-based methods like logistic regression and neural networks converge much faster when features are on similar scales. Tree-based models such as random forests and gradient boosting split on thresholds and do not need scaling. The two main approaches are Min-Max scaling (rescales to [0, 1]) and standardization (zero mean, unit variance). The critical rule: fit the scaler on training data only, then apply it to the validation and test sets, because fitting on all data leaks test statistics into training. In practice, I always use a sklearn Pipeline to enforce this automatically.