Learnixo

Machine Learning Foundations · Lesson 31 of 70

What is Feature Scaling and Why It Matters?

The Problem with Raw Features

Real datasets have features measured on wildly different scales. This degrades algorithms that rely on distances, gradients, or variance, because large-magnitude features dominate the computation.

Python
import pandas as pd

# Patient features: raw scales
patient = {
    "age":            45,        # years: 0–120
    "weight_kg":      82,        # kg: 30–300
    "serum_creatinine": 1.2,     # mg/dL: 0.5–15
    "num_medications": 8,        # count: 0–50
    "systolic_bp":    138,       # mmHg: 70–250
}

# Problem 1: Distance is dominated by large-magnitude features
# Euclidean distance between two patients:
# Δage=5, Δweight=30, Δcreatinine=0.1, Δmeds=2, Δbp=20
# Raw distance = sqrt(5² + 30² + 0.1² + 2² + 20²) ≈ 36.5
# Creatinine contributes almost nothing, even though it's clinically important
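
The arithmetic in the comment above can be checked directly. This is a minimal sketch that reuses the hypothetical differences between the two patients; the variable names (`deltas`, `shares`) are illustrative and not part of the lesson's code.

```python
import numpy as np

# Differences between two hypothetical patients (values from the comment above)
features = ["age", "weight_kg", "serum_creatinine", "num_medications", "systolic_bp"]
deltas = np.array([5.0, 30.0, 0.1, 2.0, 20.0])

squared = deltas ** 2
distance = np.sqrt(squared.sum())
shares = squared / squared.sum()   # fraction of the squared distance per feature

print(f"Raw Euclidean distance: {distance:.2f}")
for name, share in zip(features, shares):
    print(f"  {name:<18} {share:.4%} of squared distance")
```

Weight alone accounts for roughly two thirds of the squared distance, while creatinine accounts for well under 0.01% of it.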

Why Magnitude Matters to Some Algorithms

Algorithm          Why raw scale is a problem
─────────────────────────────────────────────────────────────────
k-NN               Distance is dominated by large features
k-Means            Same — cluster centers pulled toward large scales
SVM                Decision boundary depends on distances in feature space
Neural Networks    Large inputs → large gradients → unstable learning
Linear/Logistic    Gradient descent converges slowly; weights aren't comparable
PCA                Variance is dominated by large-scale features
Ridge/Lasso        Penalty is applied to weights, which vary by feature scale

Algorithm          Why raw scale is OK
─────────────────────────────────────────────────────────────────
Decision Tree      Splits on thresholds — scale-invariant
Random Forest      Same — trees use threshold comparisons
Gradient Boosting  Same — tree-based, threshold splits only
Naive Bayes        Works on distributions per feature independently
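
The scale-invariance claim for trees can be checked with a small experiment. This is a sketch on synthetic data from `make_classification` (an assumption for illustration, not the lesson's patient data): each feature is multiplied by a different positive constant, as a change of units would do, and the tree's predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Simulate unit changes: multiply each feature by a different positive constant
X_rescaled = X_demo * np.array([1.0, 1000.0, 10.0, 100.0, 1.0])

tree_a = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_rescaled, y_demo)

# Thresholds move with the units, but the resulting partition of the samples is the same
same = np.array_equal(tree_a.predict(X_demo), tree_b.predict(X_rescaled))
print(f"Identical predictions after rescaling: {same}")
```

Repeating the same experiment with k-NN in place of the tree would generally change the predictions, which is the contrast the two tables describe.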

What Scaling Does

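Scaling transforms each feature so that all features are on a comparable scale. Min-max scaling and standardization are linear transformations applied per feature, so they preserve the ordering and relative spacing of values within each feature.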

Python
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Before scaling
X_raw = np.array([
    [45,  82, 1.2,  8, 138],
    [67, 110, 4.5, 15, 165],
    [32,  58, 0.8,  3, 115],
])

print("Raw feature ranges:")
print(f"  age:         {X_raw[:, 0].min():.1f} – {X_raw[:, 0].max():.1f}")
print(f"  weight:      {X_raw[:, 1].min():.1f} – {X_raw[:, 1].max():.1f}")
print(f"  creatinine:  {X_raw[:, 2].min():.1f} – {X_raw[:, 2].max():.1f}")

# After MinMax scaling
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X_raw)

print("\nMinMax scaled ranges:")
print(f"  All features: {X_mm.min():.1f} – {X_mm.max():.1f}")   # 0.0 – 1.0

# After Standard scaling
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X_raw)

print("\nStandard scaled stats:")
print(f"  Means: {X_std.mean(axis=0).round(2)}")   # ≈ [0, 0, 0, 0, 0]
print(f"  Stds:  {X_std.std(axis=0).round(2)}")    # ≈ [1, 1, 1, 1, 1]
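
The two scalers apply simple per-column formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below recomputes both by hand with NumPy and compares the result with scikit-learn. `X_demo` is a copy of the three patient rows used above; the other names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo = np.array([
    [45.0,  82.0, 1.2,  8.0, 138.0],
    [67.0, 110.0, 4.5, 15.0, 165.0],
    [32.0,  58.0, 0.8,  3.0, 115.0],
])

# Min-max: (x - min) / (max - min), computed per column
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_mm_manual = (X_demo - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, per column, with the population std (ddof=0)
X_std_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

mm_ok = np.allclose(X_mm_manual, MinMaxScaler().fit_transform(X_demo))
std_ok = np.allclose(X_std_manual, StandardScaler().fit_transform(X_demo))
print(f"Min-max matches sklearn:  {mm_ok}")
print(f"Standard matches sklearn: {std_ok}")
```

Note that `StandardScaler` divides by the population standard deviation (NumPy's default `ddof=0`), not the sample standard deviation that pandas reports by default.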

The Fit-Only-on-Train Rule

Fitting the scaler on the full dataset before splitting is one of the most common preprocessing mistakes.

Python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y: full feature matrix and labels (assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# WRONG: scaler sees test data (data leakage)
scaler = StandardScaler()
scaler.fit(X)                                     # statistics computed on all data
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# CORRECT: scaler fitted on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform train
X_test_scaled  = scaler.transform(X_test)         # transform only (use train stats)

# Why: in production, you won't have access to future data.
# The scaler's mean and std must come from training data alone.
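
To make the rule concrete, the sketch below fits a scaler on the training split of a synthetic dataset (random numbers standing in for real features, an assumption for illustration) and confirms that the stored statistics are the training statistics. The test split is then transformed with those same numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 1.5], scale=[15.0, 0.8], size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)

# The stored statistics are exactly the training-set statistics
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print(np.allclose(scaler.scale_, X_tr.std(axis=0)))

# Test data is transformed with the training statistics,
# so its column means land near 0 but not exactly on 0
X_te_scaled = scaler.transform(X_te)
print(X_te_scaled.mean(axis=0).round(3))
```

A scaled test set whose means are slightly off zero is expected and is not a sign of a bug.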

Scaling in a Pipeline (Recommended)

Pipelines prevent the fit-on-all-data mistake by design.

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pipeline: scaling → model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LogisticRegression(max_iter=1000)),
])

# cross_val_score fits the scaler on each training fold, then applies it to the validation fold
# No leakage: the pipeline handles it correctly
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Pipeline CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
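
The same behaviour can be inspected outside cross-validation. This sketch fits a pipeline on a training split of synthetic data (`make_classification`, an assumption for illustration) and checks that the scaler inside it holds training statistics only, and that `predict` reuses them on new data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)   # scaler statistics come from X_tr only

# The fitted scaler is reachable by its step name
fitted_scaler = pipe.named_steps["scaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))

# pipe.predict is equivalent to transform (not fit) followed by the model's predict
manual = pipe.named_steps["model"].predict(fitted_scaler.transform(X_te))
print(np.array_equal(pipe.predict(X_te), manual))
```

Because the scaler and model travel together as one object, saving the fitted pipeline also saves the training statistics needed at prediction time.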

Scaling Matters More Than You Think

Python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Demonstrate the impact: k-NN with and without scaling
# (features: age 0-80, serum_creatinine 0.5-15, so very different scales)
# X, y: unscaled feature matrix and labels (assumed to be loaded already)

knn = KNeighborsClassifier(n_neighbors=5)

# Without scaling
scores_raw = cross_val_score(knn, X, y, cv=5, scoring="roc_auc")
print(f"k-NN unscaled:  {scores_raw.mean():.3f}")

# With scaling
pipeline_knn = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
scores_scaled = cross_val_score(pipeline_knn, X, y, cv=5, scoring="roc_auc")
print(f"k-NN scaled:    {scores_scaled.mean():.3f}")
# The gain can be large when feature scales differ widely, but its size depends on the dataset
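
For a version that runs without a loaded dataset, the sketch below builds a synthetic one. The data is constructed so that the informative feature has a small scale and a pure-noise feature has a large one, so it favours scaling by design and is not representative of any real dataset. All names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
y_demo = rng.integers(0, 2, size=n)

# Small-scale informative feature and large-scale pure-noise feature
informative = y_demo + rng.normal(0.0, 0.5, size=n)
noise = rng.normal(0.0, 1000.0, size=n)
X_demo = np.column_stack([informative, noise])

knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

auc_raw = cross_val_score(knn_raw, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
auc_scaled = cross_val_score(knn_scaled, X_demo, y_demo, cv=5, scoring="roc_auc").mean()

print(f"k-NN unscaled: {auc_raw:.3f}")
print(f"k-NN scaled:   {auc_scaled:.3f}")
```

Without scaling, the nearest neighbours are chosen almost entirely by the noise column, so the unscaled model has little signal to work with.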

Interview Answer Template

Q: What is feature scaling and when do you need it?

Feature scaling transforms features onto a comparable numeric range. Raw features have wildly different magnitudes: age in years, serum creatinine in mg/dL, blood pressure in mmHg. Distance-based algorithms like k-NN and SVMs are dominated by high-magnitude features, so low-magnitude features are effectively ignored regardless of their predictive power. Gradient-based methods like logistic regression and neural networks converge much faster when features are on similar scales. Tree-based models split on thresholds and do not need scaling. The two main approaches are Min-Max scaling (rescales to [0, 1]) and standardization (zero mean, unit variance). The critical rule: fit the scaler on training data only, then apply it to the validation and test sets, because fitting on all data leaks test statistics into training. In practice, I always use an sklearn Pipeline to enforce this automatically.