Machine Learning Foundations · Lesson 32 of 70

Normalization vs Standardization

Two Approaches to the Same Goal

Both transform features onto a comparable scale — but with different formulas, different properties, and different use cases.

Normalization (Min-Max):
  x_scaled = (x - x_min) / (x_max - x_min)
  Output range: [0, 1]

Standardization (Z-score):
  x_scaled = (x - μ) / σ
  Output: zero mean, unit variance — no fixed range
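
The two formulas can be written directly in NumPy. The following is a minimal sketch, not a library implementation: the helper names `min_max_scale` and `z_score_scale` are made up for illustration, the input is the six creatinine values this lesson uses in its examples, and the standard deviation is the population one (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np

# The six creatinine values (mg/dL) used in this lesson's examples.
x = np.array([0.7, 0.9, 1.1, 0.8, 1.3, 9.5])


def min_max_scale(values):
    """Min-max normalization: (x - x_min) / (x_max - x_min)."""
    return (values - values.min()) / (values.max() - values.min())


def z_score_scale(values):
    """Standardization: (x - mean) / std, with the population std (ddof=0)."""
    return (values - values.mean()) / values.std()


x_mm = min_max_scale(x)
x_z = z_score_scale(x)

# Min-max output is pinned to [0, 1]; z-score output has mean 0 and std 1.
print(f"min-max: min={x_mm.min():.3f}, max={x_mm.max():.3f}")
print(f"z-score: mean={x_z.mean():.3f}, std={x_z.std():.3f}")
```

Both helpers are affine maps of the input, so they change location and scale but never the ordering or relative spacing of the values.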

Side-by-Side Comparison

Python

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Serum creatinine values (mg/dL) — typical clinical range 0.5–1.5, outlier at 9.5
creatinine = np.array([[0.7], [0.9], [1.1], [0.8], [1.3], [9.5]])

# Normalization
mm = MinMaxScaler()
cr_mm = mm.fit_transform(creatinine)

# Standardization
std = StandardScaler()
cr_std = std.fit_transform(creatinine)

print(f"{'Value':>8}  {'MinMax':>8}  {'Z-score':>8}")
for raw, mm_val, std_val in zip(creatinine, cr_mm, cr_std):
    print(f"{raw[0]:>8.1f}  {mm_val[0]:>8.3f}  {std_val[0]:>8.3f}")

# Output:
#    Value    MinMax   Z-score
#      0.7     0.000    -0.528
#      0.9     0.023    -0.465
#      1.1     0.045    -0.402
#      0.8     0.011    -0.497
#      1.3     0.068    -0.340
#      9.5     1.000     2.232  ← outlier pushed to 1.0 in MinMax, about 2.2 in Z-score

How Outliers Affect Each Method

Normalization with an outlier:
  creatinine values: 0.7, 0.9, 1.1, 0.8, 1.3, 9.5
  x_min = 0.7, x_max = 9.5 → all "normal" values get compressed into [0, 0.07]
  The outlier dominates the scale: all other values cluster near zero

Standardization with an outlier:
  μ = 2.38, σ = 3.19  (population σ, as StandardScaler computes it)
  Normal values: -0.53 to -0.34  (all below zero, because the outlier drags μ up)
  Outlier: +2.23  (flagged as an extreme value)
  No fixed range to saturate, but μ and σ are both pulled by the outlier,
  so the normal values still sit in a band only about 0.19 wide
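
To see how much a single extreme value moves the statistics that each scaler depends on, the sketch below recomputes them with and without the 9.5 reading. It is an illustration only: the cutoff of 5.0 used to drop the outlier is an arbitrary choice for this example, not a clinical threshold.

```python
import numpy as np

values = np.array([0.7, 0.9, 1.1, 0.8, 1.3, 9.5])
normal_only = values[values < 5.0]  # arbitrary cutoff, used only to drop the 9.5 reading

for label, sample in [("with outlier", values), ("without outlier", normal_only)]:
    mu = sample.mean()
    sigma = sample.std()                      # population std (ddof=0)
    value_range = sample.max() - sample.min() # denominator of min-max scaling
    print(f"{label:>16}: mean={mu:.2f}  std={sigma:.2f}  range={value_range:.1f}")

# Output:
#     with outlier: mean=2.38  std=3.19  range=8.8
#  without outlier: mean=0.96  std=0.22  range=0.6
```

One reading out of six inflates the min-max denominator and the standard deviation by roughly fifteen times each, which is why neither method is robust to outliers.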

Python

import numpy as np
from sklearn.preprocessing import RobustScaler

# RobustScaler: uses median and IQR instead of mean and std
creatinine = np.array([[0.7], [0.9], [1.1], [0.8], [1.3], [9.5]])  # same values as above

robust = RobustScaler()
cr_robust = robust.fit_transform(creatinine)

print("RobustScaler output (outlier-resistant):")
for raw, val in zip(creatinine, cr_robust):
    print(f"  {raw[0]:.1f} → {val[0]:+.3f}")
# Here median = 1.0 and IQR = 0.425, so the normal values land in about [-0.71, +0.71]
# The outlier is still large (+20.0), but it no longer sets the scale for the others
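
The median/IQR arithmetic can also be reproduced by hand. This sketch assumes the scaler's default behaviour of centering on the median and dividing by the 25th-to-75th percentile range. It then compares how wide a band the five normal values occupy under each of the three scalings.

```python
import numpy as np

x = np.array([0.7, 0.9, 1.1, 0.8, 1.3, 9.5])
normal = x < 5.0  # mask for the five non-outlier readings (cutoff is arbitrary)

# Robust scaling by hand: center on the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
x_robust = (x - median) / iqr

# Min-max and z-score for comparison.
x_minmax = (x - x.min()) / (x.max() - x.min())
x_z = (x - x.mean()) / x.std()


def band_width(scaled):
    """Width of the interval occupied by the normal readings after scaling."""
    return scaled[normal].max() - scaled[normal].min()


print(f"median={median:.3f}, IQR={iqr:.3f}")
print(f"band width  min-max: {band_width(x_minmax):.3f}")
print(f"band width  z-score: {band_width(x_z):.3f}")
print(f"band width  robust:  {band_width(x_robust):.3f}")
```

Because the median and IQR barely move when one value is extreme, the normal readings keep a spread of order one, while the outlier simply lands far from the rest.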

When to Use Each

| Scenario | Recommended | Reason |
|---|---|---|
| Neural networks | Standardization | Sigmoid/tanh activations work best near 0 |
| k-NN, k-Means | Either | Distance-based; both work if applied consistently |
| SVM | Standardization | Kernel methods assume zero-mean features |
| Logistic/Linear Regression | Standardization | Interpretable coefficients, faster convergence |
| Data has known bounded range (e.g., pixel values 0–255) | Normalization | [0, 1] range is meaningful |
| Data has heavy outliers | RobustScaler | Median/IQR-based, not pulled by extremes |
| Image pixels for CNNs | Normalization to [0, 1] | Natural range, often expected by pre-trained models |
| Tree-based models | Neither | Decision trees are scale-invariant |

Effect on Gradient Descent Convergence

Python

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Intuition: gradient descent on unscaled features
# If age has range 0–80 and creatinine has range 0.5–15:
# The loss is far more sensitive to the age weight than to the creatinine weight,
# so its contours are long, narrow ellipses rather than circles
# → a step size small enough to stay stable along the steep (age) direction
#   makes very slow progress along the shallow (creatinine) direction
# → convergence is slow and oscillatory

# After standardization: loss landscape is more circular
# → gradient descent converges in fewer steps

# Demonstrable with a simple experiment.
# X (feature matrix) and y (labels) are assumed to be defined already.
sgd_raw = SGDClassifier(max_iter=100, random_state=42)
sgd_scaled = Pipeline([
    ("std", StandardScaler()),
    ("sgd", SGDClassifier(max_iter=100, random_state=42)),
])

scores_raw    = cross_val_score(sgd_raw,    X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(sgd_scaled, X, y, cv=5, scoring="accuracy")

print(f"SGD unscaled:  {scores_raw.mean():.3f}")
print(f"SGD scaled:    {scores_scaled.mean():.3f}")
# The scaled version typically converges faster and often reaches a better solution
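
The "elongated versus circular" intuition can be quantified with the condition number of the loss Hessian. The sketch below uses a least-squares loss as a stand-in, since its Hessian has a closed form; the SGD classifier above optimizes a different loss. The features are synthetic, drawn uniformly over the age and creatinine ranges mentioned in the comments, and `hessian_condition_number` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic, hypothetical features spanning the ranges discussed above.
age = rng.uniform(0, 80, size=n)
creatinine = rng.uniform(0.5, 15, size=n)
X_raw = np.column_stack([age, creatinine])

# Standardize by hand: zero mean, unit (population) variance per column.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)


def hessian_condition_number(features):
    """Condition number of the MSE Hessian for a linear model with an intercept.

    For loss (1/n) * ||A w - y||^2 the Hessian is (2/n) * A^T A, where A is the
    feature matrix with a leading column of ones.
    """
    design = np.column_stack([np.ones(len(features)), features])
    hessian = 2.0 * design.T @ design / len(features)
    return np.linalg.cond(hessian)


cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)

print(f"raw features:          condition number ~ {cond_raw:,.0f}")
print(f"standardized features: condition number ~ {cond_std:.2f}")
```

A condition number near 1 means the loss contours are close to circular, so a single learning rate suits every direction. A large value means the stable step size for the steepest direction is far too small for the shallowest one.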

Coefficient Interpretability After Standardization

Python

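from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train, and feature_names are assumed to be defined already.
# After standardization, all features share the same scale
# → each coefficient is the change in log-odds per one-standard-deviation
#   increase in that feature, so magnitudes can be compared across features
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

lr = pipeline.named_steps["lr"]
scaler = pipeline.named_steps["scaler"]  # scaler.scale_ holds each feature's training std

print("Feature importances (standardized coefficients):")
for name, coef in sorted(zip(feature_names, lr.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True):
    print(f"  {name:<25}: {coef:+.3f}")

# Without standardization: a feature with large raw values (such as age) gets a
# small coefficient regardless of how much it matters
# After standardization: magnitudes give a reasonable first-pass ranking of importance,
# though correlated features can still share or mask each other's weight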
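
Standardized coefficients can be converted back to per-raw-unit coefficients when a result has to be reported in the original units. The sketch below is a hedged illustration on synthetic data from `make_classification`, with columns rescaled by arbitrary factors to mimic features on different scales. It relies on the fitted scaler's `mean_` and `scale_` attributes and checks that the converted model yields the same decision values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the rescaling factors below are arbitrary.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X = X * np.array([20.0, 1.0, 0.05]) + np.array([50.0, 1.0, 0.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

lr = pipeline.named_steps["lr"]
scaler = pipeline.named_steps["scaler"]

# w . ((x - mean) / scale) + b  ==  (w / scale) . x + (b - sum(w * mean / scale))
coef_std = lr.coef_[0]
coef_raw = coef_std / scaler.scale_
intercept_raw = lr.intercept_[0] - np.sum(coef_std * scaler.mean_ / scaler.scale_)

same = np.allclose(pipeline.decision_function(X), X @ coef_raw + intercept_raw)
print("standardized coefficients:", np.round(coef_std, 3))
print("raw-unit coefficients:    ", np.round(coef_raw, 3))
print("decision values match:    ", same)
```

The raw-unit coefficients describe the same model, but their magnitudes depend on each feature's units, so only the standardized ones are suitable for comparing features against each other.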

The Test Set Rule Applies Here Too

Python

from sklearn.preprocessing import MinMaxScaler

# Common mistake with normalization specifically:
# x_min and x_max must come from training data only
# (X_train and X_test are assumed to come from an earlier train/test split)

mm = MinMaxScaler()
mm.fit(X_train)   # learn min/max from training

X_train_scaled = mm.transform(X_train)
X_test_scaled  = mm.transform(X_test)

# Test values CAN go outside [0, 1] if they fall outside the training range
# This is expected and correct: do not re-fit on test data
# A test creatinine of 15 scales to about 1.53 if the training min was 0.5 and max was 10, and that's fine
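
A small numeric check of the out-of-range behaviour, using made-up training and test values. It also shows the `clip=True` option of `MinMaxScaler`, which caps transformed values at the feature range for models that cannot accept inputs outside [0, 1]. Whether clipping is appropriate depends on the model, so treat it as an option rather than a recommendation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical creatinine readings: training min is 0.5, training max is 10.0.
train = np.array([[0.5], [1.0], [2.0], [10.0]])
test = np.array([[0.3], [15.0]])  # both fall outside the training range

# Default behaviour: training statistics are reused, so values may leave [0, 1].
mm = MinMaxScaler().fit(train)
scaled = mm.transform(test)

# clip=True caps transformed values at the [0, 1] feature range.
mm_clip = MinMaxScaler(clip=True).fit(train)
clipped = mm_clip.transform(test)

for raw, s, c in zip(test[:, 0], scaled[:, 0], clipped[:, 0]):
    print(f"raw={raw:>5.1f}  scaled={s:+.3f}  clipped={c:+.3f}")

# Output:
# raw=  0.3  scaled=-0.021  clipped=+0.000
# raw= 15.0  scaled=+1.526  clipped=+1.000
```

Clipping discards how far beyond the training range a value lies, which may be exactly the signal worth keeping for an extreme lab result.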

Interview Answer Template

Q: What's the difference between normalization and standardization?

Normalization (min-max scaling) rescales each feature to [0, 1] by subtracting the minimum and dividing by the range. Standardization (z-score) transforms each feature to zero mean and unit variance by subtracting the mean and dividing by the standard deviation; the result has no fixed range. The key practical difference is outlier sensitivity: with min-max scaling a single outlier becomes the new maximum and compresses all "normal" values into a tiny slice of [0, 1]. Standardization degrades more gracefully because it has no fixed bounds, but its mean and standard deviation are still pulled by the outlier, so for heavy outliers a median/IQR scaler such as RobustScaler is the safer choice. For neural networks and SVMs, standardization is typically preferred because activations and kernels work best with zero-centered inputs on similar scales. For inputs with a truly bounded range (like pixel values for CNNs), normalization to [0, 1] is natural. In both cases, fit the scaler on the training data only and reuse those training statistics to transform the test data; fitting on the test data leaks information.
