Normalization vs Standardization

Two Approaches to the Same Goal

Both transform features onto a comparable scale — but with different formulas, different properties, and different use cases.

Normalization (Min-Max):
  x_scaled = (x - x_min) / (x_max - x_min)
  Output range: [0, 1]

Standardization (Z-score):
  x_scaled = (x - μ) / σ
  Output: zero mean, unit variance — no fixed range
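As a quick check of the two formulas, the sketch below applies them by hand with NumPy. The input array is made up for illustration and is not from the article. It assumes the population standard deviation (ddof=0), which is NumPy's default and also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Illustrative values only (not from the article's dataset)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: subtract the minimum, divide by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: subtract the mean, divide by the standard deviation
# np.std defaults to the population formula (ddof=0)
x_zscore = (x - x.mean()) / x.std()

print("min-max:", np.round(x_minmax, 3))   # spans exactly 0 to 1
print("z-score:", np.round(x_zscore, 3))   # centered on 0, roughly -1.41 to +1.41
print("z-score mean:", round(float(x_zscore.mean()), 6))
print("z-score std: ", round(float(x_zscore.std()), 6))
```

Both formulas divide by zero when a feature is constant, so hand-rolled versions need a guard for that case.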

Side-by-Side Comparison

Python

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Serum creatinine values (mg/dL) — typical clinical range 0.5–1.5, outlier at 9.5
creatinine = np.array([[0.7], [0.9], [1.1], [0.8], [1.3], [9.5]])

# Normalization
mm = MinMaxScaler()
cr_mm = mm.fit_transform(creatinine)

# Standardization
std = StandardScaler()
cr_std = std.fit_transform(creatinine)

print(f"{'Value':>8}  {'MinMax':>8}  {'Z-score':>8}")
for raw, mm_val, std_val in zip(creatinine, cr_mm, cr_std):
    print(f"{raw[0]:>8.1f}  {mm_val[0]:>8.3f}  {std_val[0]:>8.3f}")

# Output:
#    Value    MinMax   Z-score
#      0.7     0.000    -0.528
#      0.9     0.023    -0.465
#      1.1     0.045    -0.402
#      0.8     0.011    -0.497
#      1.3     0.068    -0.340
#      9.5     1.000     2.232  ← outlier maps to 1.0 in MinMax, about 2.2 in Z-score

How Outliers Affect Each Method

Normalization with an outlier:
  creatinine values: 0.7, 0.9, 1.1, 0.8, 1.3, 9.5
  x_min = 0.7, x_max = 9.5 → all "normal" values get compressed into [0, 0.07]
  The outlier dominates the scale, so all other values cluster near zero

Standardization with an outlier:
  μ = 2.38, σ = 3.19  (population standard deviation, as StandardScaler computes it)
  Normal values: -0.53 to -0.34
  Outlier: +2.23  (stands out as an extreme value)
  There is no fixed range to squeeze into, but the mean and std are themselves pulled by the outlier, so the normal values still span only about 0.19 standard units
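To see how strongly one reading moves the statistics each scaler depends on, the following sketch recomputes them with and without the 9.5 value. The cutoff of 5.0 used to drop the outlier is an assumption made only for this illustration.

```python
import numpy as np

creatinine = np.array([0.7, 0.9, 1.1, 0.8, 1.3, 9.5])
typical = creatinine[creatinine < 5.0]   # assumed cutoff, only to drop the 9.5 reading

def summarize(values):
    """Return the statistics that min-max and z-score scaling depend on."""
    return {
        "max": float(values.max()),
        "mean": float(values.mean()),
        "std": float(values.std()),   # population std (ddof=0)
    }

with_outlier = summarize(creatinine)
without_outlier = summarize(typical)

for key in ("max", "mean", "std"):
    print(f"{key:>4}: {without_outlier[key]:.2f} without outlier -> "
          f"{with_outlier[key]:.2f} with outlier")
# max goes from 1.30 to 9.50, mean from 0.96 to 2.38, std from 0.22 to 3.19
```

A single value changes the maximum by a factor of about 7 and the standard deviation by a factor of about 15, which is why both scalers end up describing the outlier more than the typical patients.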

Python

# RobustScaler: uses median and IQR instead of mean and std
from sklearn.preprocessing import RobustScaler

robust = RobustScaler()
cr_robust = robust.fit_transform(creatinine)

print("RobustScaler output (outlier-resistant):")
for raw, val in zip(creatinine, cr_robust):
    print(f"  {raw[0]:.1f} → {val[0]:+.3f}")
# Normal values land between about -0.71 and +0.71 (median 1.0, IQR 0.425)
# Outlier is still large (+20.0) but doesn't compress the others
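The median/IQR formula can be reproduced by hand. This is a minimal sketch that assumes RobustScaler's default settings (centering on the median and scaling by the 25th to 75th percentile range) and compares the manual result against the library output.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

creatinine = np.array([[0.7], [0.9], [1.1], [0.8], [1.3], [9.5]])

median = np.median(creatinine)                 # 1.0
q1, q3 = np.percentile(creatinine, [25, 75])   # 0.825 and 1.25
iqr = q3 - q1                                  # 0.425

# Robust scaling: subtract the median, divide by the interquartile range
manual = (creatinine - median) / iqr
library = RobustScaler().fit_transform(creatinine)

print("manual:", np.round(manual.ravel(), 3))
print("matches RobustScaler:", np.allclose(manual, library))
```

Because the median and quartiles are computed from the middle of the sorted data, the 9.5 reading barely influences them, unlike the mean, standard deviation, or maximum.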

When to Use Each

| Scenario | Recommended | Reason |
|---|---|---|
| Neural networks | Standardization | Sigmoid/tanh activations work best near 0 |
| k-NN, k-Means | Either | Distance-based; both work if applied consistently |
| SVM | Standardization | Kernels such as RBF depend on distances or dot products, so features with large scales would dominate |
| Logistic/Linear Regression | Standardization | Comparable coefficients, faster convergence |
| Data has known bounded range (e.g., pixel values 0–255) | Normalization | [0, 1] range is meaningful |
| Data has heavy outliers | RobustScaler | Median/IQR-based, not pulled by extremes |
| Image pixels for CNNs | Normalization to [0, 1] | Natural range, often expected by pre-trained models |
| Tree-based models | Neither | Decision trees are scale-invariant |
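The reason distance-based methods need some form of scaling can be shown with three hypothetical patients. The patient values and the feature ranges used for min-max scaling (age 0 to 80, creatinine 0.5 to 15) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical patients: [age in years, creatinine in mg/dL]
a = np.array([40.0, 1.0])
b = np.array([50.0, 1.0])   # 10 years older, same creatinine
c = np.array([41.0, 6.0])   # similar age, markedly elevated creatinine

# Assumed feature ranges for min-max scaling
lo = np.array([0.0, 0.5])
hi = np.array([80.0, 15.0])

def scale(p):
    """Min-max scale a patient vector using the assumed ranges."""
    return (p - lo) / (hi - lo)

def dist(p, q):
    """Euclidean distance between two patient vectors."""
    return float(np.linalg.norm(p - q))

raw_ab, raw_ac = dist(a, b), dist(a, c)
scaled_ab, scaled_ac = dist(scale(a), scale(b)), dist(scale(a), scale(c))

print(f"raw:    d(a,b)={raw_ab:.3f}  d(a,c)={raw_ac:.3f}")        # 10.000 vs 5.099
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")  # 0.125 vs 0.345
```

On raw values, patient c looks like the nearest neighbour of a because age differences are numerically larger. After scaling, the creatinine gap counts in proportion to its range and b becomes the nearest neighbour.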

Effect on Gradient Descent Convergence

Python

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Intuition: gradient descent on unscaled features
# If age has range 0–80 and creatinine has range 0.5–15:
# The loss surface is steep along the age weight and shallow along the creatinine weight
# → the learning rate must stay small to avoid overshooting in the steep direction
# → progress along the shallow direction is slow and the path zig-zags

# After standardization: loss landscape is more circular
# → gradient descent converges in fewer steps

# Demonstrable with a simple experiment
# (X and y are assumed to be an existing feature matrix and label vector)
sgd_raw = SGDClassifier(max_iter=100, random_state=42)
sgd_scaled = Pipeline([
    ("std", StandardScaler()),
    ("sgd", SGDClassifier(max_iter=100, random_state=42)),
])

scores_raw    = cross_val_score(sgd_raw,    X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(sgd_scaled, X, y, cv=5, scoring="accuracy")

print(f"SGD unscaled:  {scores_raw.mean():.3f}")
print(f"SGD scaled:    {scores_scaled.mean():.3f}")
# With the same max_iter budget, the scaled pipeline typically scores higher
# (accuracy here is only an indirect sign of faster convergence)
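To look at convergence speed directly rather than through accuracy, the sketch below runs plain batch gradient descent on a least-squares problem and counts the steps needed. Everything in it is an assumption for illustration: the synthetic age and creatinine data, the made-up linear target, the tolerance, and the step cap. It is not the article's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 80, n)             # synthetic ages
creatinine = rng.uniform(0.5, 15, n)    # synthetic creatinine values
X_raw = np.column_stack([age, creatinine])
y = 0.03 * age + 0.5 * creatinine + rng.normal(0, 0.1, n)   # made-up linear target

# Standardize each column with its own mean and population std
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def gd_steps(X, y, tol=1e-6, max_steps=50_000):
    """Batch gradient descent on least squares.

    Returns (steps used, condition number of the Hessian).
    """
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    H = A.T @ A / len(A)                        # Hessian of 0.5 * mean squared error
    lr = 1.0 / np.linalg.eigvalsh(H).max()      # step size 1/L, stable for this quadratic
    w = np.zeros(A.shape[1])
    for step in range(max_steps):
        grad = A.T @ (A @ w - y) / len(A)
        if np.linalg.norm(grad) < tol:
            return step, float(np.linalg.cond(H))
        w -= lr * grad
    return max_steps, float(np.linalg.cond(H))  # hit the cap without converging

steps_raw, cond_raw = gd_steps(X_raw, y)
steps_std, cond_std = gd_steps(X_std, y)

print(f"raw:          {steps_raw} steps, condition number {cond_raw:.1f}")
print(f"standardized: {steps_std} steps, condition number {cond_std:.2f}")
```

The condition number of the Hessian measures how elongated the loss surface is. Standardizing brings it close to 1, so the same algorithm needs far fewer steps, while the raw version may simply hit the step cap.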

Coefficient Interpretability After Standardization

Python

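from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# After standardization, all features have the same scale
# → each coefficient is the change in log-odds per one standard deviation of its feature,
#   so magnitudes can be compared across features
# (X_train, y_train, and feature_names are assumed to be defined already)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

lr = pipeline.named_steps["lr"]
scaler = pipeline.named_steps["scaler"]   # holds mean_ and scale_ for converting back to original units

print("Standardized coefficients, largest magnitude first:")
for name, coef in sorted(zip(feature_names, lr.coef_[0]), key=lambda x: abs(x[1]), reverse=True):
    print(f"  {name:<25}: {coef:+.3f}")

# Without standardization: the age coefficient is tiny partly because one year is a small step in age
# After standardization: magnitudes are comparable, though correlated features and
# regularization still affect them, so treat this as a rough ranking rather than true importance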
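Standardized coefficients can be converted back to per-original-unit effects using the scaler's stored statistics. The sketch below uses synthetic data from make_classification, with made-up offsets and scales to mimic different units, and checks that the converted coefficients reproduce the pipeline's own decision function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the multipliers and offsets are made up to give features different units
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_demo = X_demo * np.array([20.0, 3.0, 0.5]) + np.array([50.0, 2.0, 1.0])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_demo, y_demo)

scaler = pipe.named_steps["scaler"]
lr = pipe.named_steps["lr"]

coef_per_sd = lr.coef_[0]                    # log-odds change per 1 standard deviation
coef_per_unit = coef_per_sd / scaler.scale_  # log-odds change per 1 original unit
intercept_raw = lr.intercept_[0] - np.sum(coef_per_sd * scaler.mean_ / scaler.scale_)

# The raw-unit linear model must give the same scores as the pipeline
manual_scores = X_demo @ coef_per_unit + intercept_raw
print("same decision function:", np.allclose(manual_scores, pipe.decision_function(X_demo)))
print("per-SD coefficients:  ", np.round(coef_per_sd, 3))
print("per-unit coefficients:", np.round(coef_per_unit, 3))
```

The two coefficient sets describe the same model. The per-SD version is the one to compare across features, and the per-unit version is the one to quote when explaining the effect of, say, one additional year of age.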

The Test Set Rule Applies Here Too

Python

# Common mistake with normalization specifically:
# x_min and x_max must come from training data only

mm = MinMaxScaler()
mm.fit(X_train)   # learn min/max from training

X_train_scaled = mm.transform(X_train)
X_test_scaled  = mm.transform(X_test)

# Test values CAN go outside [0, 1] if they fall outside the training range
# This is expected and correct, so do not re-fit on test data
# Example: with training min 0.5 and max 10, a test creatinine of 15 scales to (15 - 0.5) / 9.5 ≈ 1.53, which is fine
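A small worked example makes the out-of-range behaviour concrete. The training and test values below are made up. The `clip=True` option shown at the end is available in MinMaxScaler from scikit-learn 0.24 onward, and whether clipping is appropriate depends on the downstream model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up creatinine values (mg/dL)
train = np.array([[0.5], [0.9], [1.2], [4.0], [10.0]])
test = np.array([[0.3], [1.0], [15.0]])   # two values fall outside the training range

# Fit on training data only, then reuse the learned min/max on the test data
mm = MinMaxScaler().fit(train)
test_scaled = mm.transform(test)
print(np.round(test_scaled.ravel(), 3))   # (0.3-0.5)/9.5, (1.0-0.5)/9.5, (15-0.5)/9.5

# Optional: clip transformed values into [0, 1] if the model cannot handle out-of-range inputs
mm_clip = MinMaxScaler(clip=True).fit(train)
test_clipped = mm_clip.transform(test)
print(np.round(test_clipped.ravel(), 3))
```

Clipping discards the information that a test value was more extreme than anything seen in training, so leaving values unclipped is usually the safer default.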

Interview Answer Template

Q: What's the difference between normalization and standardization?

Normalization (Min-Max scaling) rescales each feature to [0, 1] by subtracting the minimum and dividing by the range. Standardization (Z-score) transforms to zero mean and unit variance by subtracting the mean and dividing by the standard deviation. The key practical difference is outlier sensitivity: with normalization, a single outlier becomes the new maximum and compresses all "normal" values into a tiny part of the [0, 1] range. Standardization has no fixed range to squeeze into, so it is usually less distorted, though its mean and standard deviation are still pulled by the outlier; for heavy outliers, a median/IQR-based scaler such as RobustScaler is safer. For neural networks and SVMs, standardization is typically preferred because gradient-based training and kernels behave better on zero-centered, comparably scaled input. For truly bounded inputs (like pixel values for CNNs), normalization to [0, 1] is natural. In both cases, fit the scaler on training data only and reuse those training statistics to transform the test data; this avoids data leakage.
