Normalization vs Standardization
Compare normalization (Min-Max) and standardization (Z-score): formulas, when to use each, how they handle outliers, and which to choose for different algorithms and data distributions.
Two Approaches to the Same Goal
Both transform features onto a comparable scale, but with different formulas, different properties, and different use cases.
Normalization (Min-Max):
x_scaled = (x - x_min) / (x_max - x_min)
Output range: [0, 1]
Standardization (Z-score):
x_scaled = (x - μ) / σ
Output: zero mean, unit variance, no fixed range
Side-by-Side Comparison
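As a sanity check before the scikit-learn comparison, both formulas can be applied by hand with NumPy and compared against the library scalers. This is a minimal sketch: the four input values are illustrative only, and the Z-score uses the population standard deviation (`ddof=0`), which is what `StandardScaler` computes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values only (not taken from any dataset)
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Min-Max: subtract the minimum, divide by the range
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score: subtract the mean, divide by the population std (ddof=0)
x_zscore = (x - x.mean(axis=0)) / x.std(axis=0)

# Both hand-computed results should agree with scikit-learn's scalers
matches_minmax = np.allclose(x_minmax, MinMaxScaler().fit_transform(x))
matches_zscore = np.allclose(x_zscore, StandardScaler().fit_transform(x))
print(matches_minmax, matches_zscore)
```

Note that `np.std` and `StandardScaler` both default to the population formula; pandas' `Series.std()` defaults to the sample formula (`ddof=1`), so hand checks done with pandas will differ slightly.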
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Serum creatinine values (mg/dL): typical clinical range 0.5-1.5, outlier at 9.5
creatinine = np.array([[0.7], [0.9], [1.1], [0.8], [1.3], [9.5]])
# Normalization
mm = MinMaxScaler()
cr_mm = mm.fit_transform(creatinine)
# Standardization
std = StandardScaler()
cr_std = std.fit_transform(creatinine)
print(f"{'Value':>8} {'MinMax':>8} {'Z-score':>8}")
for raw, mm_val, std_val in zip(creatinine, cr_mm, cr_std):
    print(f"{raw[0]:>8.1f} {mm_val[0]:>8.3f} {std_val[0]:>8.3f}")
# Output:
#    Value   MinMax  Z-score
#      0.7    0.000   -0.528
#      0.9    0.023   -0.465
#      1.1    0.045   -0.402
#      0.8    0.011   -0.497
#      1.3    0.068   -0.340
#      9.5    1.000    2.232   <- outlier pushed to 1.0 in MinMax, about 2.2 in Z-score
How Outliers Affect Each Method
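One way to quantify the effect is to look at the statistic each method divides by, with and without the 9.5 value. The sketch below uses the same creatinine values; the `scale_stats` helper is a hypothetical name introduced here, and the interquartile range (IQR) is included as the denominator a median/IQR-based scaler would use.

```python
import numpy as np

creatinine = np.array([0.7, 0.9, 1.1, 0.8, 1.3, 9.5])
without_outlier = creatinine[creatinine < 5.0]  # drops the 9.5 value

def scale_stats(values):
    """Return the statistic each scaling approach divides by."""
    q25, q75 = np.percentile(values, [25, 75])
    return {
        "range": values.max() - values.min(),  # Min-Max denominator
        "std": values.std(),                   # Z-score denominator (ddof=0)
        "iqr": q75 - q25,                      # median/IQR-based denominator
    }

with_stats = scale_stats(creatinine)
without_stats = scale_stats(without_outlier)

# How many times larger each denominator becomes once the outlier is included
inflation = {name: with_stats[name] / without_stats[name] for name in with_stats}

for name, factor in inflation.items():
    print(f"{name:>5}: x{factor:.1f} larger with the outlier")
```

With only six samples the standard deviation inflates almost as much as the range. In larger samples a single outlier typically moves the standard deviation less, while the range always follows the most extreme value.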
Normalization with an outlier:
creatinine values: 0.7, 0.9, 1.1, 0.8, 1.3, 9.5
x_max = 9.5, so all "normal" values get compressed into [0, 0.07]
The outlier dominates the scale: all other values cluster near zero
Standardization with an outlier:
μ = 2.38, σ = 3.19 (population standard deviation, as StandardScaler computes it)
Normal values: -0.53 to -0.34 (a spread of about 0.19, versus about 0.07 under Min-Max)
Outlier: +2.2 (stands out as an extreme value)
The values are less tightly packed, but the mean and std are still pulled by the outlier
# RobustScaler: uses median and IQR instead of mean and std
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()
cr_robust = robust.fit_transform(creatinine)
print("RobustScaler output (outlier-resistant):")
for raw, val in zip(creatinine, cr_robust):
    print(f" {raw[0]:.1f} -> {val[0]:+.3f}")
# Normal values stay in a tight, interpretable range
# Outlier is still large but doesn't compress the others
When to Use Each
| Scenario | Recommended | Reason |
|---|---|---|
| Neural networks | Standardization | Sigmoid/tanh activations work best near 0 |
| k-NN, k-Means | Either | Distance-based; both work if applied consistently |
| SVM | Standardization | Kernel methods assume zero-mean features |
| Logistic/Linear Regression | Standardization | Interpretable coefficients, faster convergence |
| Data has known bounded range (e.g., pixel values 0-255) | Normalization | [0, 1] range is meaningful |
| Data has heavy outliers | RobustScaler | Median/IQR-based, not pulled by extremes |
| Image pixels for CNNs | Normalization to [0, 1] | Natural range, often expected by pre-trained models |
| Tree-based models | Neither | Decision trees are scale-invariant |
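The "distance-based" rows can be made concrete with a small sketch. It measures how much of the total squared Euclidean distance between patients comes from the age column, before and after standardization. The six patients are hypothetical and `age_share_of_distance` is a helper name invented for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: [age in years, creatinine in mg/dL]
X = np.array([
    [25.0, 0.7], [40.0, 0.9], [55.0, 1.1],
    [70.0, 0.8], [35.0, 1.3], [60.0, 1.0],
])

def age_share_of_distance(data):
    """Fraction of total squared pairwise distance contributed by column 0."""
    diffs = data[:, None, :] - data[None, :, :]   # all pairwise differences
    per_feature = (diffs ** 2).sum(axis=(0, 1))   # squared distance per feature
    return per_feature[0] / per_feature.sum()

share_raw = age_share_of_distance(X)
share_scaled = age_share_of_distance(StandardScaler().fit_transform(X))

print(f"age share, raw:          {share_raw:.4f}")
print(f"age share, standardized: {share_scaled:.4f}")
```

On the raw data, age accounts for nearly all of the distance, so a k-NN or k-Means model would effectively ignore creatinine. After standardization each feature has unit variance and contributes an equal share.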
Effect on Gradient Descent Convergence
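A small NumPy sketch can make the convergence effect measurable. It runs plain gradient descent on a least-squares problem and counts the updates needed, once on centered but unscaled features and once on standardized ones. Everything here is an assumption made for illustration: the data is synthetic (uniform features over the ranges used in this section, a made-up linear target), the loss is mean squared error rather than a classifier loss, and `gd_steps` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic features over the ranges discussed in this section (assumed uniform)
age = rng.uniform(0, 80, n)
creatinine = rng.uniform(0.5, 15, n)
X = np.column_stack([age, creatinine])
y = 0.05 * age + 2.0 * creatinine + rng.normal(0, 0.1, n)  # made-up target

def gd_steps(X, y, rel_tol=1e-8, max_steps=100_000):
    """Count gradient-descent updates needed to minimize mean squared error."""
    Xc = X - X.mean(axis=0)              # centering stands in for an intercept
    yc = y - y.mean()
    H = Xc.T @ Xc / len(yc)              # Hessian of the quadratic loss
    b = Xc.T @ yc / len(yc)
    step = 1.0 / np.linalg.eigvalsh(H).max()   # safe fixed step size
    w = np.zeros(X.shape[1])
    grad = H @ w - b
    target = rel_tol * np.linalg.norm(grad)    # stop at a relative gradient norm
    steps = 0
    while np.linalg.norm(grad) > target and steps < max_steps:
        w -= step * grad
        grad = H @ w - b
        steps += 1
    return steps

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
steps_raw = gd_steps(X, y)
steps_std = gd_steps(X_std, y)
print(f"unscaled:     {steps_raw} steps")
print(f"standardized: {steps_std} steps")
```

The step size is limited by the steepest direction of the loss, so on unscaled features progress along the shallow direction is slow. Standardizing brings the curvatures close together, and the same algorithm needs far fewer updates.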
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Intuition: gradient descent on unscaled features
# If age has range 0-80 and creatinine has range 0.5-15:
# The loss surface is much steeper along the age weight than along the creatinine weight
# -> a step size small enough to stay stable in the steep direction makes slow progress in the shallow one
# -> convergence is slow and oscillatory
# After standardization: loss landscape is more circular
# -> gradient descent converges in fewer steps
# Demonstrable with a simple experiment
# X, y: feature matrix and labels, assumed to be defined already
sgd_raw = SGDClassifier(max_iter=100, random_state=42)
sgd_scaled = Pipeline([("std", StandardScaler()), ("sgd", SGDClassifier(max_iter=100, random_state=42))])
scores_raw = cross_val_score(sgd_raw, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(sgd_scaled, X, y, cv=5, scoring="accuracy")
print(f"SGD unscaled: {scores_raw.mean():.3f}")
print(f"SGD scaled: {scores_scaled.mean():.3f}")
# Scaled version typically converges faster and to a better solution
Coefficient Interpretability After Standardization
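The link between raw and standardized coefficients can be checked directly: a standardized coefficient is the raw coefficient multiplied by that feature's standard deviation. The sketch below swaps in `LinearRegression` (ordinary least squares) because the relationship is exact there; with a regularized `LogisticRegression` it holds only approximately. The data is synthetic and the target coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: age in years, creatinine in mg/dL (both made up)
age = rng.uniform(20, 80, 300)
creatinine = rng.uniform(0.5, 1.5, 300)
X = np.column_stack([age, creatinine])
y = 0.02 * age + 1.0 * creatinine + rng.normal(0, 0.1, 300)

# Fit once on raw features and once on standardized features
raw_model = LinearRegression().fit(X, y)
scaler = StandardScaler().fit(X)
std_model = LinearRegression().fit(scaler.transform(X), y)

# Standardized coefficient = raw coefficient * feature std,
# so dividing by scaler.scale_ recovers the per-unit (raw) coefficient.
recovered_raw = std_model.coef_ / scaler.scale_

print("raw coefficients:         ", np.round(raw_model.coef_, 3))
print("standardized coefficients:", np.round(std_model.coef_, 3))
print("recovered raw:            ", np.round(recovered_raw, 3))
```

The raw age coefficient looks small only because one year is a small step relative to the spread of ages. On the standardized scale both coefficients describe the effect of a one-standard-deviation change and can be compared directly.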
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# After standardization, all features have the same scale
# -> coefficient magnitudes are comparable across features
#    (a rough guide to importance, less reliable when features are strongly correlated)
# X_train, y_train, feature_names: assumed to be defined already
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
lr = pipeline.named_steps["lr"]
scaler = pipeline.named_steps["scaler"]
print("Feature importances (standardized coefficients):")
for name, coef in sorted(zip(feature_names, lr.coef_[0]), key=lambda x: abs(x[1]), reverse=True):
    print(f" {name:<25}: {coef:+.3f}")
# Without standardization: age coefficient is tiny (because age values are large)
# After standardization: coefficients are on a common per-standard-deviation scale
The Test Set Rule Applies Here Too
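A tiny runnable version of the rule, using made-up creatinine values, shows what happens when a test value exceeds the training maximum, and what re-fitting on the test set would do instead. The arrays here are illustrative assumptions, not data from the article.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up creatinine values (mg/dL); the test set holds a value above the training max
X_train = np.array([[0.5], [1.0], [2.0], [10.0]])
X_test = np.array([[0.8], [15.0]])

scaler = MinMaxScaler().fit(X_train)       # learns min=0.5, max=10.0 from training only
X_test_scaled = scaler.transform(X_test)   # reuses the training min/max
print(X_test_scaled.ravel())               # second value lands above 1.0

# Anti-pattern, shown only for contrast: re-fitting on the test set
leaky = MinMaxScaler().fit_transform(X_test)
print(leaky.ravel())                       # forced back into [0, 1], hiding the shift
```

The value above 1.0 is informative: it signals that the test point lies outside anything seen in training. If downstream code strictly requires inputs in [0, 1], recent scikit-learn versions offer a `clip=True` option on `MinMaxScaler`; whether clipping is appropriate depends on the model.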
# Common mistake with normalization specifically:
# x_min and x_max must come from training data only
mm = MinMaxScaler()
mm.fit(X_train) # learn min/max from training
X_train_scaled = mm.transform(X_train)
X_test_scaled = mm.transform(X_test)
# Test values CAN go outside [0, 1] if they fall outside the training range
# This is expected and correct: do not re-fit on test data
# A test creatinine of 15 scales to about 1.5 if the training range was 0.5 to 10, and that's fine
Interview Answer Template
Q: What's the difference between normalization and standardization?
Normalization (Min-Max scaling) rescales each feature to [0, 1] by subtracting the minimum and dividing by the range. Standardization (Z-score) transforms each feature to zero mean and unit variance by subtracting the mean and dividing by the standard deviation. The key practical difference is outlier sensitivity: a single outlier becomes the new maximum under normalization, so all "normal" values are compressed into a tiny part of [0, 1]. Standardization is less tightly tied to the single most extreme value, though its mean and standard deviation are still pulled by outliers; for heavy outliers, a median/IQR-based scaler such as RobustScaler is more resistant. For neural networks and SVMs, standardization is typically preferred because activations and kernels work best with zero-centered inputs on comparable scales. For inputs with a truly bounded range (like pixel values for CNNs), normalization to [0, 1] is natural. In both cases, fit the scaler on training data only and reuse those training statistics to transform the test data; this avoids data leakage.