Z-Score Standardization
Z-score standardization in depth: formula, implementation, why it works for gradient-based models, how to handle outliers, and a clinical example with correct pipeline usage.
The Formula
z = (x - μ) / σ
μ = mean of the feature (computed from training data)
σ = standard deviation of the feature (computed from training data)
Result:
- Mean of scaled feature = 0
- Standard deviation of scaled feature = 1
- No fixed range: values outside [-3, 3] are possible (especially for outliers)
Implementation from Scratch
import numpy as np
def standardize(X_train: np.ndarray, X_test: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Fit mean and std on training data, apply to both splits.
    ddof=0: population std (sklearn's default for StandardScaler).
    """
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=0)
    # Avoid division by zero for constant features
    sigma = np.where(sigma == 0, 1.0, sigma)
    X_train_z = (X_train - mu) / sigma
    X_test_z = (X_test - mu) / sigma  # use TRAINING mean/std
    return X_train_z, X_test_z
# Patient clinical features
X_train = np.array([
    [45, 1.2, 8, 138],
    [67, 4.5, 15, 165],
    [32, 0.8, 3, 115],
    [55, 2.1, 11, 142],
    [48, 1.5, 9, 130],
])
X_test = np.array([[72, 6.1, 20, 180]])
X_train_z, X_test_z = standardize(X_train, X_test)
print("Standardized training data:")
print(X_train_z.round(2))
print(f"\nMean: {X_train_z.mean(axis=0).round(2)}") # ≈ [0, 0, 0, 0]
print(f"Std: {X_train_z.std(axis=0).round(2)}")  # ≈ [1, 1, 1, 1]
Using sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# X (feature matrix) and y (labels) are assumed to be already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train) # compute mean and std from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Inspect learned statistics
print("Feature means:", scaler.mean_.round(2))
print("Feature stds:", scaler.scale_.round(2))
# Inverse: back to original scale for reporting
X_train_original = scaler.inverse_transform(X_train_scaled)
Why Z-Score Works for Gradient Descent
# Intuition for why scale matters in gradient-based optimization
# Unscaled: weights have very different gradients because features are on different scales
# age gradient: ∂L/∂w_age ~ large (age values ~50)
# creatinine gradient: ∂L/∂w_cr ~ tiny (creatinine values ~1.2)
# → learning rate must be tiny to avoid exploding age gradient
# → creatinine gradient is too small to update effectively
# After standardization: all features have std=1
# → all gradients are on the same scale
# → a single learning rate works for all features
# → convergence is faster and more stable
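The intuition above can be checked numerically. The following is a minimal sketch, assuming a plain linear model with mean squared error loss, synthetic age and creatinine values, placeholder labels, and an arbitrary starting weight vector (none of these come from the clinical data in this article). It compares the per-weight gradient magnitudes before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.normal(50.0, 15.0, n)          # large-scale feature (synthetic)
creatinine = rng.normal(1.2, 0.4, n)     # small-scale feature (synthetic)
X = np.column_stack([age, creatinine])
y = rng.integers(0, 2, n).astype(float)  # placeholder binary labels


def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Gradient of the mean squared error for the linear model y_hat = X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


w0 = np.ones(2)  # arbitrary starting weights (assumption)

# Gradient on raw features
grad_raw = mse_gradient(X, y, w0)

# Gradient on standardized features
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
grad_scaled = mse_gradient(X_z, y, w0)

ratio_raw = abs(grad_raw[0] / grad_raw[1])
ratio_scaled = abs(grad_scaled[0] / grad_scaled[1])
print(f"raw gradient ratio (age / creatinine):    {ratio_raw:.1f}")
print(f"scaled gradient ratio (age / creatinine): {ratio_scaled:.2f}")
```

On raw features the age gradient should dominate by more than an order of magnitude, while after standardization the two gradients should be of similar size. The exact numbers depend on the seed and the starting weights.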
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Compare convergence speed
# (X_raw, y: unscaled feature matrix and labels, assumed to be already loaded)
for max_iter in [10, 50, 100, 500]:
    # Raw features
    raw_scores = cross_val_score(
        SGDClassifier(max_iter=max_iter, random_state=42),
        X_raw, y, cv=5, scoring="accuracy"
    )
    # Standardized
    scaled_scores = cross_val_score(
        Pipeline([("std", StandardScaler()), ("sgd", SGDClassifier(max_iter=max_iter, random_state=42))]),
        X_raw, y, cv=5, scoring="accuracy"
    )
    print(f"iter={max_iter:3d}: raw={raw_scores.mean():.3f}, scaled={scaled_scores.mean():.3f}")
# The scaled version typically reaches good performance in far fewer iterations
Z-Scores and Outlier Detection
# Z-scores give a natural interpretation: how many stds from the mean
# |z| > 2: moderately unusual (~5% of data for Gaussian)
# |z| > 3: extreme (~0.3% of data)
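The percentages quoted above follow from the standard normal distribution. As a quick check, assuming the feature is exactly Gaussian, the two-sided tail probability can be computed with the standard library's complementary error function (the helper name `two_sided_tail` is illustrative only).

```python
import math


def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2.0))


for k in (2.0, 3.0):
    print(f"P(|z| > {k:.0f}) = {two_sided_tail(k):.4f}")
# P(|z| > 2) = 0.0455
# P(|z| > 3) = 0.0027
```

Real lab values are often skewed or heavy-tailed, so the share of points beyond a given threshold can be noticeably higher than these Gaussian figures.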
import pandas as pd
# Clinical use: flag abnormal lab values
def flag_outliers(X: np.ndarray, feature_names: list, threshold: float = 3.0) -> pd.DataFrame:
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # constant features: avoid division by zero
    z_scores = np.abs((X - mu) / sigma)
    flags = []
    for i, row in enumerate(z_scores):
        for j, z in enumerate(row):
            if z > threshold:
                flags.append({
                    "patient_idx": i,
                    "feature": feature_names[j],
                    "z_score": round(float(z), 2),
                    "raw_value": X[i, j],
                })
    return pd.DataFrame(flags)
# In an ML pipeline, you might cap extreme values before scaling
def cap_outliers(X: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    # To avoid leakage, derive mu and sigma from the training split only
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Clip each feature to [mu - k*sigma, mu + k*sigma]
    return np.clip(X, mu - z_threshold * sigma, mu + z_threshold * sigma)
Standardization in a Full Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Pipeline guarantees no data leakage in cross-validation
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# Hyperparameter search over model parameters (scaler has no tunable params here)
param_grid = {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")
When NOT to Standardize
# 1. Tree-based models: don't need scaling (thresholds are scale-invariant)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
# No scaler needed — adding StandardScaler doesn't hurt but adds no value
# 2. When the feature is already a probability or fraction (0–1 range, meaningful)
# Standardizing it loses the "fraction" interpretation
# 3. When you need to interpret the raw feature scale for clinical reporting
#    (Alternatively, standardize and use the fitted scaler's inverse_transform to report values on the original scale)
# 4. Sparse data: centering (subtracting the mean) turns zeros into non-zeros and destroys sparsity
#    Use MaxAbsScaler, or StandardScaler(with_mean=False), to keep the sparse format
from sklearn.preprocessing import MaxAbsScaler
sparse_scaler = MaxAbsScaler()  # Divides by max abs value, preserves zeros
Interview Answer Template
Q: What is Z-score standardization and when do you use it?
Z-score standardization transforms each feature to have zero mean and unit variance: subtract the feature mean and divide by its standard deviation. The result has no fixed range. For roughly Gaussian data about 99.7% of values fall between -3 and +3, but outliers can extend well beyond that. It's the preferred scaling method for gradient-based and other scale-sensitive algorithms (logistic regression, SVMs, neural networks) because uniform feature scales mean all weights receive comparably sized gradient updates, which leads to faster and more stable convergence. It also puts coefficients on a common scale: each one describes the effect of a one-standard-deviation change, so magnitudes can be compared across features, although with correlated features "larger coefficient means more important" is only a rough guide. The key rule: compute mean and standard deviation from training data only, then apply the same transform to validation and test sets. I always use it inside a sklearn Pipeline when doing cross-validation to prevent any data leakage between folds.
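The point about comparable coefficients can be illustrated with a small synthetic sketch. It assumes a noise-free linear target and ordinary least squares via NumPy; the data and the helper `fit_linear` are hypothetical and only serve the illustration. Both features contribute about equally per standard deviation, yet their raw coefficients differ by a factor of 100.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)      # feature with std around 1
x2 = rng.normal(0.0, 100.0, n)    # feature with std around 100
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.02 * x2          # noise-free target (assumption)


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares with an intercept; returns the feature coefficients only."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


raw_coef = fit_linear(X, y)

# Standardize, then refit on the scaled features
mu, sigma = X.mean(axis=0), X.std(axis=0)
std_coef = fit_linear((X - mu) / sigma, y)

print("raw coefficients:         ", raw_coef.round(3))
print("standardized coefficients:", std_coef.round(3))
```

The raw coefficients recover 2.0 and 0.02, which says more about units than about influence. After standardization each coefficient equals the raw coefficient times the feature's standard deviation, so both land near 2 and can be compared directly.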