Z-Score Standardization

The Formula

z = (x - μ) / σ

μ = mean of the feature (computed from training data)
σ = standard deviation of the feature (computed from training data)

Result (on the training data the statistics were computed from):
  - Mean of scaled feature = 0
  - Standard deviation of scaled feature = 1
  - No fixed range: values outside [-3, 3] are possible (especially outliers)

Implementation from Scratch

Python

import numpy as np

def standardize(X_train: np.ndarray, X_test: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Fit mean and std on training data, apply to both splits.
    ddof=0: population std (sklearn's default for StandardScaler).
    """
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=0)

    # Avoid division by zero for constant features
    sigma = np.where(sigma == 0, 1.0, sigma)

    X_train_z = (X_train - mu) / sigma
    X_test_z  = (X_test  - mu) / sigma   # use TRAINING mean/std
    return X_train_z, X_test_z

# Patient clinical features
X_train = np.array([
    [45, 1.2,  8, 138],
    [67, 4.5, 15, 165],
    [32, 0.8,  3, 115],
    [55, 2.1, 11, 142],
    [48, 1.5,  9, 130],
])
X_test = np.array([[72, 6.1, 20, 180]])

X_train_z, X_test_z = standardize(X_train, X_test)
print("Standardized training data:")
print(X_train_z.round(2))
print(f"\nMean: {X_train_z.mean(axis=0).round(2)}")   # ≈ [0, 0, 0, 0]
print(f"Std:  {X_train_z.std(axis=0).round(2)}")      # ≈ [1, 1, 1, 1]
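The docstring above settles on ddof=0. As a small sketch of what that choice changes, the snippet below compares the population and sample standard deviations on the age column from the example. The variable names are illustrative only.

```python
import numpy as np

# Age column from the training example above
age_train = np.array([45, 67, 32, 55, 48], dtype=float)

std_pop = age_train.std(ddof=0)      # divides by n (StandardScaler's convention)
std_sample = age_train.std(ddof=1)   # divides by n - 1 (pandas' default for .std())

# The two differ by the constant factor sqrt(n / (n - 1)), so the choice rescales
# every z-score of a feature by the same amount and matters less as n grows.
n = age_train.size
ratio = std_sample / std_pop

print(f"population std: {std_pop:.3f}")
print(f"sample std:     {std_sample:.3f}")
print(f"ratio:          {ratio:.3f}")
```

With only five rows the two estimates differ noticeably; on realistic sample sizes the gap is usually negligible, but mixing conventions can explain small mismatches between a pandas computation and StandardScaler.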

Using sklearn

Python

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# X, y: full feature matrix and labels (assumed to be defined already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)   # compute mean and std from training only

X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# Inspect learned statistics
print("Feature means:", scaler.mean_.round(2))
print("Feature stds:", scaler.scale_.round(2))

# Inverse: back to original scale for reporting
X_train_original = scaler.inverse_transform(X_train_scaled)
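As a sanity check, the from-scratch arithmetic and StandardScaler should agree. The sketch below assumes the five-patient sample from earlier and uses hypothetical names (X_demo, scaler_demo) so that it does not collide with the variables above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five-patient sample reused from the from-scratch example
X_demo = np.array([
    [45, 1.2,  8, 138],
    [67, 4.5, 15, 165],
    [32, 0.8,  3, 115],
    [55, 2.1, 11, 142],
    [48, 1.5,  9, 130],
], dtype=float)

scaler_demo = StandardScaler().fit(X_demo)

# Manual statistics with ddof=0, matching StandardScaler's convention
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0, ddof=0)

same_mean = np.allclose(scaler_demo.mean_, mu)
same_scale = np.allclose(scaler_demo.scale_, sigma)
same_output = np.allclose(scaler_demo.transform(X_demo), (X_demo - mu) / sigma)

# inverse_transform undoes the scaling (up to floating-point error)
X_scaled_demo = scaler_demo.transform(X_demo)
round_trip = np.allclose(scaler_demo.inverse_transform(X_scaled_demo), X_demo)

print(same_mean, same_scale, same_output, round_trip)
```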

Why Z-Score Works for Gradient Descent

Python

# Intuition for why scale matters in gradient-based optimization

# Unscaled: the gradient for each weight is proportional to its feature's values,
# so weights on large-valued features receive much larger gradients
# age gradient: ∂L/∂w_age ~ large (age values ~50)
# creatinine gradient: ∂L/∂w_cr ~ small (creatinine values ~1.2)
# → learning rate must be small enough to keep the age updates from overshooting
# → at that learning rate, the creatinine weight barely moves per step

# After standardization: all features have std=1
# → all gradients are on a comparable scale
# → a single learning rate works for all features
# → convergence is faster and more stable

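from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# X_raw, y: unscaled feature matrix and labels (assumed to be defined already)
# Compare convergence speed; small max_iter values may trigger ConvergenceWarning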
for max_iter in [10, 50, 100, 500]:
    # Raw features
    raw_scores = cross_val_score(
        SGDClassifier(max_iter=max_iter, random_state=42),
        X_raw, y, cv=5, scoring="accuracy"
    )
    # Standardized
    scaled_scores = cross_val_score(
        Pipeline([("std", StandardScaler()), ("sgd", SGDClassifier(max_iter=max_iter, random_state=42))]),
        X_raw, y, cv=5, scoring="accuracy"
    )
    print(f"iter={max_iter:3d}: raw={raw_scores.mean():.3f}, scaled={scaled_scores.mean():.3f}")
# The scaled version typically reaches good performance in far fewer iterations
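The same intuition can be checked numerically without sklearn. The following is a minimal sketch, assuming made-up synthetic data and plain batch gradient descent on squared error (not SGDClassifier); the helper gd_final_mse and all constants are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up features on very different scales
age = rng.normal(50, 12, n)
creatinine = rng.normal(1.2, 0.4, n)
X_raw = np.column_stack([age, creatinine])
y = 0.05 * age + 2.0 * creatinine + rng.normal(0, 0.1, n)

def gd_final_mse(X, y, steps=100):
    """Run batch gradient descent on mean squared error from zero weights.

    Returns the final MSE and the condition number of the loss Hessian.
    """
    Xc = X - X.mean(axis=0)           # center so no intercept term is needed
    yc = y - y.mean()
    hessian = 2 * Xc.T @ Xc / len(yc)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()   # step size 1/L, a standard safe choice
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * Xc.T @ (Xc @ w - yc) / len(yc)
        w -= lr * grad
    return np.mean((Xc @ w - yc) ** 2), np.linalg.cond(hessian)

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

mse_raw, cond_raw = gd_final_mse(X_raw, y)
mse_std, cond_std = gd_final_mse(X_std, y)

print(f"raw:          condition number={cond_raw:.1f}, MSE after 100 steps={mse_raw:.4f}")
print(f"standardized: condition number={cond_std:.1f}, MSE after 100 steps={mse_std:.4f}")
```

The condition number of the Hessian is one way to quantify "features on different scales": the larger it is, the more the single learning rate has to compromise between directions, and the slower the small-scale feature's weight converges.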

Z-Scores and Outlier Detection

Python

import numpy as np
import pandas as pd

# Z-scores give a natural interpretation: how many stds from the mean
# |z| > 2: moderately unusual (~5% of data for Gaussian)
# |z| > 3: extreme (~0.3% of data)

# Clinical use: flag abnormal lab values
def flag_outliers(X: np.ndarray, feature_names: list, threshold: float = 3.0) -> pd.DataFrame:
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # constant features: avoid division by zero
    z_scores = np.abs((X - mu) / sigma)

    flags = []
    for i, row in enumerate(z_scores):
        for j, z in enumerate(row):
            if z > threshold:
                flags.append({
                    "patient_idx": i,
                    "feature": feature_names[j],
                    "z_score": round(float(z), 2),
                    "raw_value": X[i, j],
                })
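    # Pass columns explicitly so an empty result still has the expected schema
    return pd.DataFrame(flags, columns=["patient_idx", "feature", "z_score", "raw_value"])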

# In an ML pipeline, you might cap extreme values before scaling
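def cap_outliers(X: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    # Clip each feature to [mu - k*sigma, mu + k*sigma]; in a train/test setup,
    # compute mu and sigma on the training split only
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return np.clip(X, mu - z_threshold * sigma, mu + z_threshold * sigma)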
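To make the thresholding concrete, here is a small sketch on a single made-up feature (19 typical creatinine readings plus one extreme reading). It repeats the same arithmetic as the helpers above rather than calling them, so it can run on its own; the data are invented for illustration.

```python
import numpy as np

# Made-up creatinine values: 19 typical readings and one extreme reading
creatinine = np.array([1.0] * 19 + [9.0])

mu = creatinine.mean()
sigma = creatinine.std()             # ddof=0, as in the helpers above
z = (creatinine - mu) / sigma

# Flag values more than 3 standard deviations from the mean
flagged_idx = np.flatnonzero(np.abs(z) > 3.0)

# Cap values to the [mu - 3*sigma, mu + 3*sigma] band
capped = np.clip(creatinine, mu - 3.0 * sigma, mu + 3.0 * sigma)

print("flagged indices:", flagged_idx)
print("max before capping:", creatinine.max())
print("max after capping: ", round(float(capped.max()), 2))

# Caveat: with ddof=0, the largest possible |z| among n points is sqrt(n - 1),
# so a threshold of 3 can never fire on fewer than 11 points.
max_abs_z_for_n5 = np.sqrt(5 - 1)
print("largest possible |z| with 5 points:", max_abs_z_for_n5)
```

Note that the extreme value inflates both the mean and the standard deviation it is judged against, which is one reason robust alternatives (median and IQR) are often considered when outliers are heavy.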

Standardization in a Full Pipeline

Python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# X_train, y_train: training split from earlier (assumed to be defined already)
# With the scaler inside the Pipeline, each CV fold re-fits it on that fold's
# training portion only, so scaling statistics never leak from the validation portion
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LogisticRegression(max_iter=1000)),
])

# Hyperparameter search over model parameters (scaler has no tunable params here)
param_grid = {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")
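Once a pipeline is fitted, its scaler step can be read back to report coefficients in raw units. The sketch below is one way to do that, assuming synthetic stand-in data from make_classification and hypothetical names (X_demo, pipe); it relies on the identity w_raw = w_std / σ that follows from substituting z = (x - μ) / σ into the linear score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, shifted and rescaled so columns sit on different scales
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = X_demo * np.array([10.0, 0.5, 3.0, 25.0]) + np.array([50.0, 1.5, 9.0, 140.0])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

scaler_step = pipe.named_steps["scaler"]
model_step = pipe.named_steps["model"]

# Coefficients per one standard deviation (the scale the model was trained on)
coef_std = model_step.coef_.ravel()

# Equivalent coefficients per one raw unit, plus the matching intercept
coef_raw = coef_std / scaler_step.scale_
intercept_raw = model_step.intercept_[0] - np.sum(
    coef_std * scaler_step.mean_ / scaler_step.scale_
)

# Both parameterizations produce the same decision scores
scores_pipeline = pipe.decision_function(X_demo)
scores_raw = X_demo @ coef_raw + intercept_raw
print("scores match:", np.allclose(scores_pipeline, scores_raw))
```

After a grid search, the same steps are available through search.best_estimator_.named_steps, since GridSearchCV refits the best pipeline on the full training set by default.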

When NOT to Standardize

Python

# 1. Tree-based models: don't need scaling (thresholds are scale-invariant)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
# No scaler needed — adding StandardScaler doesn't hurt but adds no value

# 2. When the feature is already a probability or fraction (0–1 range, meaningful)
# Standardizing it loses the "fraction" interpretation

# 3. When you need to interpret the raw feature scale for clinical reporting
# (alternatively, standardize for training and use the fitted scaler's
# inverse_transform to map values back to raw units for reporting)

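# 4. Sparse data: centering (subtracting the mean) destroys sparsity, and
# StandardScaler raises an error on sparse input unless with_mean=False
# Use MaxAbsScaler or StandardScaler(with_mean=False) to keep the sparse format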
from sklearn.preprocessing import MaxAbsScaler
sparse_scaler = MaxAbsScaler()   # Divides by max abs value, preserves zeros
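The sparse case is easy to verify on a toy matrix. This is a minimal sketch assuming a made-up 4×3 SciPy CSR matrix; it counts stored non-zeros before and after scaling and shows that the default StandardScaler refuses sparse input.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Made-up sparse matrix: mostly zeros, 4 stored values
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 4.0,  0.0],
    [2.0, 0.0,  0.0],
    [0.0, 0.0, 10.0],
    [0.0, 8.0,  0.0],
]))

# Both of these only rescale columns, so zeros stay zeros
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
X_unit_var = StandardScaler(with_mean=False).fit_transform(X_sparse)

# Centering would turn every zero into -mean/std, so the default
# StandardScaler (with_mean=True) rejects sparse input
try:
    StandardScaler().fit_transform(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print("non-zeros before:", X_sparse.nnz)
print("non-zeros after MaxAbsScaler:", X_maxabs.nnz)
print("non-zeros after StandardScaler(with_mean=False):", X_unit_var.nnz)
print("default StandardScaler rejected sparse input:", centering_rejected)
```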

Interview Answer Template

Q: What is Z-score standardization and when do you use it?

Z-score standardization transforms each feature to have zero mean and unit variance: subtract the feature mean and divide by its standard deviation. The result has no fixed range: for roughly Gaussian data about 99.7% of values fall between -3 and +3, but outliers can extend well beyond that. It's the usual choice for scale-sensitive algorithms such as logistic regression, SVMs, and neural networks. For gradient-based training, putting features on a common scale means all weights receive comparably sized gradient updates, which leads to faster and more stable convergence; it also keeps regularization penalties and margin computations from being dominated by large-valued features. Standardization also puts coefficients on a common per-standard-deviation scale, so their magnitudes can be compared as a rough guide to feature importance, though correlated features can still make that comparison misleading. The key rule: compute mean and standard deviation from training data only, then apply the same transform to validation and test sets. I always use it inside a sklearn Pipeline when doing cross-validation, so the scaler is re-fit on each training fold and no statistics leak from the validation fold.
