Machine Learning Foundations · Lesson 20 of 70

Cross-Validation: When to Use It and Why

Why Cross-Validation?

A single train/validation split gives an unreliable performance estimate on small datasets. The score depends on which examples happen to land in the validation set, so a lucky (or unlucky) draw can shift it substantially. In other words, the estimate has high variance.

Cross-validation reduces this variance by evaluating the model on several different validation splits and averaging the results.
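
To make the variance concrete, here is a minimal sketch that scores the same model on 20 different random 80/20 splits of a small synthetic dataset. The dataset, the choice of logistic regression, and all variable names are assumptions made for illustration; they are not part of the lesson's running example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Small synthetic dataset (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=10, n_informative=4, random_state=0
)
clf_demo = LogisticRegression(max_iter=1000)

# Same model, same data, 20 different random 80/20 splits
single_split_aucs = []
for seed in range(20):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    clf_demo.fit(X_tr, y_tr)
    val_proba = clf_demo.predict_proba(X_val)[:, 1]
    single_split_aucs.append(roc_auc_score(y_val, val_proba))

single_split_aucs = np.array(single_split_aucs)
print(f"Single-split AUC range: {single_split_aucs.min():.3f} to {single_split_aucs.max():.3f}")

# 5-fold CV averages over five validation sets instead of relying on one
cv_auc = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"5-fold CV mean AUC: {cv_auc.mean():.3f}")
```

The spread between the lowest and highest single-split score is the "luck" a single split exposes you to.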

k-Fold Cross-Validation

Split the data into k roughly equal "folds." Train on k-1 folds and validate on the remaining one. Repeat k times, each time holding out a different fold, so every example is used for validation exactly once. Average the k scores.

k=5 example:
Fold 1: [Train | Train | Train | Train | VAL]
Fold 2: [Train | Train | Train | VAL | Train]
Fold 3: [Train | Train | VAL | Train | Train]
Fold 4: [Train | VAL | Train | Train | Train]
Fold 5: [VAL | Train | Train | Train | Train]

Average the 5 validation scores → more reliable estimate
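
The fold layout above can be reproduced on a toy index array. This is a small sketch with 10 samples (an arbitrary size chosen for readability) that prints each split and checks that every sample is held out exactly once. Without shuffling, KFold holds out the first block first, so the order differs from the diagram, but the set of splits is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
indices = np.arange(n_samples)
kf_demo = KFold(n_splits=5)   # no shuffling, so folds are contiguous blocks

# Count how many times each sample lands in a validation fold
val_counts = np.zeros(n_samples, dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(indices), start=1):
    val_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()}, val={val_idx.tolist()}")

# Every sample is used for validation exactly once across the k folds
print(f"Times each sample was validated: {val_counts.tolist()}")
```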

Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

# Toy data: features and labels are pure noise, so expect AUC near 0.5
X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)

model = RandomForestClassifier(n_estimators=100, random_state=42)

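# Simple k-fold. Pass a KFold object explicitly: with a classifier,
# cv=5 on its own makes scikit-learn use stratified folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="roc_auc")
print(f"AUC scores per fold: {cv_scores.round(3)}")
print(f"Mean AUC: {cv_scores.mean():.3f}")
print(f"Std AUC:  {cv_scores.std():.3f}")   # Low std = scores are consistent across folds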

Stratified k-Fold

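For classification, use stratified k-fold to preserve class proportions in each fold. This matters most when classes are imbalanced, where a plain random split can leave a fold with very few (or no) minority-class examples.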

Python

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced dataset: 10% positive class
X = np.random.randn(500, 20)
y = np.concatenate([np.ones(50), np.zeros(450)])   # 10% positive

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=skf, scoring="roc_auc")

print(f"Stratified CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Each fold has ~10% positives, so no fold is left with too few positives to score reliably
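
To see what stratification changes, the sketch below counts the positive examples in each validation fold for plain KFold and for StratifiedKFold, using labels with the same 50/450 split as above. The helper `positives_per_fold` and the placeholder feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X_imb = np.zeros((500, 1))   # features are irrelevant for how the folds are drawn
y_imb = np.concatenate([np.ones(50), np.zeros(450)])   # 10% positive

def positives_per_fold(splitter):
    """Count positive examples in each validation fold (hypothetical helper)."""
    return [int(y_imb[val_idx].sum()) for _, val_idx in splitter.split(X_imb, y_imb)]

plain_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

With 50 positives and 5 folds, the stratified splitter places 10 positives in every fold, while the plain splitter's counts vary with the shuffle.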

Leave-One-Out Cross-Validation (LOOCV)

Each sample is used as a validation set of size 1, training on all remaining samples. This is k-fold with k = n_samples.

Python

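from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

loo = LeaveOneOut()
# n_folds = n_samples: extremely thorough but computationally expensive

# Only practical for very small datasets (n < 100).
# Take 25 positives and 25 negatives so both classes are present
# (with y as built above, y[:50] would be all positives).
small_idx = np.r_[0:25, 50:75]
X_small, y_small = X[small_idx], y[small_idx]

# AUC is undefined on a single validation sample, so collect the
# held-out prediction from every fold and score them together.
loo_proba = cross_val_predict(model, X_small, y_small, cv=loo, method="predict_proba")[:, 1]
print(f"LOO AUC: {roc_auc_score(y_small, loo_proba):.3f}")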
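
The claim that LOOCV is k-fold with k = n_samples can be checked directly. This short sketch uses a made-up 6-sample array and compares the splits from LeaveOneOut with those from an unshuffled KFold(n_splits=6).

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_tiny = np.arange(12).reshape(6, 2)   # 6 samples, 2 features (toy data)

loo_splits = [(tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X_tiny)]
kfold_n_splits = [(tr.tolist(), va.tolist()) for tr, va in KFold(n_splits=6).split(X_tiny)]

print(f"Number of LOO folds: {len(loo_splits)}")
print(f"Validation fold sizes: {[len(va) for _, va in loo_splits]}")
print(f"Same splits as KFold(n_splits=n_samples): {loo_splits == kfold_n_splits}")
```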

Time-Series Cross-Validation

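For temporal data, you cannot shuffle randomly: a shuffled split lets the model train on future observations and validate on past ones, which leaks information. The validation set must always come after the training set in time.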

Python

from sklearn.model_selection import TimeSeriesSplit

# 300 samples; rows are assumed to be sorted by time
X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)

# Each split: training grows, validation is always next in time
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    print(f"Fold {fold+1}: train size={len(X_tr)}, val size={len(X_val)}")

# Fold 1: train size=50, val size=50
# Fold 2: train size=100, val size=50
# Fold 3: train size=150, val size=50
# Fold 4: train size=200, val size=50
# Fold 5: train size=250, val size=50

cv_scores = cross_val_score(model, X, y, cv=tscv, scoring="roc_auc")
print(f"Time-series CV AUC: {cv_scores.mean():.3f}")
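
As a sanity check on the ordering requirement, the following sketch confirms that every training index precedes every validation index in each split. It also shows the `gap` parameter of TimeSeriesSplit, which leaves some time steps unused between training and validation. The gap of 10 is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ts = np.arange(300).reshape(-1, 1)   # row index doubles as the time step

order_ok = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X_ts):
    # Latest training time step must precede the earliest validation time step
    order_ok.append(bool(train_idx.max() < val_idx.min()))
print(f"Training always before validation: {all(order_ok)}")

# gap=10 skips 10 time steps between training and validation, which can
# help when neighbouring observations are strongly correlated
gapped = TimeSeriesSplit(n_splits=5, gap=10)
gaps = [
    int(val_idx.min() - train_idx.max() - 1)
    for train_idx, val_idx in gapped.split(X_ts)
]
print(f"Unused time steps between train and validation: {gaps}")
```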

Cross-Validation for Hyperparameter Tuning

Python

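from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Hold out a test set first; the search only ever sees the train+val portion
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)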

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth":    [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# GridSearchCV: exhaustive search with CV
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                   # 5-fold CV per configuration
    scoring="roc_auc",
    n_jobs=-1,              # Parallel
    verbose=1,
)
grid_search.fit(X_trainval, y_trainval)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.3f}")

# Best model — test set evaluation
best_model = grid_search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Test AUC:   {test_auc:.3f}")
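
When the grid is too large to search exhaustively, RandomizedSearchCV evaluates only a fixed number of sampled configurations. The sketch below is one possible setup on synthetic data. The dataset, the value lists, and `n_iter=5` are assumptions chosen to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_rs, y_rs = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# Sample 5 of the 27 possible configurations instead of trying all of them
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,                 # 3-fold CV per sampled configuration
    scoring="roc_auc",
    random_state=42,      # makes the sampled configurations reproducible
)
random_search.fit(X_rs, y_rs)

print(f"Configurations tried: {len(random_search.cv_results_['params'])}")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.3f}")
```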

Nested Cross-Validation

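When the same CV is used for hyperparameter tuning AND model evaluation, the reported score is optimistic: best_score_ is the maximum over many configurations scored on the same folds. Nested CV avoids this by tuning inside an inner loop and scoring on outer folds that the tuning never saw.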

Python

from sklearn.model_selection import cross_val_score, GridSearchCV

# Outer loop: estimate generalization
# Inner loop: tune hyperparameters
clf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=3,   # Inner CV
    scoring="roc_auc",
)

# Outer CV: each outer fold re-runs the inner search on its own training data,
# so the outer validation fold never influences hyperparameter selection
outer_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
# This is more honest than a single GridSearchCV best_score_
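
Passing a GridSearchCV object to cross_val_score hides the two loops. Written out by hand, nested CV looks roughly like the sketch below. The synthetic dataset, the reduced grid, and the 3x3 fold counts are assumptions chosen to keep it quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_n, y_n = make_classification(n_samples=150, n_features=10, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

nested_scores = []
chosen_params = []
for train_idx, test_idx in outer_cv.split(X_n, y_n):
    # Inner loop: tune hyperparameters using only the outer training portion
    inner_search = GridSearchCV(
        RandomForestClassifier(n_estimators=25, random_state=42),
        param_grid={"max_depth": [3, 5]},
        cv=3,
        scoring="roc_auc",
    )
    inner_search.fit(X_n[train_idx], y_n[train_idx])
    chosen_params.append(inner_search.best_params_)

    # Outer loop: score the refit best model on data the search never saw
    proba = inner_search.predict_proba(X_n[test_idx])[:, 1]
    nested_scores.append(roc_auc_score(y_n[test_idx], proba))

print(f"Params chosen per outer fold: {chosen_params}")
print(f"Nested CV AUC per outer fold: {[round(s, 3) for s in nested_scores]}")
```

The chosen hyperparameters may differ between outer folds. Nested CV estimates the performance of the whole tuning procedure, not of one fixed configuration.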

When to Use Which

| Situation | Recommended CV |
|---|---|
| Large dataset (1000+), balanced | Regular k-fold (k=5 or 10) |
| Imbalanced classes | Stratified k-fold |
| Small dataset (under 100) | LOOCV or k=10 |
| Time series / temporal data | TimeSeriesSplit |
| Comparing multiple models | Cross-validation for all |
| Reporting final performance | Test set (not CV) |
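
For the "comparing multiple models" row, one reasonable approach is to score every candidate on the same folds so that differences come from the models and not from the splits. The sketch below assumes two arbitrary candidates and a synthetic dataset. A splitter with a fixed `random_state` yields identical folds on each call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_c, y_c = make_classification(n_samples=200, n_features=10, random_state=0)

# One splitter shared by all candidates, so every model sees the same folds
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {
    name: cross_val_score(est, X_c, y_c, cv=shared_cv, scoring="roc_auc")
    for name, est in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```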

Interview Answer Template

Q: What is cross-validation and when would you use it?

Cross-validation is a technique for more reliably estimating model performance by training and evaluating on multiple different splits of the data. In k-fold cross-validation, the data is split into k folds, and the model is trained k times — each time using a different fold as the validation set and the rest as training. The k validation scores are averaged for a more stable estimate. I'd use stratified k-fold for imbalanced classification (preserves class ratios in each fold), TimeSeriesSplit for temporal data (training always before validation in time), and LOOCV for very small datasets. Cross-validation is mainly for comparing models and tuning hyperparameters during development — the test set remains locked for the final honest evaluation.
