Machine Learning Foundations · Lesson 20 of 70
Cross-Validation: When to Use It and Why
Why Cross-Validation?
A single train/validation split is unreliable on small datasets. Which examples happen to land in the validation set can swing the score noticeably, so the resulting performance estimate has high variance.
Cross-validation solves this by evaluating the model on multiple different validation splits and averaging the results.
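To make the variance concrete, here is a minimal sketch that scores the same model on ten different random splits of one small dataset. The synthetic data from `make_classification`, the `LogisticRegression` model, and the `_demo` names are assumptions chosen for illustration, not part of the lesson's own examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

split_scores = []
for seed in range(10):
    # Same data and same model; only the random split changes
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_val, y_val))

split_scores = np.array(split_scores)
print(f"Accuracy across 10 splits: min={split_scores.min():.3f}, max={split_scores.max():.3f}")
```

The gap between the minimum and maximum is the split-to-split noise that averaging over folds is meant to smooth out.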
k-Fold Cross-Validation
Split the data into k roughly equal "folds." Train on k-1 folds and validate on the remaining one. Repeat k times, using a different fold for validation each time, then average the k scores.
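Written out as an explicit loop, the procedure might look like the sketch below. The toy data, the `LogisticRegression` model, and the variable names are assumptions for illustration; in practice `cross_val_score` wraps this same loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data (an assumption): the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores, val_sizes = [], []
for train_idx, val_idx in kf.split(X_demo):
    # Train on k-1 folds, validate on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[val_idx], y_demo[val_idx]))
    val_sizes.append(len(val_idx))

mean_score = float(np.mean(fold_scores))
print(f"Validation fold sizes: {val_sizes}")
print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {mean_score:.3f}")
```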
k=5 example:
Fold 1: [Train | Train | Train | Train | VAL]
Fold 2: [Train | Train | Train | VAL | Train]
Fold 3: [Train | Train | VAL | Train | Train]
Fold 4: [Train | VAL | Train | Train | Train]
Fold 5: [VAL | Train | Train | Train | Train]
Average the 5 validation scores → more reliable estimate
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
# Synthetic data with random labels, so the AUC should hover around 0.5
X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)
model = RandomForestClassifier(n_estimators=100, random_state=42)
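# Simple k-fold (an integer cv with a classifier uses stratified folds, so pass KFold explicitly)
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42), scoring="roc_auc")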
print(f"AUC scores per fold: {cv_scores.round(3)}")
print(f"Mean AUC: {cv_scores.mean():.3f}")
print(f"Std AUC: {cv_scores.std():.3f}")  # Low std = scores are consistent across folds
Stratified k-Fold
For classification, use stratified k-fold to preserve class proportions in each fold. Critical when classes are imbalanced.
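A quick way to see what stratification preserves is to count the positives that land in each validation fold. This is a sketch under assumed inputs: the 10%-positive labels, the placeholder features, and the helper `positives_per_fold` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 50 positives out of 500 (10%)
y_demo = np.concatenate([np.ones(50, dtype=int), np.zeros(450, dtype=int)])
X_demo = np.zeros((500, 1))  # placeholder features; only the labels matter for splitting

def positives_per_fold(splitter):
    """Count positive labels in each validation fold produced by the splitter."""
    return [int(y_demo[val_idx].sum()) for _, val_idx in splitter.split(X_demo, y_demo)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", strat)
```

With stratification every fold receives the same number of positives, while plain shuffled k-fold leaves the count to chance.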
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Imbalanced dataset: 10% positive class
X = np.random.randn(500, 20)
y = np.concatenate([np.ones(50), np.zeros(450)]) # 10% positive
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=skf, scoring="roc_auc")
print(f"Stratified CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Each fold has ~10% positive class, which is more reliable than unstratified k-fold
Leave-One-Out Cross-Validation (LOOCV)
Each sample is used as a validation set of size 1, training on all remaining samples. This is k-fold with k = n_samples.
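The equivalence with k-fold can be checked directly on a toy array. This is a small sketch; the six-sample `X_demo` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)  # 6 toy samples

loo_splits = list(LeaveOneOut().split(X_demo))
kf_splits = list(KFold(n_splits=len(X_demo)).split(X_demo))

for train_idx, val_idx in loo_splits:
    print(f"train={train_idx.tolist()}, val={val_idx.tolist()}")

# Compare the two splitters fold by fold
same = all(
    np.array_equal(a_tr, b_tr) and np.array_equal(a_val, b_val)
    for (a_tr, a_val), (b_tr, b_val) in zip(loo_splits, kf_splits)
)
print(f"{len(loo_splits)} splits; identical to KFold(n_splits=n_samples): {same}")
```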
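from sklearn.model_selection import LeaveOneOut, cross_val_score
loo = LeaveOneOut()
# n_samples = n_folds: extremely thorough but computationally expensive
# Only practical for very small datasets (n < 100)
# ROC AUC is undefined on a one-sample validation fold, so score with accuracy instead;
# each fold scores 0 or 1 and the mean is the LOO accuracy
X_small, y_small = X[25:75], y[25:75]  # 25 positives + 25 negatives (y[:50] is all ones)
cv_scores = cross_val_score(model, X_small, y_small, cv=loo, scoring="accuracy")
print(f"LOO accuracy: {cv_scores.mean():.3f}")
Time-Series Cross-Validation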
For temporal data, you cannot shuffle randomly: that would let the model train on the future and validate on the past. The validation set must always come after the training set in time.
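The ordering constraint can be seen by printing the indices that `TimeSeriesSplit` produces. This is a minimal sketch; the 12-row `X_demo`, assumed to be sorted by time, is for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(24).reshape(12, 2)  # 12 toy samples, assumed to be in time order

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()}, val={val_idx.tolist()}")

# Every training index precedes every validation index
ordered = all(train_idx.max() < val_idx.min() for train_idx, val_idx in splits)
print(f"Training always precedes validation: {ordered}")
```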
from sklearn.model_selection import TimeSeriesSplit
# Fresh 300-sample dataset, assumed to be in time order (matches the sizes shown below)
X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)
# Each split: training grows, validation is always next in time
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    print(f"Fold {fold+1}: train size={len(X_tr)}, val size={len(X_val)}")
# Fold 1: train size=50, val size=50
# Fold 2: train size=100, val size=50
# Fold 3: train size=150, val size=50
# Fold 4: train size=200, val size=50
# Fold 5: train size=250, val size=50
cv_scores = cross_val_score(model, X, y, cv=tscv, scoring="roc_auc")
print(f"Time-series CV AUC: {cv_scores.mean():.3f}")
Cross-Validation for Hyperparameter Tuning
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
# Hold out a test set before tuning; it stays untouched until the final evaluation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}
# GridSearchCV: exhaustive search with CV
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold CV per configuration
    scoring="roc_auc",
    n_jobs=-1,  # Use all available cores
    verbose=1,
)
grid_search.fit(X_trainval, y_trainval)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.3f}")
# Best model: evaluate once on the held-out test set
best_model = grid_search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")
Nested Cross-Validation
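When the same CV is used for both hyperparameter tuning and model evaluation, the best score is optimistically biased, because the winning configuration was selected on those very folds. Use nested CV to avoid this.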
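Spelled out as an explicit loop, the nested procedure might look like the sketch below. The synthetic data, the `LogisticRegression` model, and its `C` grid are assumptions chosen to keep the example fast; the structure (tune inside, score outside) is the point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = []
for train_idx, val_idx in outer_cv.split(X_demo, y_demo):
    # Inner loop: tune hyperparameters on the outer-training portion only
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=3,
        scoring="roc_auc",
    )
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Outer loop: score the refit best model on data the search never saw
    nested_scores.append(search.score(X_demo[val_idx], y_demo[val_idx]))

nested_scores = np.array(nested_scores)
print(f"Nested CV AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

Each outer fold may select different hyperparameters, so the result estimates the performance of the whole tuning procedure rather than of one fixed configuration.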
from sklearn.model_selection import cross_val_score, GridSearchCV
# Outer loop: estimate generalization
# Inner loop: tune hyperparameters
clf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=3,  # Inner CV
    scoring="roc_auc",
)
# Outer CV: nearly unbiased estimate of performance
outer_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
# This is more honest than a single GridSearchCV best_score_
When to Use Which
| Situation | Recommended CV |
|---|---|
| Large dataset (1000+), balanced | Regular k-fold (k=5 or 10) |
| Imbalanced classes | Stratified k-fold |
| Small dataset (under 100) | LOOCV or k=10 |
| Time series / temporal data | TimeSeriesSplit |
| Comparing multiple models | Cross-validation for all |
| Reporting final performance | Test set (not CV) |
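As a rough illustration, the first four rows of the table could be encoded in a small helper. `pick_cv` is a hypothetical function, not a scikit-learn API, and both the order in which the conditions are checked and the fixed `n_splits=5` are assumptions; treat the thresholds as rules of thumb.

```python
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit

def pick_cv(n_samples, temporal=False, imbalanced=False):
    """Hypothetical helper mirroring the table; the thresholds are rules of thumb."""
    if temporal:
        return TimeSeriesSplit(n_splits=5)  # never shuffle time-ordered data
    if n_samples < 100:
        return LeaveOneOut()  # affordable only when the dataset is tiny
    if imbalanced:
        return StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return KFold(n_splits=5, shuffle=True, random_state=42)

print(type(pick_cv(5000)).__name__)
print(type(pick_cv(5000, imbalanced=True)).__name__)
print(type(pick_cv(60)).__name__)
print(type(pick_cv(5000, temporal=True)).__name__)
```

The returned splitter can be passed as the `cv` argument of `cross_val_score` or `GridSearchCV`.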
Interview Answer Template
Q: What is cross-validation and when would you use it?
Cross-validation is a technique for more reliably estimating model performance by training and evaluating on multiple different splits of the data. In k-fold cross-validation, the data is split into k folds, and the model is trained k times — each time using a different fold as the validation set and the rest as training. The k validation scores are averaged for a more stable estimate. I'd use stratified k-fold for imbalanced classification (preserves class ratios in each fold), TimeSeriesSplit for temporal data (training always before validation in time), and LOOCV for very small datasets. Cross-validation is mainly for comparing models and tuning hyperparameters during development — the test set remains locked for the final honest evaluation.