Learnixo

Machine Learning Foundations · Lesson 18 of 70

The Validation Set: Tuning Without Cheating

What the Validation Set Is For

The validation set is held-out data used during development to evaluate the model without touching the test set. Every decision you make based on performance — which hyperparameter wins, when to stop training, which architecture to pick — should be made using validation data.

Training set:    model weights are updated on this
Validation set:  you use this to make decisions — hyperparameters, stopping, architecture
Test set:        used once at the end — never to make decisions

Hyperparameter Tuning

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

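# Synthetic data: features and labels are independent, so expect AUC near 0.5
np.random.seed(42)
X, y = np.random.randn(1000, 20), np.random.randint(0, 2, 1000)

# Carve out the test set first (15%), then split the rest into train and validation.
# test_size=0.18 of the remaining 85% is about 15% of the full data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.18, random_state=42
)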

# Try different hyperparameter combinations; evaluate on VALIDATION only
best_auc = 0.0
best_params = {}

for n_estimators in [50, 100, 200]:
    for max_depth in [3, 5, None]:
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        model.fit(X_train, y_train)
        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

        if val_auc > best_auc:
            best_auc = val_auc
            best_params = {"n_estimators": n_estimators, "max_depth": max_depth}

print(f"Best params: {best_params}")
print(f"Best val AUC: {best_auc:.3f}")
# Test set evaluation happens AFTER this, just once
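The two chained `train_test_split` calls above do not produce a 15% validation set by passing `0.15` twice, because the second `test_size` applies only to what remains after the first split. A minimal sketch of that arithmetic follows. The variable names are illustrative, and actual sample counts may differ by one because of rounding.

```python
# Effective split fractions implied by two chained train_test_split calls.
# The second test_size is a fraction of the data left after the first split.
test_frac = 0.15
val_frac_of_remainder = 0.18

val_frac = (1 - test_frac) * val_frac_of_remainder   # 0.85 * 0.18 = 0.153
train_frac = 1 - test_frac - val_frac                 # 0.697

print(f"train={train_frac:.3f}  val={val_frac:.3f}  test={test_frac:.3f}")
```

So the split is roughly 70% train, 15% validation, and 15% test.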

Early Stopping with Validation Loss

In neural network training, the model is evaluated on the validation set after each epoch. Training stops when validation loss stops improving — before the model overfits.

Python
import torch
import torch.nn as nn

def train_with_early_stopping(model, train_loader, val_loader, patience: int = 5):
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    best_val_loss = float("inf")
    no_improve = 0
    best_weights = None

    for epoch in range(200):
        # Training phase
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()

        # Validation phase: no gradient updates
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_val, y_val in val_loader:
                val_loss += criterion(model(X_val), y_val).item()

        val_loss /= len(val_loader)

        if val_loss < best_val_loss - 1e-4:
            best_val_loss = val_loss
            best_weights = {k: v.clone() for k, v in model.state_dict().items()}
            no_improve = 0
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    # Restore the best weights whether or not early stopping triggered
    if best_weights is not None:
        model.load_state_dict(best_weights)
    return model
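The patience logic can be easier to follow without the training loop around it. The sketch below applies the same rule to a recorded list of validation losses. The helper `early_stopping_epoch` and the loss values are hypothetical, made up for illustration, and `min_delta` plays the role of the `1e-4` threshold in the function above.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-4):
    """Return (stop_epoch, best_epoch) for a recorded validation-loss curve."""
    best_loss = float("inf")
    best_epoch = None
    no_improve = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Meaningful improvement: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            no_improve = 0
        else:
            no_improve += 1
            if no_improve >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting patience
    return len(val_losses) - 1, best_epoch


# Hypothetical curve: improves for four epochs, then drifts upward
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

Training stops three epochs after the best one, which is why the best weights need to be saved and restored rather than taken from the final epoch.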

Model Selection

Python
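from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Reuses X_train, X_val, X_test and their labels from the hyperparameter tuning example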

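# Fixed random_state values keep the comparison reproducible across runs
models = {
    "LogisticRegression": LogisticRegression(),
    "RandomForest":       RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoosting":   GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM":                SVC(probability=True, random_state=42),
}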

val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: {val_scores[name]:.3f}")

# Pick the best model based on validation AUC
best_name = max(val_scores, key=val_scores.get)
best_model = models[best_name]
print(f"\nSelected: {best_name}")

# Now test just once
print(f"Test AUC: {roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]):.3f}")

Validation Leakage (Hyperparameter Overfitting)

If you run many experiments and always pick the best validation score, you may be inadvertently overfitting the validation set. Over enough experiments, you'll find a configuration that scored well by chance.

Run 100 experiments, always pick best validation score
→ Even with random features, one configuration will look good on validation
→ Test set reveals the truth: no better than baseline
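The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions rather than a benchmark: the "experiments" are random linear classifiers, the features and labels are independent noise, and the helper names (`make_data`, `accuracy`) are made up for this example. It uses accuracy rather than AUC to stay dependency-free.

```python
import random

random.seed(0)
N_FEATURES = 20


def make_data(n):
    # Features and labels are independent, so no model can truly beat 50%
    X = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(n)]
    y = [random.randint(0, 1) for _ in range(n)]
    return X, y


def accuracy(weights, X, y):
    """Fraction of samples where the sign of w . x matches the label."""
    correct = 0
    for features, label in zip(X, y):
        score = sum(w * f for w, f in zip(weights, features))
        correct += int((score > 0) == (label == 1))
    return correct / len(y)


X_val, y_val = make_data(100)
X_test, y_test = make_data(2000)

# 100 "experiments": each candidate is a random linear classifier
candidates = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(100)]
val_accs = [accuracy(w, X_val, y_val) for w in candidates]

# Always pick the best validation score, as in the scenario above
best_idx = max(range(len(candidates)), key=lambda i: val_accs[i])
best_val_acc = val_accs[best_idx]
test_acc = accuracy(candidates[best_idx], X_test, y_test)

print(f"best validation accuracy: {best_val_acc:.3f}")
print(f"test accuracy of that pick: {test_acc:.3f}")
```

The winning candidate looks better than chance on validation only because it was selected for that; on the test set it should fall back toward 50%.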

Signs of validation leakage:

  • Validation score much higher than test score
  • Model performed well on validation but poorly in production

Mitigations:

  • Use k-fold cross-validation instead of a single validation split
  • Hold out the test set strictly until the final evaluation
  • Limit the number of experiments (for example, prefer Bayesian optimization to exhaustive grid search, since it typically needs fewer trials)

When a Single Validation Split Isn't Enough

For small datasets (fewer than a few hundred samples), a single validation split may not be representative. Use k-fold cross-validation instead.

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(n_estimators=100, random_state=42)

# k-fold: train on k-1 folds, validate on 1, repeat k times
cv_scores = cross_val_score(model, X_trainval, y_trainval, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# More reliable than single validation split
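To make the "train on k-1 folds, validate on 1" comment concrete, here is a minimal pure-Python sketch of how fold indices can be generated. `k_fold_indices` is a hypothetical helper written for illustration, not scikit-learn's implementation. It does not shuffle or stratify, which you would normally want in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(f"train={train_idx}  val={val_idx}")
```

Every sample appears in exactly one validation fold, so the averaged score uses all of the development data for validation without ever validating on a sample the model was trained on in that round.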

Interview Answer Template

Q: What is the validation set used for?

The validation set is held-out data used during model development to evaluate performance without touching the test set. It guides decisions like hyperparameter tuning (which configuration scores best), architecture selection (which model to pick), and early stopping (when to stop training before the model overfits). The key constraint is that the validation set is never used to compute gradient updates; it only informs your choices as a developer, including which checkpoint to keep. A subtle risk is "validation overfitting": if you run enough experiments and always pick the best validation score, you can overfit the validation set itself. For small datasets, k-fold cross-validation is more reliable than a single validation split. The test set remains completely isolated until you're done with all development decisions.