Machine Learning Foundations · Lesson 2 of 70

Training, Validation, and Testing — What Each Does

Why Split Data at All?

A model trained and evaluated on the same data will look perfect — it has memorized the answers. Splitting forces the model to prove it has learned generalizable patterns, not just memorized training examples.

The Three Splits

Training Set (~70%)

What it does: the model sees this data during training and adjusts its weights to minimize error on it.

What it doesn't do: tell you how well the model will generalize.

Python

# Conceptually — scikit-learn handles the mechanics
model.fit(X_train, y_train)   # Weights adjusted based on training data

Validation Set (~15%)

What it does: evaluates the model during development — used to tune hyperparameters and decide when to stop training.

Why it's not the training set: you need fresh data to detect overfitting. If you tune hyperparameters using training loss, you'll overfit the training set.

Python

val_score = model.score(X_val, y_val)   # Guides decisions: add regularization? more epochs?

# Typical use: pick the best model configuration
for lr in [0.1, 0.01, 0.001]:
    model = train(X_train, y_train, learning_rate=lr)
    val_acc = evaluate(model, X_val, y_val)
    if val_acc > best_val_acc:
        best_lr = lr

Test Set (~15%)

What it does: provides the final, unbiased estimate of model performance. Used exactly once, after all development is complete.

The critical rule: you must not look at the test set during development. If you tune hyperparameters based on test set performance, you've effectively trained on the test set — your reported accuracy will be optimistic.

Python

# Only run this after all development decisions are finalized
final_score = model.score(X_test, y_test)
print(f"Test accuracy: {final_score:.3f}")   # This number goes in your report

A Concrete Example

Python

from sklearn.model_selection import train_test_split
import numpy as np

# 1000 drug samples with features and labels
X = np.random.randn(1000, 20)
y = np.random.randint(0, 3, 1000)   # 3 drug classes

# First split: hold out 20% for testing
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 75/25 of remaining → 60% train, 15% val, 25% already test
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1875, random_state=42, stratify=y_trainval
)
# 0.1875 * 0.8 ≈ 0.15 of total

print(f"Train:      {len(X_train)} samples")   # ~600
print(f"Validation: {len(X_val)} samples")     # ~150
print(f"Test:       {len(X_test)} samples")    # ~200

# ─── Development phase ───────────────────────────────────────
# Train and tune using only X_train and X_val

# ─── Final evaluation ────────────────────────────────────────
# ONE call to test set at the very end
final_accuracy = model.score(X_test, y_test)

The Stratify Parameter

When classes are imbalanced, random splitting can give each split very different class distributions. stratify=y ensures each split preserves the original class ratios.

Python

# Without stratify: split might give test 40% positive, train 10% positive
# With stratify=y: each split mirrors the overall class distribution

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,       # Critical for imbalanced datasets
    random_state=42,  # Reproducibility
)

Common Mistakes

| Mistake | Why It's Wrong | Fix | |---|---|---| | Evaluate on training data | Optimistic; model memorized answers | Use separate val/test sets | | Tune on test set | "Leaks" info from future; overly optimistic | Use val set for tuning | | Normalize before splitting | Test data statistics contaminate normalization | Split first, then fit normalization on train only | | Look at test set early | Biases your development decisions | Treat test set as locked until the very end |

Fitting Preprocessing on Training Data Only

A subtle but important rule: preprocessing (normalization, imputation, encoding) must be fit on training data only, then applied to val and test.

Python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # Fit on train, then transform

# Apply SAME scaler (same mean/std) to val and test — no fitting here
X_val_scaled  = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Why: if you fit on val/test, you leak their statistics into training

Interview Answer Template

Q: What is the difference between training, validation, and test sets?

The training set is what the model learns from — its weights are updated to minimize loss on this data. The validation set is used during development to evaluate how the model generalizes while we're still making decisions (hyperparameter tuning, architecture choices, early stopping). The test set is held out until the very end and used exactly once to report final performance — it simulates how the model will behave on truly unseen data. A critical rule: never use the test set during development or you'll get an optimistically biased estimate of real-world performance.

What is Machine Learning?

Next Lesson

What is a Feature and a Label?