Machine Learning Foundations · Lesson 16 of 70
Why Do We Split Data?
The Fundamental Problem
If you train and evaluate on the same data, the model can simply memorize training examples and score perfectly — without learning anything useful.
from sklearn.tree import DecisionTreeClassifier
import numpy as np
np.random.seed(42)
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)
# Train and evaluate on SAME data — don't do this
model = DecisionTreeClassifier() # No depth limit — can memorize perfectly
model.fit(X, y)
train_score = model.score(X, y)
print(f"Score on training data: {train_score:.2f}") # 1.00 — perfect but meaningless
# Correct: evaluate on unseen data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Score on test data: {test_score:.2f}") # ~0.50: no better than random guessing
What Each Split Does
Training Set
The model sees this data and adjusts its parameters (weights, tree splits, and so on) to fit it. High accuracy here is expected and not meaningful on its own.
Validation Set
Evaluates the model while you're still making decisions — choosing hyperparameters, comparing architectures, deciding when to stop training. This data is used many times during development.
Test Set
Used once to report final performance. Never used to guide any decisions during development.
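The three roles above can be sketched as a small selection loop. This is a minimal illustration rather than a prescribed recipe: the synthetic dataset, the 60/20/20 proportions, and the candidate values of C are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (an assumption, not from the lesson)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Roughly 60% train, 20% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval
)

# Development: the validation set is consulted once per candidate setting
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical search grid
val_scores = {}
for C in candidate_C:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = clf.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Final evaluation: the test set is touched exactly once, after C is fixed
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print(f"Validation accuracy per C: {val_scores}")
print(f"Chosen C: {best_C}, test accuracy: {test_score:.3f}")
```

The validation set is read four times here (once per candidate), while the test set is read once, only after the choice of C is final.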
Development phase: uses train + validation
Final evaluation phase: uses test (one time only)
Why the Validation Set Alone Isn't Enough
If you tune hyperparameters on the validation set, you're indirectly fitting to it. Over many experiments, you'll find configurations that happen to score well on validation but are overfitting that particular split.
# Scenario: you run 50 experiments, each with different hyperparameters
# Each time you pick the setting with the best validation score
# Eventually you may get lucky on validation but the pattern doesn't generalize
# The test set catches this: it was never seen during any of those 50 experiments
# This is called "validation overfitting" or "hyperparameter leakage"
The test set provides an unbiased estimate precisely because no decisions were made using it.
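A small simulation can make the selection effect visible. The sketch below is a deliberately simplified stand-in: each "experiment" is a model that guesses at random on coin-flip labels, so no configuration has any real signal. The sample sizes and the count of 50 experiments are assumptions taken from the scenario above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_experiments = 100, 100, 50

# Labels are coin flips, so no model can truly do better than 50% on average
y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

val_acc = np.empty(n_experiments)
test_acc = np.empty(n_experiments)
for i in range(n_experiments):
    # Stand-in for one hyperparameter setting: predictions carry no signal
    pred_val = rng.integers(0, 2, n_val)
    pred_test = rng.integers(0, 2, n_test)
    val_acc[i] = (pred_val == y_val).mean()
    test_acc[i] = (pred_test == y_test).mean()

# Pick the experiment that looks best on validation, as in tuning
best = int(np.argmax(val_acc))

print(f"Mean validation accuracy over all experiments: {val_acc.mean():.2f}")
print(f"Best validation accuracy (selected): {val_acc[best]:.2f}")
print(f"Test accuracy of the selected experiment: {test_acc[best]:.2f}")
```

Because the winner is chosen for its validation score, that score is biased upward by construction. The test score of the same experiment was not part of the selection, so it is not inflated in the same way.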
Data Leakage — The Silent Killer
Data leakage occurs when information that wouldn't be available at prediction time influences training or preprocessing, making evaluation metrics misleadingly optimistic.
# Classic leakage: normalize BEFORE splitting
from sklearn.preprocessing import StandardScaler
# WRONG: normalize on all data including test
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X) # Test data statistics contaminate normalization!
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2)
# CORRECT: split first, normalize only on train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train) # Fit on train only
X_test_norm = scaler.transform(X_test) # Apply to test
Other leakage sources:
- Including features derived from future information (e.g., "days until patient was readmitted" as a feature for readmission prediction)
- Duplicate rows that appear in both train and test
- Target encoding computed on the entire dataset before splitting
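Of the sources listed above, duplicate rows are the easiest to check mechanically. The following is a minimal sketch, assuming a small hypothetical table held in a pandas DataFrame: it drops exact duplicates before splitting, then confirms with an inner merge that no row appears on both sides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table containing two repeated rows (hypothetical data for illustration)
df = pd.DataFrame({
    "age":   [34, 51, 34, 29, 62, 45, 51, 38],
    "score": [0.2, 0.9, 0.2, 0.4, 0.7, 0.5, 0.9, 0.3],
    "label": [0, 1, 0, 0, 1, 1, 1, 0],
})

# Remove exact duplicates BEFORE splitting so a row cannot land on both sides
deduped = df.drop_duplicates().reset_index(drop=True)
train_df, test_df = train_test_split(deduped, test_size=0.25, random_state=0)

# Sanity check: an inner merge on all columns lists rows shared by both sets
overlap = train_df.merge(test_df, how="inner", on=list(df.columns))

print(f"Rows before/after de-duplication: {len(df)} / {len(deduped)}")
print(f"Rows shared by train and test: {len(overlap)}")
```

This only catches exact matches. Near-duplicates (for example, the same patient recorded twice with slightly different values) need a domain-specific key, such as splitting by patient ID.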
Temporal Data: Always Split by Time
For time-series and sequential data (clinical records, financial data, user activity logs), random splitting is wrong. It would allow the model to train on future data and predict the past — a data leak.
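To make the leak concrete, the sketch below compares a random split with a chronological one on hypothetical daily data. It checks whether the training set contains any date later than the earliest test date. The 80/20 proportion and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily records with a random binary label
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=1000, freq="D"),
    "label": rng.integers(0, 2, 1000),
})

# Random split: rows from any date can land on either side
rand_train, rand_test = train_test_split(df, test_size=0.2, random_state=0)
leaks_random = bool(rand_train["date"].max() > rand_test["date"].min())

# Chronological split: the most recent 20% of rows form the test set
cutoff = df["date"].sort_values().iloc[int(len(df) * 0.8)]
time_train = df[df["date"] < cutoff]
time_test = df[df["date"] >= cutoff]
leaks_time = bool(time_train["date"].max() > time_test["date"].min())

print(f"Random split trains on dates after the earliest test date: {leaks_random}")
print(f"Time split trains on dates after the earliest test date: {leaks_time}")
```

With the random split, training rows are interleaved with test rows in time, so the model is effectively shown the future. With the chronological split, every training date precedes every test date.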
import numpy as np
import pandas as pd
# Daily records covering 2020-01-01 through 2022-09-26
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=1000, freq="D"),
    "features": np.random.randn(1000, 5).tolist(),
    "label": np.random.randint(0, 2, 1000),
})
# WRONG: a random split (e.g. train_test_split(df)) mixes past and future rows
# CORRECT: split by time, with cut-off dates that fall inside the data's range
train_df = df[df.date < "2022-01-01"]
val_df = df[(df.date >= "2022-01-01") & (df.date < "2022-07-01")]
test_df = df[df.date >= "2022-07-01"]
# Training data is always earlier in time than validation and test
The Right Process End-to-End
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# 1. Load raw data
X_raw = np.random.randn(1000, 20)
y = np.random.randint(0, 2, 1000)
# 2. Split FIRST — before any preprocessing
X_trainval, X_test, y_trainval, y_test = train_test_split(
X_raw, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval, test_size=0.176, random_state=42, stratify=y_trainval
)
# 0.176 * 0.85 ≈ 0.15 of total
# 3. Fit preprocessing on train only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
# 4. Train and tune using train + val
model = LogisticRegression(C=0.1)
model.fit(X_train_s, y_train)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val_s)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")
# 5. Final evaluation on test — one time only
test_auc = roc_auc_score(y_test, model.predict_proba(X_test_s)[:, 1])
print(f"Test AUC: {test_auc:.3f}") # Report this
Interview Answer Template
Q: Why do we split data into train, validation, and test sets?
We split data to get an honest estimate of how well the model will perform on new, unseen data. Training on and evaluating with the same data is meaningless — a model can memorize training examples and appear perfect without learning anything useful. The validation set is used during development to guide decisions like hyperparameter tuning and architecture choices. The test set is held out completely until the end and used once to report final performance. The key rule is that no preprocessing (normalization, imputation, encoding) should be fit on validation or test data — only on training data. For temporal data, we always split by time to prevent the model from training on future events and predicting the past.