Learnixo
Back to blog
AI Systemsintermediate

The Test Set: One Shot, Final Score

Understand the test set's role: final, unbiased evaluation, why it must be used exactly once, test set contamination risks, and how to report honest model performance for AI systems.

Asma Hafeez KhanMay 16, 20265 min read
Machine LearningTest SetModel EvaluationData LeakageInterview
Share:𝕏

The Golden Rule

Use the test set exactly once, after all development is complete.

The test set exists to answer one question: how will this model perform on new, unseen data in the real world? Every time you look at the test set during development, you're introducing bias into that estimate.


What the Test Set Represents

The test set simulates deployment. It contains data the model has never seen — no training on it, no validation decisions made based on it. Its score is the one number that goes in your paper, report, or production decision.

Python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

X, y = np.random.randn(2000, 20), np.random.randint(0, 2, 2000)

# 1. Split data  test is locked away immediately
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# 2. All development uses only trainval
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.18)

# 3. Train and tune (many iterations allowed here)
model = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X_train, y_train)

val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")

# 4. Final evaluation  run this ONCE, after all decisions are made
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")   # This is the number you report

print("\nClassification Report on Test Set:")
print(classification_report(y_test, model.predict(X_test)))

Test Set Contamination

Contamination occurs when the test set influences any decision during development, making the final score optimistic.

Explicit contamination:

Python
# WRONG: using test set to pick the best model
for model in [model_a, model_b, model_c]:
    test_score = model.score(X_test, y_test)   # Never do this during selection
    if test_score > best:
        best = test_score
        best_model = model

Implicit contamination:

Python
# WRONG: fitting preprocessing on all data (including test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_all_normalized = scaler.fit_transform(X)   # X includes test data!
X_train = X_all_normalized[:1700]
X_test  = X_all_normalized[1700:]            # Test statistics leaked into normalization

Reporting Test Results Honestly

Python
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    f1_score, classification_report, confusion_matrix
)

# Compute and report relevant metrics on test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

report = {
    "test_auc_roc":            roc_auc_score(y_test, y_proba),
    "test_auc_pr":             average_precision_score(y_test, y_proba),
    "test_f1_macro":           f1_score(y_test, y_pred, average="macro"),
    "test_samples":            len(y_test),
    "positive_class_fraction": y_test.mean(),
}

for k, v in report.items():
    print(f"{k}: {v:.4f}" if isinstance(v, float) else f"{k}: {v}")

When the Test Set Reveals Bad News

If the test score is significantly lower than validation, this means:

  1. Validation overfitting — you ran too many experiments and the validation split is no longer representative
  2. Distribution mismatch — test data has a different distribution than train/val (data drift)
  3. Small test set — high variance in test score (wide confidence interval)
Python
gap = val_auc - test_auc

if gap > 0.10:
    print("WARNING: Significant val/test gap. Investigate before deployment:")
    print("  1. Was the test set used previously for any decision?")
    print("  2. Are train/val/test from different time periods or sources?")
    print("  3. Is the test set large enough for a stable estimate?")

Confidence Intervals on Test Metrics

A single test score has uncertainty — especially on small test sets. Bootstrap confidence intervals quantify this.

Python
from sklearn.utils import resample

def bootstrap_auc(y_true, y_pred_proba, n_iterations: int = 1000, ci: float = 0.95):
    """95% confidence interval for AUC-ROC using bootstrap."""
    scores = []
    for _ in range(n_iterations):
        idx = resample(range(len(y_true)))
        score = roc_auc_score(y_true[idx], y_pred_proba[idx])
        scores.append(score)

    scores.sort()
    lo = int((1 - ci) / 2 * n_iterations)
    hi = int((1 + ci) / 2 * n_iterations)
    return scores[lo], np.mean(scores), scores[hi]

lo, mean, hi = bootstrap_auc(np.array(y_test), y_proba)
print(f"AUC-ROC: {mean:.3f} (95% CI: [{lo:.3f}, {hi:.3f}])")

Test Set in Production Thinking

Python
# Production model evaluation is like a continuous test set
# New data arriving daily = fresh test examples
# Monitor performance over time for drift

class ProductionMonitor:
    def __init__(self, model, baseline_auc: float):
        self.model = model
        self.baseline_auc = baseline_auc
        self.recent_scores = []

    def evaluate_batch(self, X_new, y_new):
        """Evaluate model on a new batch — acts like a rolling test set."""
        score = roc_auc_score(y_new, self.model.predict_proba(X_new)[:, 1])
        self.recent_scores.append(score)

        if len(self.recent_scores) >= 30:
            rolling_auc = np.mean(self.recent_scores[-30:])
            if self.baseline_auc - rolling_auc > 0.05:
                print(f"ALERT: AUC dropped from {self.baseline_auc:.3f} to {rolling_auc:.3f}")

Interview Answer Template

Q: Why must the test set be used exactly once?

The test set provides an unbiased estimate of how the model will perform on truly new data — it simulates deployment. Every time you use the test set to make a decision (pick a model, tune a hyperparameter, decide an architecture), you're inadvertently fitting to it. Over multiple decisions, the test set performance becomes optimistically biased — a form of data leakage. This is why the test set must be completely locked away during development. Only the validation set guides decisions. The test set is used once, after all development is complete, to report the final honest performance number. If the test score is significantly lower than validation, it's often a sign that too many decisions were made using information from the test set, or that the validation set is no longer representative.

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.