The Training Set: What It Does and Doesn't Do

What the Training Set Does

The training set is the data the model sees and learns from. During training, the algorithm adjusts the model's parameters (weights in a linear model or neural network, split rules in a tree) to fit this data, typically by minimizing a loss function.

Python

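from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so the snippet runs on its own (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The forest's trees are built from the training data only
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)   # Fitted on X_train, y_train only

# Training score: how well the model fits the data it learned from
train_score = model.score(X_train, y_train)
print(f"Training accuracy: {train_score:.2%}")   # Often 95-100% for complex models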
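To make "adjusting parameters to minimize the loss" concrete, here is a minimal sketch of gradient descent on a linear model. The toy data, the learning rate, and the step count are all assumptions chosen for illustration; real libraries use more refined optimizers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y is a noisy linear function of two features
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

def mse(w):
    """Mean squared error of the linear model X @ w on the training data."""
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(2)            # start with all weights at zero
learning_rate = 0.1        # assumed value, small enough to converge here
initial_loss = mse(w)

for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of MSE with respect to w
    w -= learning_rate * grad               # move the weights downhill on the training loss

final_loss = mse(w)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print(f"learned weights: {np.round(w, 2)}")
```

The loss being driven down is computed only on the training data, which is why a low training loss says nothing by itself about unseen data.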

What the Training Set Doesn't Tell You

High training accuracy is not evidence of a good model. A sufficiently complex model can memorize the training set entirely — scoring 100% — without learning anything generalizable.

Python

from sklearn.tree import DecisionTreeClassifier
import numpy as np

np.random.seed(42)
X_train = np.random.randn(100, 5)
y_train = np.random.randint(0, 2, 100)   # Random labels!

X_test = np.random.randn(20, 5)
y_test = np.random.randint(0, 2, 20)

# Unconstrained tree: can grow deep enough to memorize every training point
overfit_tree = DecisionTreeClassifier(max_depth=None)
overfit_tree.fit(X_train, y_train)

print(f"Training accuracy: {overfit_tree.score(X_train, y_train):.2%}")  # 100%
print(f"Test accuracy:     {overfit_tree.score(X_test, y_test):.2%}")    # Near chance (~50% on average; noisy with only 20 test points)

The training score tells you how well the model fits its training data — not how well it will generalize.

Training vs Generalization Gap

The gap between training score and validation/test score is the clearest diagnostic for overfitting.

Good fit:
  Training accuracy: 91%
  Validation accuracy: 90%
  Gap: 1 point (excellent)

Mild overfitting:
  Training accuracy: 97%
  Validation accuracy: 84%
  Gap: 13 points (reduce model complexity or add regularization)

Severe overfitting (memorization):
  Training accuracy: 100%
  Validation accuracy: 62%
  Gap: 38 points (the model is not fit for production)

Python

def diagnose_fit(model, X_train, y_train, X_val, y_val) -> str:
    """Label the fit from train/validation accuracy. The thresholds are rules of thumb, not standards."""
    train_acc = model.score(X_train, y_train)
    val_acc   = model.score(X_val, y_val)
    gap = train_acc - val_acc

    if train_acc < 0.75 and val_acc < 0.75:
        return f"Underfitting — training acc {train_acc:.0%}, val acc {val_acc:.0%}"
    elif gap < 0.05:
        return f"Good fit — gap is only {gap:.1%}"
    elif gap < 0.15:
        return f"Mild overfitting — gap {gap:.1%}"
    else:
        return f"Severe overfitting — gap {gap:.1%}"
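A single train/validation split can give a noisy estimate of the gap. One way to smooth it, sketched here under assumed synthetic data, is scikit-learn's `cross_validate` with `return_train_score=True`, which reports both scores for every fold. The helper name `mean_gap` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), purely for illustration
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

def mean_gap(model):
    """Return (mean train accuracy, mean validation accuracy, gap) over 5 folds."""
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    train_acc = cv["train_score"].mean()
    val_acc = cv["test_score"].mean()
    return train_acc, val_acc, train_acc - val_acc

deep = mean_gap(DecisionTreeClassifier(random_state=0))                  # unconstrained depth
shallow = mean_gap(DecisionTreeClassifier(max_depth=3, random_state=0))  # depth capped at 3

for name, (tr, va, gap) in [("deep", deep), ("shallow", shallow)]:
    print(f"{name:8s} train={tr:.2%}  val={va:.2%}  gap={gap:.1%}")
```

Averaging over folds makes the comparison between the two trees less dependent on one lucky or unlucky split.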

What Affects Training Set Performance

Model complexity: deeper or wider models fit training data more easily
Training set size: with more data, memorizing every example becomes harder than learning the underlying pattern
Regularization: L1/L2 penalties and dropout discourage the model from fitting the training data too tightly
Noise in labels: mislabeled examples cap the training accuracy a model can reach without memorizing them, so a score above that cap means the model has fit the noise
Feature quality: good features make it easier for the model to learn meaningful patterns
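The label-noise factor in the list above can be checked with a quick experiment. This is a sketch under assumed conditions: synthetic data, a 20% flip rate, and two arbitrarily chosen models (a logistic regression and an unconstrained decision tree).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: clean labels follow a simple linear rule
X = rng.normal(size=(2000, 5))
y_clean = (X[:, 0] + X[:, 1] > 0).astype(int)

# Flip roughly 20% of the labels to simulate annotation errors
flip = rng.random(2000) < 0.2
y_noisy = np.where(flip, 1 - y_clean, y_clean)

linear = LogisticRegression().fit(X, y_noisy)
tree = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)

linear_train_acc = linear.score(X, y_noisy)
tree_train_acc = tree.score(X, y_noisy)

print(f"Fraction of labels flipped:      {flip.mean():.2%}")
print(f"Logistic regression (train acc): {linear_train_acc:.2%}")   # limited by the flipped labels
print(f"Unconstrained tree (train acc):  {tree_train_acc:.2%}")     # fits the flipped labels too
```

The simple model cannot fit the flipped labels, so its training accuracy stays well below 100%. The unconstrained tree gets past that limit only by memorizing the errors.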

The Right Interpretation

Python

results = {
    "Training accuracy":   0.98,
    "Validation accuracy": 0.87,
    "Test accuracy":       0.86,
}

# What this tells you:
# - The model has learned something useful (whether 87% val accuracy is good depends on the task and baseline)
# - Some overfitting (98% train vs 87% val: an 11-point gap)
# - The validation estimate holds up (val ≈ test, so no sign of overfitting to the validation set)

# What training accuracy of 98% does NOT mean:
# - Model will achieve 98% in production
# - Model is better than one with 90% training accuracy
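
The two gaps discussed in the comments above can be computed directly. A small sketch reusing the same illustrative numbers (the variable names are hypothetical):

```python
results = {
    "Training accuracy":   0.98,
    "Validation accuracy": 0.87,
    "Test accuracy":       0.86,
}

# Gap between what the model fit and what it generalizes to
train_val_gap = results["Training accuracy"] - results["Validation accuracy"]

# Gap between the data used for model selection and truly unseen data
val_test_gap = results["Validation accuracy"] - results["Test accuracy"]

print(f"Train-validation gap: {train_val_gap * 100:.0f} points")
print(f"Validation-test gap:  {val_test_gap * 100:.0f} points")
```

A large first gap points to overfitting the training set. A large second gap would suggest that model selection has overfit the validation set.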

Training Set Size: More Data Usually Wins

With more training data:

Complex models have less room to memorize individual examples
The model sees more diverse patterns, improving generalization
The gap between training and validation accuracy typically shrinks

Python

import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate learning curves
np.random.seed(42)
X_full = np.random.randn(10000, 20)
y_full = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)

# Hold out the last 2,000 rows as a fixed test set; training subsets never touch them
X_te, y_te = X_full[8000:], y_full[8000:]

train_sizes = [100, 500, 1000, 5000, 8000]
for n in train_sizes:
    X_tr, y_tr = X_full[:n], y_full[:n]

    model = LogisticRegression()
    model.fit(X_tr, y_tr)
    print(f"n={n:4d}: train={model.score(X_tr, y_tr):.2%}, test={model.score(X_te, y_te):.2%}")

# Expected pattern: the train/test gap is typically widest at n=100 and narrows as n grows.
# The labels here are a noise-free linear function of two features, so both scores
# are high at large n; exact values depend on the random draw.
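
The manual loop above can also be expressed with scikit-learn's `learning_curve` helper, which cross-validates at each training size. The sketch below assumes a different synthetic target with added Gaussian noise (so no model can reach 100%); the sizes and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 20))
# Hypothetical noisy target: a linear rule plus Gaussian noise
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=2000) > 0).astype(int)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=[0.05, 0.25, 1.0],   # fractions of the training portion of each fold
    cv=5,
    shuffle=True,
    random_state=0,
)

# Average over the 5 folds to get one train/validation score per training size
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
gaps = mean_train - mean_val

for n, tr, va, gap in zip(sizes, mean_train, mean_val, gaps):
    print(f"n={n:4d}: train={tr:.2%}, val={va:.2%}, gap={gap:.1%}")
```

With 5-fold cross-validation on 2,000 rows, each fold trains on at most 1,600 rows, so the three sizes are 80, 400, and 1,600.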

Interview Answer Template

Q: What does a high training accuracy tell you about a model?

On its own, high training accuracy tells you very little — a sufficiently complex model can memorize every training example and score 100% without learning anything useful. What matters is the gap between training and validation accuracy. A small gap (less than a few percent) suggests good generalization. A large gap suggests overfitting: the model has memorized training noise rather than learning underlying patterns. The training set is what the model optimizes against, so it's expected to score well on it. The meaningful questions are how it performs on held-out validation data and whether the training-validation gap is small enough to trust deployment.
