The Training Set: What It Does and Doesn't Do

Understand the role of the training set: what the model learns from it, why high training accuracy is meaningless alone, and what the training set tells and doesn't tell you about real-world performance.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: Machine Learning, Training Set, Overfitting, Generalization, Interview

What the Training Set Does

The training set is the data the model sees and learns from. During training, the algorithm adjusts the model's parameters (weights in a linear model or neural network, split rules in a tree) to fit this data, typically by minimizing a loss function.

Python
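from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption) so the snippet runs on its own
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The forest's trees are built from the training data only
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)   # Parameters learned from X_train, y_train only

# Training score: how well the model fits the data it learned from
train_score = model.score(X_train, y_train)
print(f"Training accuracy: {train_score:.2%}")   # Often 95-100% for complex models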
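
For models trained by gradient descent, "minimizing the loss on this data" can be made concrete. The sketch below is a minimal illustration, not the random forest's fitting procedure: it trains a logistic model with plain NumPy on hypothetical toy data (`X`, `y`, `learning_rate`, and the step count are all assumptions chosen for the example) and records the training loss after each update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points, 2 features, label depends on their sum
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)


def log_loss(w, b, X, y):
    """Mean binary cross-entropy of a logistic model on (X, y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


w = np.zeros(2)
b = 0.0
learning_rate = 0.5  # illustrative value
losses = []

for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y) / len(y)   # gradient of the loss w.r.t. the weights
    grad_b = float(np.mean(p - y))    # gradient of the loss w.r.t. the bias
    w -= learning_rate * grad_w       # step downhill on the TRAINING loss
    b -= learning_rate * grad_b
    losses.append(log_loss(w, b, X, y))

print(f"Loss after first step: {losses[0]:.3f}")
print(f"Loss after last step:  {losses[-1]:.3f}")
```

The loss falls because every update is computed from the training data alone; nothing in the loop measures performance on unseen examples.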

What the Training Set Doesn't Tell You

High training accuracy is not evidence of a good model. A sufficiently complex model can memorize the training set entirely — scoring 100% — without learning anything generalizable.

Python
from sklearn.tree import DecisionTreeClassifier
import numpy as np

np.random.seed(42)
X_train = np.random.randn(100, 5)
y_train = np.random.randint(0, 2, 100)   # Random labels!

X_test = np.random.randn(20, 5)
y_test = np.random.randint(0, 2, 20)

# Unconstrained tree: can grow deep enough to memorize every training point
overfit_tree = DecisionTreeClassifier(max_depth=None)
overfit_tree.fit(X_train, y_train)

print(f"Training accuracy: {overfit_tree.score(X_train, y_train):.2%}")  # 100%
print(f"Test accuracy:     {overfit_tree.score(X_test, y_test):.2%}")    # Near 50% on average (chance level); noisy with only 20 test points

The training score tells you how well the model fits its training data — not how well it will generalize.
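
One common way to get a number that does speak to generalization is to score the model on data it was not fit to. The sketch below assumes scikit-learn's `cross_val_score` with 5 folds and hypothetical random-label data, and compares the training score with the cross-validated score for the same unconstrained tree.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Hypothetical data: labels are random, so there is nothing real to learn
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

tree = DecisionTreeClassifier(random_state=0)

# Score on the same data the tree was fit to
train_acc = tree.fit(X, y).score(X, y)

# Score on held-out folds: each fold is predicted by a tree that never saw it
cv_scores = cross_val_score(tree, X, y, cv=5)

print(f"Training accuracy:        {train_acc:.2%}")
print(f"Cross-validated accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})")
```

On random labels the held-out score should hover around chance, whatever the training score says.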


Training vs Generalization Gap

The gap between training score and validation/test score is the clearest diagnostic for overfitting.

Good fit:
  Training accuracy: 91%
  Validation accuracy: 90%
  Gap: 1 point (excellent)

Mild overfitting:
  Training accuracy: 97%
  Validation accuracy: 84%
  Gap: 13 points (reduce model complexity or add regularization)

Severe overfitting (memorization):
  Training accuracy: 100%
  Validation accuracy: 62%
  Gap: 38 points (the model has mostly memorized and should not ship as is)

Python
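def diagnose_fit(model, X_train, y_train, X_val, y_val) -> str:
    """Label a fitted model's train/validation gap.

    The 0.75, 0.05, and 0.15 cutoffs are rules of thumb, not universal
    thresholds; adjust them to the task and its baseline accuracy.
    """
    train_acc = model.score(X_train, y_train)
    val_acc   = model.score(X_val, y_val)
    gap = train_acc - val_acc   # Positive when the model does better on training data

    if train_acc < 0.75 and val_acc < 0.75:
        return f"Underfitting: training acc {train_acc:.0%}, val acc {val_acc:.0%}"
    elif gap < 0.05:
        return f"Good fit: gap is only {gap:.1%}"
    elif gap < 0.15:
        return f"Mild overfitting: gap {gap:.1%}"
    else:
        return f"Severe overfitting: gap {gap:.1%}"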

What Affects Training Set Performance

  1. Model complexity: deeper or wider models fit training data more easily
  2. Training set size: more data makes memorization harder relative to learning
  3. Regularization: L1/L2 penalties and dropout discourage the model from fitting the training data too tightly
  4. Noise in labels: mislabeled examples cap the training accuracy a well-regularized model should reach; a model that still scores 100% on noisy labels has memorized the noise
  5. Feature quality: good features make it easier for the model to learn meaningful patterns
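
The complexity and regularization factors can be seen together by sweeping a single knob. A minimal sketch, assuming synthetic data from `make_classification` with 10% flipped labels and a decision tree whose `max_depth` stands in for model complexity (limiting depth acts as a simple form of regularization); the dataset and the depth values are illustrative choices, not taken from the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with some label noise (flip_y=0.1)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in [1, 3, 10, None]:   # None lets the tree grow until it fits every point
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    results[depth] = (train_acc, val_acc)
    print(f"max_depth={str(depth):>4}: train={train_acc:.2%}, "
          f"val={val_acc:.2%}, gap={train_acc - val_acc:.2%}")
```

Training accuracy rises with depth; the validation score shows whether that extra fit is signal or memorized noise.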

The Right Interpretation

Python
results = {
    "Training accuracy":   0.98,
    "Validation accuracy": 0.87,
    "Test accuracy":       0.86,
}

# What this tells you:
# - Model has learned real signal (87% validation accuracy, assuming that is well above the task's baseline)
# - Some overfitting (98% train vs 87% val, an 11-point gap)
# - Generalizes reasonably (val ≈ test, so no sign of overfitting to the validation set)

# What training accuracy of 98% does NOT mean:
# - Model will achieve 98% in production
# - Model is better than one with 90% training accuracy
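
Whether a validation accuracy counts as good depends on what a trivial model scores on the same data. The sketch below assumes a hypothetical imbalanced dataset (about 90% of labels in one class) and uses scikit-learn's `DummyClassifier` as a baseline that always predicts the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_val, y_val)
model_acc = model.score(X_val, y_val)

print(f"Baseline validation accuracy: {baseline_acc:.2%}")
print(f"Model validation accuracy:    {model_acc:.2%}")
```

On data like this, a validation accuracy in the high 80s would be no better than guessing the majority class, so the headline number only means something next to the baseline.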

Training Set Size: More Data Usually Wins

With more training data:

  • Complex models have less room to memorize individual examples
  • The model sees more diverse patterns, improving generalization
  • The gap between training and validation accuracy typically shrinks

Python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate learning curves
np.random.seed(42)
X_full = np.random.randn(10000, 20)
y_full = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)

train_sizes = [100, 500, 1000, 5000, 8000]
for n in train_sizes:
    X_tr = X_full[:n]
    y_tr = y_full[:n]
    X_te = X_full[8000:]
    y_te = y_full[8000:]

    model = LogisticRegression()
    model.fit(X_tr, y_tr)
    print(f"n={n:4d}: train={model.score(X_tr, y_tr):.2%}, test={model.score(X_te, y_te):.2%}")

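# Expected pattern (exact numbers depend on the random data):
# the train/test gap is typically widest at n=100 and shrinks toward zero as n grows.
# These labels are a noise-free linear function of two features, so both
# scores end up high at the larger training sizes.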
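
scikit-learn also provides `learning_curve`, which runs this kind of experiment with cross-validation instead of a single fixed test split. A hedged sketch, assuming hypothetical noisy data (the noise term is added so the labels are not perfectly separable) and five training-set fractions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(42)

# Hypothetical data: 20 features, label driven by the first two plus noise
X = rng.normal(size=(2000, 20))
noise = rng.normal(scale=1.0, size=2000)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[0.05, 0.1, 0.25, 0.5, 1.0],  # fractions of each training fold
    cv=5,
    shuffle=True,
    random_state=0,
)

# Each row holds the scores for one training size across the 5 folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean

for n, tr, va, gap in zip(sizes, train_mean, val_mean, gaps):
    print(f"n={n:4d}: train={tr:.2%}, val={va:.2%}, gap={gap:.2%}")
```

Averaging over folds makes the curve less sensitive to which examples happen to land in a single split.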

Interview Answer Template

Q: What does a high training accuracy tell you about a model?

On its own, high training accuracy tells you very little — a sufficiently complex model can memorize every training example and score 100% without learning anything useful. What matters is the gap between training and validation accuracy. A small gap (less than a few percent) suggests good generalization. A large gap suggests overfitting: the model has memorized training noise rather than learning underlying patterns. The training set is what the model optimizes against, so it's expected to score well on it. The meaningful questions are how it performs on held-out validation data and whether the training-validation gap is small enough to trust deployment.
