Scaling Laws

What Scaling Laws Are

Scaling laws describe how model performance (measured by loss) changes predictably with scale:

L = f(N, D, C)   where N = model parameters, D = training tokens, C = training compute

Key finding: loss decreases as a power law in each resource, provided the other resources are not the bottleneck

L(N) ∝ N^(-α)   — more parameters → lower loss
L(D) ∝ D^(-β)   — more data → lower loss
L(C) ∝ C^(-γ)   — more compute → lower loss

Empirically, these relationships hold across many orders of magnitude of N, D, and C.

This is valuable because it lets teams predict final model quality from small experiments, before committing to expensive full training runs.
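
As a minimal sketch of that workflow: a pure power law is a straight line in log-log space, so its exponent can be estimated with a linear fit and then extrapolated to a larger model. The model sizes and losses below are synthetic values generated for illustration, not measurements, and the helper names are hypothetical.

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ k * n**(-alpha) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return np.exp(intercept), -slope  # (k, alpha)

def predict_loss(n, k, alpha):
    """Evaluate the fitted power law at model size n."""
    return k * n ** (-alpha)

# Synthetic "small experiment" results (illustrative only, not real measurements)
small_n = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
small_loss = 12.0 * small_n ** (-0.08)

k, alpha = fit_power_law(small_n, small_loss)
print(f"alpha = {alpha:.3f}")
print(f"extrapolated loss at 1e9 params = {predict_loss(1e9, k, alpha):.3f}")
```

Real loss curves also contain an irreducible term that flattens the line at large scale, so a practical fit would estimate that offset as well.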

Kaplan et al. (2020): OpenAI Scaling Laws

The original scaling law paper found:

Optimal allocation of a compute budget C:
  N_opt ∝ C^0.73,  D_opt ∝ C^0.27  (parameters grow much faster than data)

Implication: given a fixed compute budget,
  train a LARGER model on FEWER tokens
  (and stop well before convergence)

GPT-3 was trained following this: 175B params, 300B tokens
  At the time, this was believed to be optimal.
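
To make the exponent concrete, the sketch below converts a compute multiplier into parameter and token multipliers. It assumes compute is proportional to N × D, so the token exponent is 1 − 0.73 = 0.27. That proportionality and the function name are assumptions added for illustration.

```python
def kaplan_scale_factors(compute_multiplier, n_exp=0.73):
    """Return (parameter multiplier, token multiplier) for a compute multiplier.

    Assumes C ∝ N * D, so the token exponent is 1 - n_exp.
    """
    d_exp = 1.0 - n_exp
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

n_mult, d_mult = kaplan_scale_factors(10.0)
print(f"10x compute -> {n_mult:.2f}x params, {d_mult:.2f}x tokens")
```

Under these assumptions, 10× more compute goes to roughly 5.4× more parameters but less than 2× more tokens.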

Chinchilla (Hoffmann et al., 2022)

Chinchilla revised the Kaplan findings with a better experimental design, including a learning-rate schedule matched to each run's token budget:

Chinchilla finding: parameters and tokens should scale in EQUAL PROPORTION

N_opt ∝ C^0.5,  D_opt ∝ C^0.5  (doubling the model size means doubling the training tokens)

In absolute terms: the optimal ratio is roughly 20 tokens per parameter

For a model with N parameters, train on ~20N tokens:
  7B model:   train on 140B tokens (Chinchilla recommendation)
  70B model:  train on 1.4T tokens

By the Chinchilla criterion, GPT-3 (175B params, 300B tokens, ~1.7 tokens/param) was undertrained
  Chinchilla optimal: a 175B model needs ~3.5T tokens
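
The 20 tokens per parameter rule above is simple enough to turn into a small calculator. The sketch below also estimates training compute with C ≈ 6·N·D, a commonly used approximation that is an assumption here rather than something stated in this section. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above

def chinchilla_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Training tokens suggested by the tokens-per-parameter rule of thumb."""
    return ratio * n_params

def training_flops(n_params, n_tokens):
    """Rough training cost. Assumption: C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

for n in (7e9, 70e9, 175e9):
    d = chinchilla_tokens(n)
    print(f"{n / 1e9:>5.0f}B params -> {d / 1e12:.2f}T tokens, "
          f"~{training_flops(n, d):.2e} FLOPs")
```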

The Compute-Optimal Frontier

For a fixed compute budget C:

Too few parameters, too many tokens:
  Model is underparameterised — runs out of capacity to fit the data

Too many parameters, too few tokens:
  Model is undertrained — each parameter hasn't seen enough signal

Chinchilla optimal:
  Balance between the two — maximises performance per FLOP

Practical implication:
  LLaMA 2 7B trained on 2T tokens (~286 tokens/param)
  This is far MORE data than the Chinchilla-optimal ~140B tokens for a 7B model,
  but it optimises for INFERENCE efficiency (a small model made as capable as possible)
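
The trade-off for a fixed budget can be sketched numerically: hold compute constant, sweep the model size, and see where the predicted loss is lowest. This sketch uses the parametric loss form L ≈ A/N^α + B/D^β + L_∞ with the constants reported by Hoffmann et al. (2022), and it assumes C ≈ 6·N·D. The budget value and the helper name `sweep_model_sizes` are hypothetical.

```python
import numpy as np

def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss fit: capacity term + data term + irreducible loss."""
    return A / N**alpha + B / D**beta + L_inf

def sweep_model_sizes(C, n_grid):
    """Loss at each model size when every candidate spends the same compute C."""
    d_grid = C / (6.0 * n_grid)  # assumption: C ≈ 6 * N * D
    return d_grid, chinchilla_loss(n_grid, d_grid)

C = 1e22  # hypothetical FLOP budget
n_grid = np.logspace(8, 11, 301)  # candidate sizes from 100M to 100B params
d_grid, losses = sweep_model_sizes(C, n_grid)

best = int(np.argmin(losses))
print(f"best N ≈ {n_grid[best]:.2e}, D ≈ {d_grid[best]:.2e}, "
      f"loss ≈ {losses[best]:.3f}")
```

Loss rises on both sides of the minimum: too small a model is capacity-limited, too large a model sees too few tokens. The tokens-per-parameter ratio at the minimum depends on the fitted constants and need not equal the 20 tokens/param rule of thumb.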

Inference-Optimal vs Training-Optimal

An important distinction that emerged post-Chinchilla:

Training-optimal: minimise loss per training FLOP
  → Scale parameters and data together

Inference-optimal: minimise cost to run the model at deployment
  → Train a SMALLER model on MORE tokens
  → With enough extra tokens, it can reach the same loss as a larger model trained on fewer tokens
  → Cheaper to serve (fewer params = fewer MACs per token)

LLaMA's strategy (Touvron et al.):
  Train 7B model on 1T tokens (143 tokens/param)
  Far exceeds Chinchilla training-optimal for 7B
  The paper reports LLaMA-13B outperforming GPT-3 (175B) on most benchmarks
  A 7B model needs roughly 10× fewer FLOPs per token to serve than a 70B model
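
One way to reason about "smaller model, more tokens" is to invert a parametric loss fit and ask how many tokens a small model needs to match a larger one. The sketch below does this with the form L ≈ A/N^α + B/D^β + L_∞ and the constants reported by Hoffmann et al. (2022). The 13B reference configuration, the function name `tokens_to_match`, and the ~2·N FLOPs-per-token inference estimate are assumptions made for illustration.

```python
def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss fit: capacity term + data term + irreducible loss."""
    return A / N**alpha + B / D**beta + L_inf

def tokens_to_match(target_loss, N, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Tokens a model with N params needs to reach target_loss, or None if unreachable."""
    gap = target_loss - L_inf - A / N**alpha
    if gap <= 0:
        return None  # capacity term alone already exceeds the target
    return (B / gap) ** (1.0 / beta)

# Hypothetical reference: a 13B model trained at 20 tokens/param
big_n, big_d = 13e9, 260e9
target = chinchilla_loss(big_n, big_d)

small_n = 7e9
small_d = tokens_to_match(target, small_n)

print(f"target loss: {target:.3f}")
print(f"7B model needs ~{small_d / 1e9:.0f}B tokens "
      f"({small_d / small_n:.0f} tokens/param)")
# Assumption: one forward pass costs about 2 * N FLOPs per token
print(f"inference FLOPs per token: {2 * small_n:.1e} vs {2 * big_n:.1e}")
```

If the target loss sits below what the small model's capacity term allows, no amount of data closes the gap under this fit, which is why the function can return None.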

Power Law Fit

Python

import numpy as np

# Approximate loss as function of model size (N) and data (D)
# L ≈ A/N^α + B/D^β + L_∞

def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss fit; default constants are those reported by Hoffmann et al. (2022)."""
    # N and D may be scalars or NumPy arrays
    return A / N**alpha + B / D**beta + L_inf

# Example: 7B model, 1T tokens
loss = chinchilla_loss(N=7e9, D=1e12)
print(f"Predicted loss: {loss:.3f}")  # ≈ 2.05 nats per token

What Doesn't Scale Predictably

Scaling laws hold for:
  - Next-token prediction loss (perplexity)
  - General downstream task performance (aggregate)

Do NOT scale predictably:
  - Specific reasoning tasks (may emerge abruptly)
  - Safety properties (bigger ≠ safer)
  - Factual accuracy (can hallucinate at any scale)
  - Instruction following (requires fine-tuning)
  - Long-context reasoning (limited by architecture, not just scale)

Interview Answer

"Scaling laws show that LLM loss decreases as a power law with parameters N, training tokens D, and compute C — predictably across orders of magnitude. Kaplan et al. (2020) found that for a compute budget, you should prioritise model size. Chinchilla (2022) revised this: optimal training uses ~20 tokens per parameter, so a 7B model needs ~140B tokens. In practice, teams now train smaller models on much more data than Chinchilla-optimal — a 7B model on 2T tokens — because inference cost (not training cost) dominates at deployment, and a more-trained small model can match a larger, undertrained one."