LLMs Deep Dive · Lesson 4 of 24
Scaling Laws: Compute, Data, and Parameters
What Scaling Laws Are
Scaling laws describe how model performance (measured by loss) changes predictably with scale:
L = f(N, D, C), where N = model parameters, D = training tokens, C = training compute
Key finding: loss decreases as a power law with each resource
L(N) ∝ N^(-α) — more parameters → lower loss
L(D) ∝ D^(-β) — more data → lower loss
L(C) ∝ C^(-γ) — more compute → lower loss
These relationships hold across many orders of magnitude
This is valuable because it lets teams predict final model quality from small experiments before committing to expensive full training runs.
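As a rough illustration of predicting from small experiments, the sketch below fits a pure power law L(N) = a · N^(-α) to a handful of small runs and extrapolates to a larger model. The data points are synthetic and the function names are hypothetical. Real fits usually also include an irreducible-loss term, which this sketch leaves out.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    alpha = -slope                        # power-law exponent
    a = math.exp(y_mean - slope * x_mean) # power-law prefactor
    return a, alpha

def predict_loss(a, alpha, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-alpha)

# Synthetic "small runs", generated from a known power law (assumption, not real data)
small_sizes = [1e7, 3e7, 1e8, 3e8]
small_losses = [12.0 * n ** -0.08 for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
predicted_1b = predict_loss(a, alpha, 1e9)
print(f"alpha = {alpha:.3f}, predicted loss at 1B params = {predicted_1b:.3f}")
```

On noisy real measurements the fitted exponent is only as trustworthy as the range of scales covered, so extrapolating far beyond the largest small run should be treated with caution.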
Kaplan et al. (2020): OpenAI Scaling Laws
The original scaling law paper found:
Optimal model size for a compute budget C:
N_opt ∝ C^0.73 (parameters scale faster than data)
Implication: given a fixed compute budget,
train a LARGER model on FEWER tokens
(stop well short of convergence rather than training the model to the end)
GPT-3 was trained following this: 175B params, 300B tokens
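To make the Kaplan allocation concrete, here is a small sketch of how extra compute would be split under N_opt ∝ C^0.73. It assumes compute is proportional to N × D, so tokens receive the remaining exponent (1 - 0.73 = 0.27). That proportionality is a simplifying assumption, not something stated above. The function name is hypothetical.

```python
KAPLAN_N_EXPONENT = 0.73  # from the text: N_opt ∝ C^0.73

def kaplan_growth(compute_multiplier, n_exponent=KAPLAN_N_EXPONENT):
    """Return (param_multiplier, token_multiplier) for a compute multiplier.

    Assumes C ∝ N * D, so tokens grow with exponent (1 - n_exponent).
    """
    param_mult = compute_multiplier ** n_exponent
    token_mult = compute_multiplier ** (1.0 - n_exponent)
    return param_mult, token_mult

param_mult, token_mult = kaplan_growth(10.0)
print(f"10x compute -> {param_mult:.2f}x params, {token_mult:.2f}x tokens")

# GPT-3's configuration from the text, expressed as tokens per parameter
gpt3_tokens_per_param = 300e9 / 175e9
print(f"GPT-3: {gpt3_tokens_per_param:.2f} tokens per parameter")
```

Under this recipe most of any additional compute goes into model size, which is how a 175B-parameter model ends up seeing fewer than two tokens per parameter.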
At the time, this was believed to be optimal.
Chinchilla (Hoffmann et al., 2022)
Chinchilla revised the Kaplan findings with a better experimental design:
Chinchilla finding: parameters and tokens should scale in EQUAL PROPORTION
N_opt ∝ C^0.5 and D_opt ∝ C^0.5 (approximately), so doubling the model size means doubling the training tokens
More precisely: optimal ratio is ~20 tokens per parameter
For a model with N parameters, train on ~20N tokens:
7B model: train on 140B tokens (Chinchilla recommendation)
70B model: train on 1.4T tokens
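The 20-tokens-per-parameter rule can be turned into a quick budget calculator. The sketch below also estimates training compute with the common approximation C ≈ 6 · N · D FLOPs, which is an assumption brought in for illustration and is not part of the text above. Function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the text

def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Training tokens suggested by the ~20 tokens/param rule."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

for n_params in (7e9, 70e9):
    tokens = chinchilla_tokens(n_params)
    flops = training_flops(n_params, tokens)
    print(f"{n_params / 1e9:.0f}B params -> {tokens / 1e9:.0f}B tokens, "
          f"~{flops:.2e} training FLOPs")
```

Because both N and D grow tenfold between the two rows, the estimated compute grows a hundredfold, which is why compute-optimal training gets expensive quickly.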
Before Chinchilla: GPT-3 (175B params, 300B tokens) was undertrained
Chinchilla optimal: 175B model needs 3.5T tokens
The Compute-Optimal Frontier
For a fixed compute budget C:
Too few parameters, too many tokens:
Model is underparameterised — runs out of capacity to fit the data
Too many parameters, too few tokens:
Model is undertrained — each parameter hasn't seen enough signal
Chinchilla optimal:
Balance between the two — maximises performance per FLOP
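The three regimes above can be seen by sweeping model size at a fixed compute budget. This is a minimal sketch under two assumptions: the parametric loss L ≈ A/N^α + B/D^β + L_∞ with the fitted constants from Hoffmann et al. (2022), and the approximation C ≈ 6 · N · D. The sweep range and function names are hypothetical.

```python
def parametric_loss(n_params, n_tokens, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss fit: capacity term + data term + irreducible loss."""
    return A / n_params**alpha + B / n_tokens**beta + L_inf

def sweep_fixed_compute(compute_flops, n_min=1e8, n_max=1e12, steps=400):
    """Scan log-spaced model sizes; tokens are whatever the budget allows."""
    best = None
    for i in range(steps + 1):
        n_params = n_min * (n_max / n_min) ** (i / steps)
        n_tokens = compute_flops / (6.0 * n_params)  # assumes C ≈ 6 * N * D
        loss = parametric_loss(n_params, n_tokens)
        if best is None or loss < best[2]:
            best = (n_params, n_tokens, loss)
    return best

budget = 6.0 * 7e9 * 140e9  # compute of a 7B model trained on 140B tokens
best_n, best_d, best_loss = sweep_fixed_compute(budget)

# Same budget spent on a much smaller or much larger model
too_small = parametric_loss(1e9, budget / (6.0 * 1e9))
too_large = parametric_loss(70e9, budget / (6.0 * 70e9))

print(f"best: {best_n:.2e} params, {best_d:.2e} tokens, loss {best_loss:.3f}")
print(f"1B model: loss {too_small:.3f}, 70B model: loss {too_large:.3f}")
```

The minimum of this particular fit does not have to land on exactly 20 tokens per parameter. The 20:1 figure is a rule of thumb, and different fitting approaches give somewhat different optima.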
Practical implication:
LLaMA 2 7B was trained on 2T tokens (~286 tokens/param)
This is far MORE data than Chinchilla-optimal for a single training run,
but it optimises for INFERENCE efficiency (a smaller model that is more capable for its size)
Inference-Optimal vs Training-Optimal
An important distinction that emerged post-Chinchilla:
Training-optimal: minimise loss per training FLOP
→ Scale parameters and data together
Inference-optimal: minimise cost to run the model at deployment
→ Train a SMALLER model on MORE tokens
→ With enough extra data, it reaches the same loss as a larger model trained on fewer tokens
→ Cheaper to serve (fewer params = fewer MACs per token)
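The trade-off can be sketched numerically: how many tokens would a small model need to match the loss of a larger, Chinchilla-style model, and what does that do to training and inference cost? The sketch assumes the parametric loss fit with the constants from Hoffmann et al. (2022), training compute of about 6 · N · D FLOPs, and inference cost of about 2 · N FLOPs per token. All three are approximations, and the helper names are hypothetical.

```python
def parametric_loss(n_params, n_tokens, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss fit: capacity term + data term + irreducible loss."""
    return A / n_params**alpha + B / n_tokens**beta + L_inf

def tokens_to_match(n_params, target_loss, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28, L_inf=1.69):
    """Tokens needed for a model of this size to hit target_loss.

    Returns None when the model is too small to reach the target at all,
    i.e. its capacity term alone already exceeds the allowed loss.
    """
    gap = target_loss - L_inf - A / n_params**alpha
    if gap <= 0:
        return None
    return (B / gap) ** (1.0 / beta)

large_n, large_d = 70e9, 1.4e12  # 70B model at ~20 tokens/param
small_n = 7e9                    # candidate model for cheap serving

target_loss = parametric_loss(large_n, large_d)
small_d = tokens_to_match(small_n, target_loss)

# Cost ratios, small model relative to large model
train_ratio = (6.0 * small_n * small_d) / (6.0 * large_n * large_d)
inference_ratio = (2.0 * small_n) / (2.0 * large_n)

print(f"7B needs {small_d:.2e} tokens to match the 70B loss")
print(f"training cost ratio: {train_ratio:.2f}, per-token inference ratio: {inference_ratio:.2f}")
```

Under this fit the small model needs far more than 20 tokens per parameter and costs more to train, but every generated token then costs about a tenth of the compute. Whether that pays off depends on how many tokens the model will serve over its lifetime.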
LLaMA's strategy (Touvron et al.):
Train 7B model on 1T tokens (143 tokens/param)
Far exceeds Chinchilla training-optimal for 7B
Payoff reported in the paper: LLaMA-13B outperforms GPT-3 (175B) on most benchmarks
AND a 7B model is roughly 10× cheaper to run per token than a 70B model
Power Law Fit
# Approximate loss as a function of model size (N) and data (D):
#   L ≈ A/N^α + B/D^β + L_∞
# Default constants are the fitted values reported by Hoffmann et al. (2022).
def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    # Capacity-limited term + data-limited term + irreducible loss
    return A / N**alpha + B / D**beta + L_inf

# Example: 7B model, 1T tokens
L = chinchilla_loss(N=7e9, D=1e12)
print(f"Predicted loss: {L:.3f}")  # about 2.05 nats per token
What Doesn't Scale Predictably
Scaling laws hold for:
- Next-token prediction loss (perplexity)
- General downstream task performance (aggregate)
Do NOT scale predictably:
- Specific reasoning tasks (may emerge abruptly)
- Safety properties (bigger ≠ safer)
- Factual accuracy (can hallucinate at any scale)
- Instruction following (requires fine-tuning)
- Long-context reasoning (limited by architecture, not just scale)
Interview Answer
"Scaling laws show that LLM loss decreases as a power law with parameters N, training tokens D, and compute C — predictably across orders of magnitude. Kaplan et al. (2020) found that for a compute budget, you should prioritise model size. Chinchilla (2022) revised this: optimal training uses ~20 tokens per parameter, so a 7B model needs ~140B tokens. In practice, teams now train smaller models on much more data than Chinchilla-optimal — a 7B model on 2T tokens — because inference cost (not training cost) dominates at deployment, and a more-trained small model can match a larger, undertrained one."