LLMs Deep Dive · Lesson 3 of 24
Pre-training: Next Token Prediction at Scale
Pretraining Objective
LLMs are pretrained with causal language modelling — next-token prediction:
Objective: maximise the average log-likelihood of each token given the tokens before it
L(θ) = (1/T) Σ_{i=1}^{T} log Pθ(token_i | token_0, ..., token_{i-1})
Equivalently, minimise the cross-entropy loss -L(θ).
For each sequence "The cat sat on the mat":
Predict "cat" given "The"
Predict "sat" given "The cat"
Predict "on" given "The cat sat"
...all in parallel via causal masking
Every token contributes a gradient signal. A 2T-token dataset provides 2T training examples from a single pass.
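As a rough illustration of how one sequence yields a training example at every position, the sketch below builds the shifted (context, target) pairs and averages the negative log-likelihood over them. The `toy_model` function is a hypothetical stand-in that returns a uniform distribution. A real LLM would produce learned probabilities for all positions in one masked forward pass instead of a Python loop.

```python
import math

def next_token_pairs(tokens):
    """Return (context, target) pairs: each position predicts the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def toy_model(context, vocab):
    """Hypothetical stand-in for an LLM: uniform distribution over the vocabulary."""
    p = 1.0 / len(vocab)
    return {tok: p for tok in vocab}

def average_nll(tokens, vocab, model=toy_model):
    """Mean negative log-likelihood (cross-entropy) over all predicted positions."""
    pairs = next_token_pairs(tokens)
    total = 0.0
    for context, target in pairs:
        probs = model(context, vocab)
        total -= math.log(probs[target])
    return total / len(pairs)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))          # "The" and "the" are distinct tokens here
pairs = next_token_pairs(tokens)     # 5 training examples from 6 tokens
loss = average_nll(tokens, vocab)
```

With a uniform model the loss equals the log of the vocabulary size; training drives it below that baseline.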
Pretraining Data
LLaMA 1 data mix (about 1.4T tokens for the largest models; LLaMA 2 used 2T tokens but did not publish a per-source breakdown):
CommonCrawl: 67% (web pages, filtered)
C4: 15% (cleaned web text)
GitHub: 4.5%
Wikipedia: 4.5%
Books: 4.5%
ArXiv: 2.5%
StackExchange: 2%
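One way such a mix might be applied at training time is to choose the source of each document by sampling with the percentages above as weights. This is only an illustrative sketch: the `MIX` dictionary and the `sample_source` and `expected_tokens` helpers are hypothetical names, and real pipelines usually sample at the level of shards or token budgets.

```python
import random

# Sampling proportions from the list above, in percent.
MIX = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "GitHub": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}

def sample_source(rng):
    """Pick one data source with probability proportional to its weight."""
    names = list(MIX)
    weights = list(MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]

def expected_tokens(total_tokens):
    """Split a total token budget across sources according to the mix."""
    total = sum(MIX.values())
    return {name: total_tokens * w / total for name, w in MIX.items()}

rng = random.Random(0)                       # fixed seed for reproducibility
draws = [sample_source(rng) for _ in range(10_000)]
share_cc = draws.count("CommonCrawl") / len(draws)
budget = expected_tokens(1000)
```

Because the weights are sampling proportions, a small source with a relatively large weight is seen more than once per pass, which is what upsampling means in practice.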
Key insight: web crawl dominates by volume, but high-quality
sources (Wikipedia, books, code) are overrepresented relative
to their raw proportion: they're upsampled during training.
Data Preprocessing Pipeline
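To make the final packing step of the pipeline listed below more concrete, here is a minimal sketch. It assumes documents are already tokenised into integer IDs, uses a hypothetical separator ID of 0, and drops the incomplete tail chunk, which is one common choice but not the only one.

```python
def pack_sequences(docs, seq_len, sep_id):
    """Concatenate tokenised documents with a separator, then cut fixed-length chunks."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)        # mark the document boundary
    n_full = len(stream) // seq_len  # assumption: discard the incomplete tail
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Toy token IDs; a real run would use seq_len such as 4096.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(docs, seq_len=4, sep_id=0)
```

Note that a packed sequence can start or end in the middle of a document; the separator token is what tells the model where one document stops and the next begins.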
1. Download: CommonCrawl WARC files (~100TB per crawl)
2. Deduplication:
- URL-level deduplication
- Near-duplicate removal (MinHash LSH)
- Exact n-gram deduplication
- Removes ~30-60% of web data
3. Quality filtering:
- Language detection (keep English, or multilingual mix)
- Perplexity filtering (remove text with very high perplexity under a small LM)
- Heuristic rules (remove pages with too many special characters, too short, etc.)
4. Harmful content removal:
- URL-based blocklists
- Classifier-based filtering (NSFW, PII, hate speech)
5. Tokenisation: batch encode with BPE/SentencePiece
6. Pack: concatenate and split into fixed-length sequences (e.g., 4096 tokens)
with a separator token between documents
Training Infrastructure
Hardware:
LLaMA 2 70B: about 1,720,000 A100 GPU-hours in total (roughly 860 hours, or about 36 days, if spread across 2,000 GPUs)
Parallelism strategies:
Data parallelism: each GPU sees different data
Tensor parallelism: split the weight matrices within a layer (e.g. attention heads, MLP columns) across GPUs
Pipeline parallelism: split layers across GPUs (GPipe, PipeDream)
Sequence parallelism: split long sequences across GPUs (for long context)
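Tensor parallelism can be hard to picture, so the sketch below simulates one common variant, a column-wise split of a weight matrix, using plain Python lists. It is a toy under stated assumptions: the shards stand in for GPUs, the function names are hypothetical, and no real device communication takes place.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, n_parts):
    """Split the columns of w into n_parts equal shards, one per simulated GPU."""
    cols = len(w[0])
    assert cols % n_parts == 0, "columns must divide evenly across shards"
    step = cols // n_parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(n_parts)]

def tensor_parallel_matmul(x, w, n_parts):
    """Each shard computes x @ w_shard; the outputs are concatenated column-wise."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_parts)]
    return [sum((part[r] for part in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]                      # activations: 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]          # weights: hidden size 2 -> 4 outputs
full = matmul(x, w)                       # single-device result
sharded = tensor_parallel_matmul(x, w, n_parts=2)
```

Every shard needs the full input `x` but only its own slice of `w`, which is why this strategy reduces per-GPU weight memory at the cost of communication to gather the outputs.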
Mixed precision: bf16 for compute, fp32 for master weights and optimizer states
Gradient clipping: clip norm at 1.0 to prevent gradient spikes
Optimizer: AdamW with cosine learning rate schedule + warmup
Learning Rate Schedule
Warmup (linear):
0 → peak_lr over first N steps (N ≈ 2000)
Prevents large gradient updates from random initialisation
Cosine decay:
peak_lr → min_lr (typically peak_lr / 10)
Schedule extends over full training
Example values:
LLaMA 2 (7B and 13B): peak_lr = 3e-4, min_lr = 3e-5 (a final LR multiplier of 0.1)
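The warmup and cosine phases can be combined into a single function of the step number. The sketch below is one plausible formulation, not the exact code used for any particular model; the defaults reuse the values above, and `TOTAL_STEPS` is a hypothetical training length chosen only for the example.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine

TOTAL_STEPS = 100_000  # hypothetical value for illustration
schedule = [lr_at(s, TOTAL_STEPS) for s in (0, 1000, 2000, 51_000, 100_000)]
```

The learning rate rises linearly to the peak at the end of warmup, passes the midpoint between peak and minimum halfway through the decay phase, and ends at `min_lr`.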
Why cosine: smooth decay avoids abrupt changes;
performance is more stable than step schedules
What Pretraining Learns
Knowledge encoded in weights after pretraining:
- Factual knowledge (capitals, dates, drug names, disease symptoms)
- Syntactic patterns (grammar, sentence structure)
- Semantic relationships (synonyms, analogies)
- Code patterns (function signatures, algorithms)
- Reasoning patterns (math steps, logical inferences)
- Stylistic patterns (formal vs informal, domain vocabulary)
NOT learned during pretraining:
- Following human instructions
- Being helpful, harmless, honest
- Refusing dangerous requests
These require instruction fine-tuning + RLHF/DPO
Compute Budget Estimation
Using the Chinchilla scaling law approximation:
For a model with N parameters, the compute-optimal training set is roughly 20·N tokens
Compute C ≈ 6 · N · T FLOPs (rough estimate)
N = model parameters
T = training tokens
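Both rules of thumb above fit in a few lines of arithmetic. The sketch below is a back-of-the-envelope calculator, with hypothetical function names; the factor of 6 and the 20 tokens per parameter are the approximations stated above, not exact constants.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a model of n_params parameters."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Rough training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

n = 7e9                                   # 7B parameters
t_opt = chinchilla_optimal_tokens(n)      # tokens under the ~20x rule
c_opt = training_flops(n, t_opt)          # compute at that token count
c_2t = training_flops(n, 2e12)            # compute for 2T training tokens
tokens_per_param_2t = 2e12 / n            # how far 2T tokens is past the ~20x rule
```

For a 7B model the 20x rule gives about 140B tokens, so training on 2T tokens corresponds to roughly 286 tokens per parameter, well beyond the compute-optimal point.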
LLaMA 2 7B:
C ≈ 6 × 7×10⁹ × 2×10¹² = 8.4×10²² FLOPs ≈ 0.84 × 10²³
Interview Answer
"LLM pretraining is causal language modelling: predict the next token given all previous tokens, with cross-entropy loss at every position. Training data is dominated by filtered web crawl (CommonCrawl) supplemented by high-quality sources like Wikipedia, books, and code — high-quality sources are upsampled. Training runs on thousands of GPUs using data, tensor, and pipeline parallelism with mixed precision (bf16). Pretraining encodes factual knowledge, syntactic patterns, and reasoning templates — but not instruction-following or safety behaviours, which require fine-tuning and alignment."