LLMs Deep Dive · Lesson 3 of 24

Pre-training: Next Token Prediction at Scale

Pretraining Objective

LLMs are pretrained with causal language modelling — next-token prediction:

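Objective: maximise the average log-likelihood of each token given the previous tokens
(equivalently, minimise the cross-entropy loss, which is the negative of this quantity)

L(θ) = (1/T) Σᵢ₌₁ᵀ log Pθ(token_i | token_0, ..., token_{i-1})

where T is the number of predicted positions in the sequence.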

For each sequence "The cat sat on the mat":
  Predict "cat" given "The"
  Predict "sat" given "The cat"
  Predict "on"  given "The cat sat"
  ...all in parallel via causal masking

Every token contributes a gradient signal, so a dataset of 2 trillion tokens yields roughly 2 trillion training examples in a single pass.
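To make the objective concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one row of logits per input position; the helper names (`softmax`, `mean_next_token_log_likelihood`) and the toy 4-token vocabulary are illustrative, not taken from any library.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_next_token_log_likelihood(token_ids, logits_per_position):
    # Inputs are tokens 0..T-1, targets are the same sequence shifted by one.
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    assert len(logits_per_position) == len(inputs)
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        total += math.log(probs[target])
    return total / len(targets)

# Toy example: vocabulary of 4 tokens, a model that predicts uniformly.
tokens = [0, 2, 1, 3]
uniform_logits = [[0.0] * 4 for _ in range(len(tokens) - 1)]
avg_ll = mean_next_token_log_likelihood(tokens, uniform_logits)
loss = -avg_ll  # cross-entropy loss that training minimises
```

With uniform predictions every target has probability 1/4, so the average log-likelihood equals log(1/4). A trained model pushes this value towards 0.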


Pretraining Data

LLaMA 1 data mix (~1.4T tokens; the LLaMA 2 paper does not publish a per-source breakdown):
  CommonCrawl:    67%  (web pages, filtered)
  C4:             15%  (cleaned web text)
  GitHub:          4.5%
  Wikipedia:       4.5%
  Books:           4.5%
  ArXiv:           2.5%
  StackExchange:   2%

Key insight: web crawl dominates by volume, but high-quality
sources (Wikipedia, books, code) are overrepresented relative
to their raw proportion — they're upsampled during training.
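The effect of upsampling can be worked out with simple arithmetic. The sketch below uses the sampling proportions listed above; the total token budget and the source size passed in the example are hypothetical placeholders, chosen only to show the calculation.

```python
# Sampling proportions from the data mix listed above.
MIX = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def tokens_per_source(mix, total_training_tokens):
    # How many training tokens are drawn from each source.
    return {name: frac * total_training_tokens for name, frac in mix.items()}

def epochs_over_source(sampling_fraction, total_training_tokens, source_size_tokens):
    # Epochs > 1 means the source is repeated, i.e. upsampled.
    return sampling_fraction * total_training_tokens / source_size_tokens

# Hypothetical numbers: a 1T-token budget and a source holding 20B unique tokens.
budget = 10**12
drawn = tokens_per_source(MIX, budget)
epochs = epochs_over_source(MIX["Wikipedia"], budget, 20 * 10**9)
```

Under these assumed numbers, a source sampled at 4.5% of the budget but holding only 20B unique tokens would be seen a little over twice, while a much larger web corpus would be seen about once.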

Data Preprocessing Pipeline

1. Download: CommonCrawl WARC files (~100TB per crawl)

2. Deduplication:
   - URL-level deduplication
   - Near-duplicate removal (MinHash LSH)
   - Exact n-gram deduplication
   - Removes ~30-60% of web data

3. Quality filtering:
   - Language detection (keep English, or multilingual mix)
   - Perplexity filtering (remove text with very high perplexity under a small LM)
   - Heuristic rules (remove pages with too many special characters, too short, etc.)

4. Harmful content removal:
   - URL-based blocklists
   - Classifier-based filtering (NSFW, PII, hate speech)

5. Tokenisation: batch encode with BPE/SentencePiece

6. Pack: concatenate and split into fixed-length sequences (e.g., 4096 tokens)
   with a separator token between documents
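Step 6 (packing) is the easiest stage to illustrate. The sketch below assumes documents are already tokenised into lists of integer ids; `pack_documents` and the separator id `0` are hypothetical names and values for illustration.

```python
def pack_documents(tokenized_docs, seq_len, sep_id):
    # Concatenate all documents into one token stream,
    # appending a separator token after each document.
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(sep_id)
    # Split the stream into fixed-length sequences.
    # The trailing remainder shorter than seq_len is dropped in this sketch.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, seq_len=4, sep_id=0)
# packed == [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that the second sequence spans two documents. Packing avoids wasting compute on padding, at the cost of sequences that may contain unrelated documents side by side.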

Training Infrastructure

Hardware:
  LLaMA 2 70B: ~1,720,000 A100 GPU-hours in total
               (on ~2,000 GPUs, roughly 860 hours, or about 36 days, of wall-clock time)

Parallelism strategies:
  Data parallelism:      each GPU sees different data
  Tensor parallelism:    split each layer's weight matrices across GPUs (e.g., attention heads, MLP columns)
  Pipeline parallelism:  split layers across GPUs (GPipe, PipeDream)
  Sequence parallelism:  split long sequences across GPUs (for long context)

Mixed precision: bf16 for compute, fp32 for master weights and optimizer states
Gradient clipping: clip norm at 1.0 to prevent gradient spikes
Optimizer: AdamW with cosine learning rate schedule + warmup
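Gradient clipping by global norm can be sketched in a few lines. This is a plain-Python illustration that treats each parameter's gradient as a flat list of floats; real frameworks provide their own implementation, and the `eps` term here is an assumption added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # Global L2 norm taken over all gradient values of all parameters.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale down only when the norm exceeds max_norm; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# A gradient spike: norm 5.0 is rescaled to (just under) 1.0.
spiky, spiky_norm = clip_by_global_norm([[3.0, 4.0]])
# A small gradient passes through unchanged.
small, small_norm = clip_by_global_norm([[0.1], [0.2]])
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.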

Learning Rate Schedule

Warmup (linear):
  0 → peak_lr over first N steps (N ≈ 2000)
  Prevents large gradient updates from random initialisation

Cosine decay:
  peak_lr → min_lr (typically peak_lr / 10)
  Schedule extends over full training

Example values:
  LLaMA 2 (7B/13B): peak_lr = 3e-4, min_lr = 3e-5 (final LR = 10% of peak)

Why cosine: smooth decay avoids abrupt changes;
            performance is more stable than step schedules
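The schedule described above can be written as a single function of the step count. This is a minimal sketch: the defaults for `peak_lr`, `min_lr`, and `warmup_steps` follow the values in the text, while `total_steps=100_000` is a hypothetical training length.

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5,
               warmup_steps=2000, total_steps=100_000):
    # Linear warmup from 0 to peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs longer
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

start = lr_at_step(0)           # 0.0 at the first step
peak = lr_at_step(2000)         # reaches peak_lr at the end of warmup
final = lr_at_step(100_000)     # decays to min_lr at the end of training
```

Halfway through the decay phase the learning rate sits at the midpoint between `peak_lr` and `min_lr`, which is the smooth behaviour the cosine shape is chosen for.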

What Pretraining Learns

Knowledge encoded in weights after pretraining:
  - Factual knowledge (capitals, dates, drug names, disease symptoms)
  - Syntactic patterns (grammar, sentence structure)
  - Semantic relationships (synonyms, analogies)
  - Code patterns (function signatures, algorithms)
  - Reasoning patterns (math steps, logical inferences)
  - Stylistic patterns (formal vs informal, domain vocabulary)

NOT learned during pretraining:
  - Following human instructions
  - Being helpful, harmless, honest
  - Refusing dangerous requests
  These require instruction fine-tuning + RLHF/DPO

Compute Budget Estimation

Using the Chinchilla scaling law approximation:

For a model with N parameters, the optimal training is with ~20·N tokens

Compute C ≈ 6 · N · T  FLOPs  (rough estimate)
  N = model parameters
  T = training tokens

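LLaMA 2 7B (2T training tokens):
  C ≈ 6 × (7×10⁹) × (2×10¹²) = 8.4×10²² FLOPs

  Note: 2T tokens is ~286 tokens per parameter, far more than the ~20 per
  parameter (~140B tokens) that the Chinchilla rule suggests. Training past the
  compute-optimal point costs more compute but yields a stronger small model
  that is cheaper to serve.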
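The two rules of thumb above are easy to turn into a small calculator. The function names are illustrative, and the per-GPU throughput passed to `gpu_hours` is a hypothetical placeholder, not a measured figure.

```python
def training_flops(n_params: int, n_tokens: int) -> int:
    # C ≈ 6 · N · T (rough estimate)
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    # Roughly 20 training tokens per parameter.
    return tokens_per_param * n_params

def gpu_hours(total_flops: float, achieved_flops_per_gpu: float) -> float:
    # achieved_flops_per_gpu is sustained FLOP/s per GPU, well below peak in practice.
    return total_flops / achieved_flops_per_gpu / 3600.0

n_params = 7 * 10**9      # 7B parameters
n_tokens = 2 * 10**12     # 2T training tokens

flops = training_flops(n_params, n_tokens)        # 8.4e22 FLOPs
optimal = chinchilla_optimal_tokens(n_params)     # 140B tokens
ratio = n_tokens / optimal                        # how far past compute-optimal

# Hypothetical sustained throughput of 1e14 FLOP/s per GPU.
hours = gpu_hours(flops, 1e14)
```

Plugging in other model sizes shows how quickly the budget grows: compute scales with the product of parameters and tokens, so doubling both quadruples the cost.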

Interview Answer

"LLM pretraining is causal language modelling: predict the next token given all previous tokens, with cross-entropy loss at every position. Training data is dominated by filtered web crawl (CommonCrawl) supplemented by high-quality sources like Wikipedia, books, and code — high-quality sources are upsampled. Training runs on thousands of GPUs using data, tensor, and pipeline parallelism with mixed precision (bf16). Pretraining encodes factual knowledge, syntactic patterns, and reasoning templates — but not instruction-following or safety behaviours, which require fine-tuning and alignment."