LLMs Deep Dive · Lesson 2 of 24
Tokenization and Byte-Pair Encoding
Why Tokenise?
LLMs don't operate on characters or words; they operate on tokens, subword units that balance vocabulary size against sequence length. Compare three granularities:
Too fine (characters):
"Warfarin" → ['W', 'a', 'r', 'f', 'a', 'r', 'i', 'n']
8 tokens for one word — too long, no morphological structure
Too coarse (words):
"Warfarins" → unknown if not in vocabulary
Huge vocabulary, many rare words get [UNK]
Subword (BPE):
"Warfarin" → ['War', 'far', 'in'] or ['Warf', 'arin']
~3 tokens, handles rare words, finite vocabulary
Byte-Pair Encoding (BPE)
BPE is a data compression algorithm repurposed for subword segmentation:
Training algorithm:
1. Start with character-level vocabulary + end-of-word markers
{"W", "a", "r", "f", "i", "n", " " ...}
2. Count all adjacent symbol pairs in the corpus
3. Merge the most frequent pair into a new symbol
"t" + "h" → "th" (if "th" is most frequent pair)
4. Repeat for N merge operations (N = vocab size - base chars)
5. Final vocabulary = base chars + N merged symbols
The vocabulary size is a hyperparameter (e.g., 32K for LLaMA and LLaMA 2, about 50K for GPT-2, about 128K for LLaMA 3).
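The training steps above can be sketched in plain Python. This is a minimal, unoptimised illustration rather than how production tokenizers are implemented: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`), the `</w>` end-of-word marker, and the toy word counts are all assumptions made for the example, and ties between equally frequent pairs are broken by first-seen order.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Step 1: split each word into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)    # Step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # Step 3: most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)               # Step 4: repeat
    return merges, vocab

# Hypothetical toy word counts, chosen for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The learned merge list, applied in order, is what a BPE tokenizer replays at inference time to segment new text.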
BPE Training Example
Corpus (toy, given as word counts): low ×5, lower ×2, newest ×6, widest ×3
Initial splits with end-of-word marker </w>:
l-o-w-</w>: 5, l-o-w-e-r-</w>: 2, n-e-w-e-s-t-</w>: 6, w-i-d-e-s-t-</w>: 3
Most frequent pair: (e, s), 9 occurrences, tied with (s, t) and (t, </w>); ties are broken arbitrarily → merge to "es"
n-e-w-es-t-</w>: 6, w-i-d-es-t-</w>: 3
Most frequent pair: (es, t), 9 occurrences → merge to "est"
n-e-w-est-</w>: 6, w-i-d-est-</w>: 3
Most frequent pair: (est, </w>), 9 occurrences → merge to "est</w>"
...continues...
After N merges, "newest" might be: ["new", "est</w>"]
Byte-Level BPE
GPT-2 and GPT-3 use byte-level BPE: start with 256 bytes (not characters) as the base vocabulary, ensuring any text is representable with no [UNK] token.
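To see why a 256-symbol base vocabulary covers any input, it helps to look at the raw UTF-8 bytes of a few strings. The sketch below is only an illustration of that idea: `to_byte_symbols` is a hypothetical helper, and real byte-level tokenizers such as GPT-2's additionally remap each byte to a printable character before applying merges.

```python
def to_byte_symbols(text):
    """Map text to its UTF-8 byte values, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))

for sample in ["Warfarin", "naïve", "药"]:
    ids = to_byte_symbols(sample)
    # Every symbol falls in 0..255, so no [UNK] token is ever needed.
    assert all(0 <= b < 256 for b in ids)
    print(sample, len(sample), "chars ->", len(ids), "bytes:", ids)
```

Non-ASCII characters expand to several bytes each, which is one reason text outside the training distribution tends to cost more tokens.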
from transformers import AutoTokenizer

# Load GPT-2's byte-level BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The patient takes Warfarin 5mg daily"
tokens = tokenizer.encode(text)  # list of integer token IDs
decoded = [tokenizer.decode([t]) for t in tokens]  # one string per token
print(decoded)
# Example output (exact splits depend on the learned merges):
# ['The', ' patient', ' takes', ' War', 'far', 'in', ' 5', 'mg', ' daily']
# Note: a space is attached to the start of the following token (GPT-2 convention)
SentencePiece and Unigram
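SentencePiece supports both BPE and a Unigram language model. LLaMA and LLaMA 2 use SentencePiece in BPE mode, while models such as T5 and ALBERT use the Unigram model: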
Unigram LM approach:
1. Start with a large vocabulary
2. Remove the symbols whose removal reduces corpus log-likelihood the least
3. Repeat until target vocabulary size
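The pruning loop above depends on scoring how well the current vocabulary segments the corpus. As a sketch of that scoring step only (not the full training procedure), the following finds the most probable segmentation of a word with Viterbi dynamic programming. The function name `viterbi_segment` and the piece probabilities are hypothetical, chosen purely for illustration.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the pieces in order.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical piece probabilities, for illustration only.
probs = {"new": 0.05, "est": 0.04, "ne": 0.02, "west": 0.01,
         "n": 0.03, "e": 0.06, "w": 0.02, "s": 0.05, "t": 0.05}
log_probs = {piece: math.log(p) for piece, p in probs.items()}
pieces, score = viterbi_segment("newest", log_probs)
print(pieces)  # ['new', 'est']
```

Because every segmentation has a probability, a Unigram tokenizer can also sample alternative segmentations instead of always taking the best one.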
Differences from BPE:
- Probabilistic segmentation (multiple segmentations possible with probabilities)
- More principled vocabulary selection
- Language-agnostic (no word boundaries needed; spaces are treated as ordinary characters)
Why Tokenisation Matters
1. Context length efficiency:
"antidisestablishmentarianism" = 1 word = 5 GPT-4 tokens
Chinese, Japanese, and Korean text often tokenises less efficiently (roughly 1 token or more per character with English-centric vocabularies)
Same 4K context window = far fewer "words" for Chinese vs English
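2. Arithmetic failures:
"9.11" → ["9", ".", "1", "1"] with digit-splitting tokenisers (others may give ["9", ".", "11"]); either way the model doesn't see the number as a unit
Multi-digit addition depends on how digits are grouped into tokens, and inconsistent grouping obscures place value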
3. Prompt length costs:
Every token costs compute and memory
Long system prompts are expensive at scale
4. Medical vocabulary:
"hydroxychloroquine" → 4-6 subword tokens
Tokenisers trained on domain text (e.g., PubMedBERT, SciBERT) may segment more naturally
Interview Answer
"Tokenisation converts text into a sequence of integer IDs that the model processes. Byte-Pair Encoding builds a vocabulary by iteratively merging the most frequent adjacent symbol pair in the corpus, starting from bytes or characters and building up to common subwords. The vocabulary size is a hyperparameter (typically 32K–128K). Byte-level BPE (GPT-2, GPT-3) uses all 256 byte values as the base, eliminating unknown tokens. LLaMA and LLaMA 2 use SentencePiece in BPE mode; the Unigram model, used by T5 and ALBERT, instead starts from a large candidate vocabulary and prunes it. Tokenisation choices affect arithmetic performance, multilingual efficiency, context length utilisation, and medical/scientific vocabulary handling."