Tokenisation and Byte-Pair Encoding
How text is split into tokens, how BPE builds its vocabulary, why the choice of tokeniser matters, and how to inspect tokenisations in practice.
Why Tokenise?
LLMs don't operate on characters or words; they operate on tokens, subword units that balance vocabulary size against sequence length:
Too fine (characters):
"Warfarin" → ['W', 'a', 'r', 'f', 'a', 'r', 'i', 'n']
8 tokens for one word — too long, no morphological structure
Too coarse (words):
"Warfarins" → unknown if not in vocabulary
Huge vocabulary, many rare words get [UNK]
Subword (BPE):
"Warfarin" → ['War', 'far', 'in'] or ['Warf', 'arin']
~3 tokens, handles rare words, finite vocabulary
Byte-Pair Encoding (BPE)
BPE is a data compression algorithm repurposed for subword segmentation:
Training algorithm:
1. Start with character-level vocabulary + end-of-word markers
{"W", "a", "r", "f", "i", "n", " " ...}
2. Count all adjacent symbol pairs in the corpus
3. Merge the most frequent pair into a new symbol
"t" + "h" → "th" (if "th" is most frequent pair)
4. Repeat for N merge operations (N = vocab size - base chars)
5. Final vocabulary = base chars + N merged symbols
The vocabulary size is a hyperparameter (e.g., about 32K for LLaMA and LLaMA 2, about 50K for GPT-2, about 128K for LLaMA 3).
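The training steps above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function names (`merge_symbols`, `train_bpe`, `segment`) are hypothetical, the input is assumed to be a word-frequency table, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter

END = "</w>"  # end-of-word marker appended to every word

def merge_symbols(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} table."""
    # Step 1: split every word into characters plus the end marker.
    vocab = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair (first counted wins on ties).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_symbols(s, best): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    """Tokenise a new word by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return list(symbols)

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(word_freqs, num_merges=3)
print(merges)                     # [('e', 's'), ('es', 't'), ('est', '</w>')]
print(segment("lowest", merges))  # ['l', 'o', 'w', 'est</w>']
```

Note that "lowest" never appears in the training table, yet it still segments cleanly: the learned suffix "est</w>" is reused and the rest falls back to single characters.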
BPE Training Example
Corpus (toy, word: count): low: 5, lower: 2, newest: 6, widest: 3
Initial splits with end marker </w>:
l-o-w-</w>: 5, l-o-w-e-r-</w>: 2, n-e-w-e-s-t-</w>: 6, w-i-d-e-s-t-</w>: 3
Most frequent pair: (e, s), count 9 (tied with (s, t) and (t, </w>); ties are broken arbitrarily) → merge to "es"
n-e-w-es-t-</w>: 6, w-i-d-es-t-</w>: 3
Most frequent pair: (es, t) → merge to "est"
n-e-w-est-</w>: 6, w-i-d-est-</w>: 3
Most frequent pair: (est, </w>) → merge to "est</w>"
...continues...
After N merges, "newest" might be: ["new", "est</w>"]
Byte-Level BPE
GPT-2 and GPT-3 use byte-level BPE: start with 256 bytes (not characters) as the base vocabulary, ensuring any text is representable with no [UNK] token.
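To see why a 256-value base vocabulary leaves nothing unrepresentable, here is a small sketch that uses only Python's built-in UTF-8 encoding. No tokenizer library is involved, and `to_base_symbols` is a hypothetical helper name chosen for illustration.

```python
def to_base_symbols(text):
    """Map text to its UTF-8 byte values, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))

for sample in ["in", "é", "药"]:
    ids = to_base_symbols(sample)
    # Every base symbol fits in the 256-entry byte vocabulary.
    assert all(0 <= b < 256 for b in ids)
    print(repr(sample), len(ids), "bytes:", ids)
```

ASCII characters take one byte each, while accented and CJK characters take two or three, so non-English text starts from a longer base sequence before any merges are applied.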
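from transformers import AutoTokenizer

# Load the GPT-2 byte-level BPE tokenizer (downloaded on first use)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The patient takes Warfarin 5mg daily"
tokens = tokenizer.encode(text)  # list of integer token IDs
# Decode each ID on its own to see the piece of text it covers
decoded = [tokenizer.decode([t]) for t in tokens]
print(decoded)
# Example output (exact splits depend on the tokenizer's learned merges):
# ['The', ' patient', ' takes', ' War', 'far', 'in', ' 5', 'mg', ' daily']
# Note: a space is attached to the start of the following token (GPT-2 convention)
SentencePiece and Unigram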
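SentencePiece is a tokeniser library that supports both BPE and a Unigram language model. LLaMA and LLaMA 2 use SentencePiece with BPE; models such as T5 and XLNet use SentencePiece with the Unigram model: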
Unigram LM approach:
1. Start with a large vocabulary
2. Remove the symbols whose removal reduces corpus log-likelihood the least
3. Repeat until target vocabulary size
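The pruning loop above depends on being able to score segmentations under a unigram model. A minimal sketch of that scoring step follows: it finds the single most probable segmentation with dynamic programming (Viterbi). The token probabilities are hypothetical values chosen only for illustration, and `viterbi_segment` is not a SentencePiece function.

```python
import math

def viterbi_segment(text, logp):
    """Return the highest-probability segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

# Hypothetical unigram probabilities (not taken from any real model).
probs = {"war": 0.05, "far": 0.04, "in": 0.10, "warf": 0.01, "arin": 0.01,
         "w": 0.02, "a": 0.03, "r": 0.03, "f": 0.02, "i": 0.03, "n": 0.03}
logp = {tok: math.log(p) for tok, p in probs.items()}

tokens, score = viterbi_segment("warfarin", logp)
print(tokens)  # ['war', 'far', 'in']
```

Under these made-up probabilities, "war" + "far" + "in" (0.05 × 0.04 × 0.10) beats "warf" + "arin" (0.01 × 0.01), which is the sense in which Unigram segmentation is probabilistic rather than rule-replay as in BPE.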
Differences from BPE:
- Probabilistic segmentation (multiple segmentations possible with probabilities)
- More principled vocabulary selection
- Language-agnostic (no word boundaries needed; spaces are treated as ordinary characters)
Why Tokenisation Matters
1. Context length efficiency:
"antidisestablishmentarianism" = 1 word = 5 GPT-4 tokens
Chinese, Japanese and Korean text often tokenises less efficiently (roughly 1 token or more per character)
Same 4K context window = far fewer "words" for Chinese vs English
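2. Arithmetic failures:
"9.11" → ["9", ".", "11"] or ["9", ".", "1", "1"], depending on the tokeniser; either way the model doesn't see the number as a unit
Multi-digit arithmetic depends on how digits are grouped into tokens, and inconsistent grouping makes place value harder to learn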
3. Prompt length costs:
Every token costs compute and memory
Long system prompts are expensive at scale
4. Medical vocabulary:
"hydroxychloroquine" → 4-6 subword tokens
Tokenisers with domain-specific vocabularies (e.g., PubMedBERT, SciBERT) may segment more naturally
Interview Answer
"Tokenisation converts text into a sequence of integer IDs that the model processes. Byte-Pair Encoding builds a vocabulary by iteratively merging the most frequent adjacent symbol pair in the corpus, starting from bytes or characters and building up to common subwords. The vocabulary size is a hyperparameter (typically 32K–128K). Byte-level BPE (GPT-2, GPT-3) uses all 256 byte values as the base, eliminating unknown tokens. LLaMA and LLaMA 2 use SentencePiece with BPE, while models such as T5 use SentencePiece with a Unigram model. Tokenisation choices affect arithmetic performance, multilingual efficiency, context length utilisation, and medical/scientific vocabulary handling."