Tokenisation and Byte-Pair Encoding

How text is split into tokens, how BPE builds its vocabulary, why the choice of tokeniser matters, and how to inspect tokenisations in practice.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: LLMs, Tokenisation, BPE, NLP, Interview

Why Tokenise?

LLMs operate on tokens rather than on raw characters or whole words. Tokens are subword units that trade off vocabulary size against sequence length:

Too fine (characters):
  "Warfarin" → ['W', 'a', 'r', 'f', 'a', 'r', 'i', 'n']
  8 tokens for one word — too long, no morphological structure

Too coarse (words):
  "Warfarins" → unknown if not in vocabulary
  Huge vocabulary, many rare words get [UNK]

Subword (BPE):
  "Warfarin" → ['War', 'far', 'in']  or  ['Warf', 'arin']
  ~3 tokens, handles rare words, finite vocabulary
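
The two extremes above can be compared with a short sketch. The tiny word vocabulary here is a made-up illustration, not taken from any real tokeniser, and the function names are hypothetical.

```python
# Hypothetical toy vocabulary; a real word-level vocabulary would be far larger.
WORD_VOCAB = {"the", "patient", "takes", "warfarin", "daily"}

def char_tokenise(text: str) -> list[str]:
    """Character-level: one token per character."""
    return list(text)

def word_tokenise(text: str) -> list[str]:
    """Word-level: whitespace split, out-of-vocabulary words become [UNK]."""
    return [w if w.lower() in WORD_VOCAB else "[UNK]" for w in text.split()]

chars = char_tokenise("Warfarin")
words = word_tokenise("The patient takes Warfarins daily")

print(len(chars), chars)  # 8 single-character tokens
print(words)              # "Warfarins" is not in the vocabulary, so it maps to [UNK]
```

Subword tokenisers sit between these two: they keep the vocabulary finite while still giving a rare word like "Warfarins" a meaningful segmentation.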

Byte-Pair Encoding (BPE)

BPE is a data compression algorithm repurposed for subword segmentation:

Training algorithm:
1. Start with character-level vocabulary + end-of-word markers
   {"W", "a", "r", "f", "i", "n", " " ...}

2. Count all adjacent symbol pairs in the corpus

3. Merge the most frequent pair into a new symbol
   "t" + "h" → "th"  (if "th" is most frequent pair)

4. Repeat for N merge operations (N = vocab size - base chars)

5. Final vocabulary = base chars + N merged symbols

The vocabulary size is a hyperparameter (e.g., 32K for LLaMA and Llama 2, about 50K for GPT-2, about 100K for GPT-4, 128K for Llama 3).
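
The training steps can be sketched in a few lines of Python. This is a minimal illustration, not a production tokeniser: the function names, the `</w>` end-of-word marker, and the toy word frequencies are assumptions chosen for the example, and ties between equally frequent pairs are simply resolved in favour of the pair seen first.

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict[tuple[str, ...], int]) -> dict[tuple[str, ...], int]:
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs: dict[str, int], num_merges: int):
    """Learn up to `num_merges` merges from a word-frequency table."""
    # Step 1: split each word into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)       # step 2: count pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # step 3: most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)                  # step 4: repeat
    return merges, vocab

# Toy word frequencies (assumed for illustration).
merges, vocab = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

The ordered list of merges is the learned artefact: together with the base symbols it defines the final vocabulary, and it is replayed in the same order when new text is tokenised.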


BPE Training Example

Corpus (toy), as word counts: "low" ×5, "lower" ×2, "newest" ×6, "widest" ×3

Initial splits with end-of-word marker </w>:
  l-o-w-</w>: 5, l-o-w-e-r-</w>: 2, n-e-w-e-s-t-</w>: 6, w-i-d-e-s-t-</w>: 3

Most frequent pair: (e, s), 9 occurrences (tied with (s, t) and (t, </w>); ties are broken arbitrarily) → merge to "es"
  n-e-w-es-t-</w>: 6, w-i-d-es-t-</w>: 3

Most frequent pair: (es, t) → merge to "est"
  n-e-w-est-</w>: 6, w-i-d-est-</w>: 3

Most frequent pair: (est, </w>) → merge to "est</w>"
...continues...

After N merges, "newest" might be: ["new", "est</w>"]
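
Once merges are learned, a new word is tokenised by replaying them in training order. The sketch below assumes the three merges from the walkthrough above and `</w>` as the end-of-word marker; `apply_merges` is a hypothetical helper name, not a library function.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment `word` by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The first three merges from the toy example, in the order they were learned.
toy_merges = [("e", "s"), ("es", "t"), ("est", "</w>")]

seen = apply_merges("newest", toy_merges)
unseen = apply_merges("lowest", toy_merges)
print(seen)    # ['n', 'e', 'w', 'est</w>']
print(unseen)  # ['l', 'o', 'w', 'est</w>']
```

"lowest" never appeared in the toy corpus, yet it still receives the "est</w>" suffix token. With more merges, prefixes such as "low" and "new" would also collapse into single tokens.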

Byte-Level BPE

GPT-2 and GPT-3 use byte-level BPE: start with 256 bytes (not characters) as the base vocabulary, ensuring any text is representable with no [UNK] token.

Python
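from transformers import AutoTokenizer

# Downloads the GPT-2 vocabulary and merge table on first use
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The patient takes Warfarin 5mg daily"
token_ids = tokenizer.encode(text)                    # list of integer token IDs
decoded = [tokenizer.decode([t]) for t in token_ids]  # one string per token
print(decoded)

# Illustrative output (exact splits depend on the learned merges):
# ['The', ' patient', ' takes', ' War', 'far', 'in', ' 5', 'mg', ' daily']
# Note: a space is attached to the start of the following token (GPT-2 convention)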
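
The byte-level base vocabulary can be seen without any tokeniser library. The sketch below only shows the starting point before any merges are applied; `to_byte_tokens` is a hypothetical helper, and real byte-level tokenisers add a merge table on top of these 256 base symbols.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))

english = to_byte_tokens("daily")
accented = to_byte_tokens("é")
chinese = to_byte_tokens("华法林")  # three Chinese characters

print(english)       # [100, 97, 105, 108, 121]
print(accented)      # [195, 169]
print(len(chinese))  # 9, because each of these characters takes three bytes

# The mapping is lossless: the bytes decode back to the original string.
round_trip = bytes(chinese).decode("utf-8")
print(round_trip)
```

Because every string has a UTF-8 encoding, every string has a representation in terms of the 256 base symbols, which is why no [UNK] token is needed.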

SentencePiece and Unigram

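SentencePiece is a tokeniser library that supports both BPE and a Unigram language model. LLaMA and Llama 2 use SentencePiece in BPE mode, while models such as T5, ALBERT, and XLNet use its Unigram model: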

Unigram LM approach:
1. Start with a large candidate vocabulary (for example, all frequent substrings)
2. Remove the symbols whose removal reduces corpus log-likelihood the least
3. Repeat until the target vocabulary size is reached

Differences from BPE:
  - Probabilistic segmentation (multiple segmentations possible with probabilities)
  - More principled vocabulary selection
  - Language-agnostic input handling (a SentencePiece feature that also applies to its BPE mode: no pre-split word boundaries, spaces treated as ordinary symbols)
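
Probabilistic segmentation can be illustrated with a small Viterbi search over a toy Unigram model. The piece probabilities below are invented for the example (a real model estimates them with EM during training), and `viterbi_segment` is a hypothetical helper rather than a SentencePiece API.

```python
import math

# Hypothetical piece probabilities; a real model learns these from a corpus.
PIECE_PROBS = {
    "war": 0.05, "far": 0.04, "in": 0.10, "warf": 0.01, "arin": 0.01,
    "w": 0.02, "a": 0.05, "r": 0.04, "f": 0.02, "i": 0.05, "n": 0.05,
}

def viterbi_segment(text: str, probs: dict[str, float]) -> tuple[list[str], float]:
    """Return the highest-probability segmentation and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs:
                score = best[start] + math.log(probs[piece])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

pieces, logp = viterbi_segment("warfarin", PIECE_PROBS)
print(pieces, round(logp, 3))
```

Under these made-up probabilities, "war" + "far" + "in" (probability 0.0002) beats "warf" + "arin" (probability 0.0001). Both segmentations are valid under the model, which is what allows Unigram tokenisers to sample alternative segmentations during training.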

Why Tokenisation Matters

1. Context length efficiency:
   "antidisestablishmentarianism" = 1 word = 5 GPT-4 tokens
   Chinese, Japanese, and Korean text often tokenises less efficiently with English-centric vocabularies (roughly 1 char ≈ 1 token or more)
   Same 4K context window = far fewer "words" for Chinese vs English

2. Arithmetic failures:
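   "9.11" → ["9", ".", "11"] (GPT-4) or ["9", ".", "1", "1"] (LLaMA, which splits numbers into single digits); either way the model doesn't see the number as a unit
   Multi-digit arithmetic is harder when the digit chunks do not line up with place value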

3. Prompt length costs:
   Every token costs compute and memory
   Long system prompts are expensive at scale

4. Medical vocabulary:
   "hydroxychloroquine" → 4-6 subword tokens
   Domain-specific vocabularies (e.g., PubMedBERT, SciBERT) may segment more naturally
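
The arithmetic point can be made concrete with two simplified digit-splitting rules. The regular expressions below are stand-ins written for this example, not the actual pre-tokenisation patterns of any model: one isolates every digit, the other groups digits left to right in runs of up to three.

```python
import re

def split_digits_singly(text: str) -> list[str]:
    """Every digit becomes its own piece."""
    return re.findall(r"\d|[^\d\s]+|\s+", text)

def split_digits_in_threes(text: str) -> list[str]:
    """Digits are grouped left to right in runs of up to three."""
    return re.findall(r"\d{1,3}|[^\d\s]+|\s+", text)

single = split_digits_singly("9.11")
grouped = split_digits_in_threes("9.11")
large = split_digits_in_threes("1234567")

print(single)   # ['9', '.', '1', '1']
print(grouped)  # ['9', '.', '11']
print(large)    # ['123', '456', '7']
```

Note that left-to-right grouping splits 1234567 as 123|456|7 rather than 1|234|567, so the chunks do not align with thousands separators. This kind of misalignment is one reason multi-digit arithmetic is sensitive to the tokeniser.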

Interview Answer

"Tokenisation converts text into a sequence of integer IDs that the model processes. Byte-Pair Encoding builds a vocabulary by iteratively merging the most frequent adjacent symbol pair in the corpus, starting from bytes or characters and building up to common subwords. The vocabulary size is a hyperparameter (typically 32K to 128K). Byte-level BPE (GPT-2, GPT-3) uses all 256 byte values as the base, eliminating unknown tokens. LLaMA and Llama 2 use SentencePiece in BPE mode, while models such as T5 use SentencePiece's Unigram model. Tokenisation choices affect arithmetic performance, multilingual efficiency, context length utilisation, and medical/scientific vocabulary handling."
