Tokenisation and Byte-Pair Encoding

Why Tokenise?

LLMs don't operate on characters or words. They operate on tokens: subword units that trade vocabulary size against sequence length. The three granularities compare as follows:

Too fine (characters):
  "Warfarin" → ['W', 'a', 'r', 'f', 'a', 'r', 'i', 'n']
  8 tokens for one word — too long, no morphological structure

Too coarse (words):
  "Warfarins" → unknown if not in vocabulary
  Huge vocabulary, many rare words get [UNK]

Subword (BPE):
  "Warfarin" → ['War', 'far', 'in']  or  ['Warf', 'arin']
  ~3 tokens, handles rare words, finite vocabulary
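
The three granularities above can be compared in a few lines. This is a minimal sketch: the vocabularies are made up for illustration, and the subword function uses greedy longest-match segmentation as a stand-in, not the BPE merge procedure described in the next section.

```python
def char_tokens(word):
    """Character-level: one token per character."""
    return list(word)

def word_tokens(word, vocab):
    """Word-level: the whole word or an unknown marker."""
    return [word] if word in vocab else ["[UNK]"]

def greedy_subword_tokens(word, vocab):
    """Longest-match-first segmentation (a simplification, not BPE)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: emit an unknown marker for this character.
            tokens.append("[UNK]")
            i += 1
    return tokens

# Hypothetical toy vocabularies, chosen only for this example.
word_vocab = {"Warfarin", "patient"}
subword_vocab = {"War", "far", "in", "s"}

print(char_tokens("Warfarin"))                            # 8 single-character tokens
print(word_tokens("Warfarins", word_vocab))               # ['[UNK]']
print(greedy_subword_tokens("Warfarins", subword_vocab))  # ['War', 'far', 'in', 's']
```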

Byte-Pair Encoding (BPE)

BPE is a data compression algorithm repurposed for subword segmentation:

Training algorithm:
1. Start with character-level vocabulary + end-of-word markers
   {"W", "a", "r", "f", "i", "n", " " ...}

2. Count all adjacent symbol pairs in the corpus

3. Merge the most frequent pair into a new symbol
   "t" + "h" → "th"  (if "th" is most frequent pair)

4. Repeat for N merge operations (N = vocab size - base chars)

5. Final vocabulary = base chars + N merged symbols

The vocabulary size is a hyperparameter (e.g., 32K for LLaMA and LLaMA 2, ~50K for GPT-2, ~100K for GPT-4, 128K for LLaMA 3).
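
The training loop above can be sketched in plain Python. This is a minimal illustration, not a production tokeniser: the function names are made up here, the input is assumed to be a word-to-frequency dictionary, and ties between equally frequent pairs are broken by first-seen order, which real implementations may handle differently.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a word-frequency dictionary."""
    # Step 1: split each word into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)       # step 2
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # step 3 (first-seen wins ties)
        vocab = merge_pair(best, vocab)
        merges.append(best)                  # step 4: repeat
    return merges, vocab

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The ordered list of merges is the trained artefact: the final vocabulary is the base symbols plus one new symbol per merge.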

BPE Training Example

Corpus (toy, given as word counts): low ×5, lower ×2, newest ×6, widest ×3

Initial splits with end-of-word marker </w>:
  l-o-w-</w>: 5, l-o-w-e-r-</w>: 2, n-e-w-e-s-t-</w>: 6, w-i-d-e-s-t-</w>: 3

Most frequent pair: (e, s), count 9 → merge to "es"
(tied with (s, t) and (t, </w>); the tie-breaking rule is implementation-specific)
  n-e-w-es-t-</w>: 6, w-i-d-es-t-</w>: 3

Most frequent pair: (es, t), count 9 → merge to "est"
  n-e-w-est-</w>: 6, w-i-d-est-</w>: 3

Most frequent pair: (est, </w>), count 9 → merge to "est</w>"
...continues...

After N merges, "newest" might be: ["new", "est</w>"]
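
Once the merges are learned, a new word is segmented by replaying them in the order they were learned. A minimal sketch, assuming the first three merges from the walkthrough above and a hypothetical helper name `apply_merges`:

```python
def apply_merges(word, merges):
    """Segment `word` by applying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The first three merges from the toy example.
toy_merges = [("e", "s"), ("es", "t"), ("est", "</w>")]

print(apply_merges("newest", toy_merges))  # ['n', 'e', 'w', 'est</w>']
print(apply_merges("lowest", toy_merges))  # ['l', 'o', 'w', 'est</w>']
```

"lowest" never appeared in the toy corpus, yet it still picks up the shared "est</w>" suffix. That reuse is how BPE handles rare and unseen words.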

Byte-Level BPE

GPT-2 and GPT-3 use byte-level BPE: start with 256 bytes (not characters) as the base vocabulary, ensuring any text is representable with no [UNK] token.
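
The "no [UNK]" property follows directly from UTF-8: every string maps to a sequence of byte values in the range 0-255, so a 256-entry base vocabulary covers all input. A small sketch of that base step only (no merges are applied, and `to_byte_ids` is a name made up for this example):

```python
def to_byte_ids(text):
    """Map text to its UTF-8 byte values, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))

sample = "naïve 5mg"
ids = to_byte_ids(sample)

# Every ID fits in the 256-symbol base vocabulary, including the non-ASCII "ï".
print(all(0 <= i < 256 for i in ids))       # True
# "ï" takes two bytes, so there are more IDs than characters.
print(len(sample), len(ids))                # 9 10
# The mapping is lossless.
print(bytes(ids).decode("utf-8") == sample) # True
```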

Python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The patient takes Warfarin 5mg daily"
tokens = tokenizer.encode(text)                      # list of integer token IDs
decoded = [tokenizer.decode([t]) for t in tokens]    # one string per token
print(decoded)

# Illustrative output (exact splits depend on the learned merges):
# ['The', ' patient', ' takes', ' War', 'far', 'in', ' 5', 'mg', ' daily']
# Note: a leading space is part of the following token (GPT-2 convention)

SentencePiece and Unigram

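SentencePiece is a tokeniser library that supports both BPE and Unigram models. LLaMA and LLaMA 2 use SentencePiece in BPE mode; models such as T5 and ALBERT use SentencePiece with a Unigram language model: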

Unigram LM approach:
1. Start with a large candidate vocabulary and estimate a probability for each symbol
2. Remove the symbols whose removal reduces corpus log-likelihood the least
3. Repeat until the target vocabulary size is reached

Differences from BPE:
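  - Probabilistic segmentation (several segmentations are possible, each with a probability)
  - Vocabulary chosen by a likelihood objective rather than by raw pair frequency
  - Language-agnostic when run through SentencePiece (no pre-split word boundaries needed; spaces are treated as ordinary symbols)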
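
Under a Unigram model, the score of a segmentation is the product of its piece probabilities, and the best segmentation can be found with Viterbi-style dynamic programming. The sketch below assumes a made-up toy vocabulary with invented probabilities; it shows only the decoding step, not how the vocabulary is trained or pruned.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

# Hypothetical piece probabilities, invented for illustration only.
toy_probs = {"new": 0.25, "est": 0.25, "west": 0.1, "ne": 0.1,
             "n": 0.05, "e": 0.1, "w": 0.05, "s": 0.05, "t": 0.05}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("newest", toy_log_probs)
print(pieces)  # ['new', 'est']
```

Here "new" + "est" scores 0.25 × 0.25 = 0.0625, which beats "ne" + "west" at 0.1 × 0.1 = 0.01. Sampling from the alternatives instead of always taking the best one is what enables subword regularisation.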

Why Tokenisation Matters

1. Context length efficiency:
   "antidisestablishmentarianism" = 1 word ≈ 5 GPT-4 tokens
   Many non-Latin-script languages (e.g., Chinese, Japanese, Hindi) tokenise less efficiently (for Chinese, roughly 1 token per character or more)
   The same 4K context window therefore holds far fewer "words" of Chinese than of English

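2. Arithmetic failures:
   "9.11" → ["9", ".", "11"] or ["9", ".", "1", "1"], depending on the tokeniser; either way the model doesn't see the number as a unit
   Multi-digit arithmetic depends on how digits are grouped into tokens, and inconsistent grouping obscures place value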

3. Prompt length costs:
   Every token costs compute and memory
   Long system prompts are expensive at scale

4. Medical vocabulary:
   "hydroxychloroquine" → 4-6 subword tokens
   Models with domain-specific vocabularies (e.g., PubMedBERT, SciBERT) may segment such terms more naturally
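
One reason for the multilingual gap in point 1 is visible before any merges happen: a byte-level tokeniser starts from UTF-8 bytes, and common Chinese characters take three bytes each, while ASCII letters take one. The sketch below measures that starting cost only. It is a rough proxy, not a token count, since the learned merges decide the final number of tokens; the Chinese sentence is an illustrative rendering of the English one.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character (base symbols before merges)."""
    return len(text.encode("utf-8")) / len(text)

english = "The patient takes warfarin daily"
chinese = "患者每天服用华法林"  # illustrative sentence, assumed for this example

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(chinese))  # 3.0
```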

Interview Answer

"Tokenisation converts text into a sequence of integer IDs that the model processes. Byte-Pair Encoding builds a vocabulary by iteratively merging the most frequent adjacent symbol pair in the corpus, starting from bytes or characters and building up to common subwords. The vocabulary size is a hyperparameter (roughly 32K–128K in current models). Byte-level BPE (GPT-2, GPT-3) uses all 256 byte values as the base, eliminating unknown tokens. LLaMA and LLaMA 2 use SentencePiece in BPE mode, while models such as T5 use SentencePiece with a Unigram model. Tokenisation choices affect arithmetic performance, multilingual efficiency, context length utilisation, and medical/scientific vocabulary handling."