Tokenization Deep Dive
BPE, WordPiece, SentencePiece — how tokenizers work, why vocabulary size matters, and the surprising impact of tokenization on model quality across languages.
Tokenization sits between raw text and model computation. It's often treated as a preprocessing detail, but tokenization choices have profound consequences: they determine what the model can represent, how efficiently it processes different languages, and whether arithmetic works. This article goes deep on how tokenizers work and why the choices matter.
1. Why Not Just Use Characters or Words?
Character-Level Tokenization
Pros: No out-of-vocabulary problem; handles any language or character.
Cons: Sequences become very long. "Hello world" is 11 characters but only 2 words. Longer sequences mean more computation (self-attention cost grows quadratically with sequence length) and harder learning: the model must learn that "h", "e", "l", "l", "o" together form a greeting.
Word-Level Tokenization
Pros: Short sequences; intuitive.
Cons: Enormous vocabulary (English alone has 170,000+ words). Words like "unhappiness", "unhappy", "happily" appear as unrelated tokens — no morphological sharing. New words (proper nouns, slang, technical terms) become [UNK].
Subword Tokenization: The Sweet Spot
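Subword methods split words into reusable pieces: "unhappiness" → ["un", "happi", "ness"]. This balances sequence length against vocabulary coverage and lets related words share pieces. Learned splits follow corpus statistics, so they often approximate morpheme boundaries but do not always match them.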
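To make the three granularities concrete, here is a minimal sketch that tokenizes the same text at character, word, and subword level. The vocabularies (`word_vocab`, `pieces`) and the greedy longest-match splitter are assumptions chosen for illustration; they are not taken from any real tokenizer.

```python
def char_tokens(text: str) -> list[str]:
    """One token per character, spaces included."""
    return list(text)

def word_tokens(text: str, vocab: set[str]) -> list[str]:
    """One token per whitespace-separated word; unknown words become [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]

def subword_tokens(text: str, pieces: set[str]) -> list[str]:
    """Greedy longest-match split of each word into known pieces."""
    out = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is a known piece
            while end > start and word[start:end] not in pieces:
                end -= 1
            if end == start:  # nothing matched: fall back to one character
                end = start + 1
            out.append(word[start:end])
            start = end
    return out

text = "unhappiness happens"
word_vocab = {"happens", "happy"}                 # hypothetical word vocabulary
pieces = {"un", "happi", "ness", "happen", "s"}   # hypothetical subword vocabulary

print(len(char_tokens(text)), "character tokens")
print(word_tokens(text, word_vocab))
print(subword_tokens(text, pieces))
```

The character version is long, the word version loses "unhappiness" entirely, and the subword version stays short while covering both words.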
2. Byte-Pair Encoding (BPE) — GPT Family
The Algorithm
BPE starts with a character vocabulary and iteratively merges the most frequent adjacent pair:
    import re
    from collections import defaultdict

    def get_vocab(corpus: list[str]) -> dict[str, int]:
        """Count words, stored as space-separated characters plus an end-of-word marker."""
        vocab = defaultdict(int)
        for sentence in corpus:
            for word in sentence.split():
                # "low" is stored as "l o w </w>"
                vocab[' '.join(word) + ' </w>'] += 1
        return dict(vocab)

    def get_stats(vocab: dict[str, int]) -> dict[tuple[str, str], int]:
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return dict(pairs)

    def merge_vocab(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
        """Merge the given pair wherever it occurs as two whole symbols."""
        # The lookarounds keep "a b" from matching inside "xa b" or "a bc",
        # which a plain str.replace would wrongly merge.
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        replacement = ''.join(pair)
        return {pattern.sub(lambda _: replacement, word): freq
                for word, freq in vocab.items()}

    def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
        vocab = get_vocab(corpus)
        merges = []
        for i in range(num_merges):
            pairs = get_stats(vocab)
            if not pairs:
                break
            # Ties go to the pair that was seen first
            best_pair = max(pairs, key=pairs.get)
            vocab = merge_vocab(best_pair, vocab)
            merges.append(best_pair)
            print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)}")
        return merges

    # Example
    corpus = [
        "low lower lowest",
        "new newer newest",
        "low new lower newer lowest newest",
    ]
    merges = train_bpe(corpus, num_merges=10)

GPT-2's Byte-Level BPE
GPT-2 took BPE one step further: it operates on raw bytes, not Unicode characters. Every possible byte (0-255) is in the base vocabulary. This means:
- No unknown tokens ever (any byte sequence is representable)
- Handles all languages, emoji, code, binary data
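- No Unicode normalization or language-specific preprocessing is required (GPT-2 still applies a regex pre-tokenizer that splits text into chunks before merging)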
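As a small illustration of what "operating on bytes" means, the sketch below shows the base symbols a byte-level tokenizer starts from before any merges are applied. The sample strings and the helper name `byte_symbols` are assumptions made for this example.

```python
def byte_symbols(text: str) -> list[int]:
    """Base symbols of a byte-level tokenizer: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))

# Hypothetical samples: the word "cat" in a few scripts, plus an emoji
samples = {
    "English": "cat",
    "Arabic": "قطة",
    "Chinese": "猫",
    "Emoji": "🐱",
}

for name, text in samples.items():
    symbols = byte_symbols(text)
    # Every symbol is in range(256), so nothing can be out of vocabulary
    assert all(0 <= b < 256 for b in symbols)
    # Decoding the bytes recovers the text exactly
    assert bytes(symbols).decode("utf-8") == text
    print(f"{name}: {len(text)} character(s) -> {len(symbols)} bytes")
```

The byte counts also hint at the efficiency issue discussed later: scripts whose characters need two to four UTF-8 bytes start from longer base sequences and depend on learned merges to shorten them.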
    # GPT-2 uses a mapping from raw bytes to unicode "safe" characters
    # This allows BPE to operate on text without special handling
    def bytes_to_unicode() -> dict[int, str]:
        """
        Return a dict mapping each byte value (0-255) to a printable unicode character.
        GPT-2 maps raw bytes to printable unicode characters for BPE.
        """
        # Bytes that are already printable keep their own code point
        bs = (list(range(ord('!'), ord('~') + 1))
              + list(range(ord('¡'), ord('¬') + 1))
              + list(range(ord('®'), ord('ÿ') + 1)))
        cs = bs[:]
        n = 0
        for b in range(256):
            if b not in bs:
                # Whitespace and control bytes are shifted to code points 256 and up
                bs.append(b)
                cs.append(256 + n)
                n += 1
        return dict(zip(bs, [chr(c) for c in cs]))

    # The result: every byte maps to a printable character
    # BPE then operates on these "safe" characters
    byte_map = bytes_to_unicode()
    print(f"Total byte mappings: {len(byte_map)}")  # 256

3. WordPiece — BERT
How WordPiece Differs from BPE
WordPiece also builds subword units iteratively, but instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data:
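    score(A, B) = freq(AB) / (freq(A) * freq(B))

This score is the ratio behind pointwise mutual information (PMI): up to a constant factor it equals exp(PMI), so ranking pairs by score is the same as ranking them by PMI. WordPiece prefers pairs that appear together more often than chance would predict, not simply the most frequent pairs.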
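The difference between the two merge criteria shows up even on tiny counts. The sketch below uses made-up frequencies (`unit_freq` and `pair_freq` are hypothetical) to compare the pair BPE would merge with the pair the score above would select.

```python
from collections import Counter

# Hypothetical counts: "t" and "h" are common everywhere,
# while "q" is rare and is always followed by "u".
unit_freq = Counter({"t": 100, "h": 80, "q": 5, "u": 40})
pair_freq = Counter({("t", "h"): 50, ("q", "u"): 5})

def wordpiece_score(pair: tuple[str, str]) -> float:
    """freq(AB) / (freq(A) * freq(B)), as in the formula above."""
    a, b = pair
    return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

bpe_choice = max(pair_freq, key=pair_freq.get)         # highest raw frequency
wordpiece_choice = max(pair_freq, key=wordpiece_score)  # highest association

print("BPE merges:", bpe_choice)
print("Score-based merge:", wordpiece_choice)
```

With these counts the frequency criterion picks ("t", "h") while the score picks ("q", "u"), because "q" almost never appears without "u".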
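Key difference: WordPiece marks continuation tokens with a ## prefix. Illustrative splits (actual output depends on the learned vocabulary, and frequent words are often kept whole):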
- "playing" → ["play", "##ing"]
- "unbelievable" → ["un", "##believ", "##able"]
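At encoding time, a WordPiece tokenizer splits each word greedily, always taking the longest vocabulary entry that matches at the current position. The following is a minimal sketch of that longest-match-first procedure; `toy_vocab` is a hypothetical vocabulary built to reproduce the splits above, not BERT's real one.

```python
def wordpiece_encode(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "##ing", "un", "##believ", "##able"}  # hypothetical

print(wordpiece_encode("playing", toy_vocab))
print(wordpiece_encode("unbelievable", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```

Note that a single unmatched position turns the entire word into [UNK] in this scheme, which is one reason vocabularies usually include all single characters.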
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    texts = [
        "The transformer architecture revolutionized NLP.",
        "Tokenization affects model performance significantly.",
        "Antidisestablishmentarianism is a long word.",
    ]
    for text in texts:
        tokens = tokenizer.tokenize(text)
        # encode() also adds the [CLS] and [SEP] special tokens
        ids = tokenizer.encode(text)
        print(f"Text: {text[:50]}")
        print(f"Tokens ({len(tokens)}): {tokens}")
        print(f"Token IDs: {ids[:10]}...")
        print()

The [CLS] and [SEP] Convention
BERT adds special tokens:
- [CLS] at the start: its final hidden state is used for classification
- [SEP] between and after sequences
- [MASK] for masked language modeling targets
In bert-base-uncased these are vocabulary entries 101, 102, and 103 ([PAD] is entry 0).
4. SentencePiece — T5, LLaMA, Gemma
Unigram Language Model Tokenization
SentencePiece (Kudo & Richardson, 2018) provides two algorithms: BPE (as above) and Unigram LM. Unigram LM:
- Starts with a large candidate vocabulary
- Scores each token by how much the corpus likelihood under a unigram language model would drop if it were removed, then prunes the tokens that matter least
- Repeats until the target vocabulary size is reached
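The unigram approach allows probabilistic tokenization: because every token has a probability, a word has several scored segmentations. The most probable one is used by default, and alternative segmentations can be sampled during training (subword regularization).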
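Given token probabilities, the best segmentation of a word under a unigram model can be found with dynamic programming. The sketch below is a minimal Viterbi search under that assumption; the probabilities in `toy_probs` are invented for illustration and do not come from a trained model.

```python
import math

def viterbi_segment(word: str, log_probs: dict[str, float]) -> list[str]:
    """Return the segmentation with the highest total log-probability."""
    n = len(word)
    # best[i] = (best score of word[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Follow the back-pointers from the end of the word
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical token probabilities
toy_probs = {"low": 0.2, "est": 0.1, "lowe": 0.01, "st": 0.05,
             "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "s": 0.05, "t": 0.05}
toy_log_probs = {tok: math.log(p) for tok, p in toy_probs.items()}

print(viterbi_segment("lowest", toy_log_probs))
```

Here "low" + "est" (probability 0.02) beats "lowe" + "st" (0.0005) and every finer split. Sampling from the scored alternatives instead of taking the maximum is what gives the probabilistic behavior described above.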
Why SentencePiece for Multilingual Models
SentencePiece operates directly on raw Unicode without language-specific preprocessing (no word splitting required). This makes it naturally language-agnostic.
    import sentencepiece as spm

    def train_sentencepiece_tokenizer(
        corpus_file: str,
        model_prefix: str,
        vocab_size: int = 32000,
        model_type: str = "bpe",  # or "unigram"
    ):
        spm.SentencePieceTrainer.train(
            input=corpus_file,
            model_prefix=model_prefix,
            vocab_size=vocab_size,
            model_type=model_type,
            # 0.9995 suits large character sets (e.g. Chinese, Japanese);
            # 1.0 is typical for languages with small alphabets
            character_coverage=0.9995,
            pad_id=0,
            unk_id=1,
            bos_id=2,  # beginning of sentence
            eos_id=3,  # end of sentence
            # LLaMA-style: byte fallback for unknown chars
            byte_fallback=True,
            normalization_rule_name="identity",  # no NFKC normalization
        )

    def load_and_tokenize(model_file: str, texts: list[str]):
        sp = spm.SentencePieceProcessor()
        sp.load(model_file)
        for text in texts:
            pieces = sp.encode(text, out_type=str)
            ids = sp.encode(text, out_type=int)
            print(f"Text: {text[:60]}")
            print(f"Pieces: {pieces}")
            print(f"IDs: {ids}")
            print(f"Decoded: {sp.decode(ids)}")
            print()

The Space/Underscore Token Mystery
In SentencePiece, spaces are represented as ▁ (U+2581, LOWER ONE EIGHTH BLOCK). This is why LLaMA tokenizer outputs look like:
    ["▁The", "▁cat", "▁sat", "▁on", "▁the", "▁mat"]

The ▁ prefix means "this token starts a new word." It encodes word boundaries without separate whitespace tokens. On decoding, each ▁ is turned back into a space.
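The round trip between text and ▁-marked pieces can be sketched in plain Python. This is a simplification that produces one piece per space-separated chunk (a real model would split further), and the helper names are hypothetical. Marking the first word as well mirrors SentencePiece's default of adding a dummy prefix space.

```python
META = "\u2581"  # the "▁" marker

def to_word_pieces(text: str) -> list[str]:
    """Simplified: one piece per space-separated chunk, each prefixed with the marker."""
    return [META + chunk for chunk in text.split(" ")]

def decode_pieces(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading space."""
    return "".join(pieces).replace(META, " ").removeprefix(" ")

pieces = to_word_pieces("The cat sat")
print(pieces)
print(decode_pieces(pieces))

# Pieces without the marker attach to the previous piece, so subword
# splits inside a word decode without a spurious space.
print(decode_pieces([META + "token", "ization", META + "works"]))
```

Because the marker carries the whitespace, decoding is plain concatenation and the original spacing is recoverable, including repeated spaces.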
5. Vocabulary Size: Impact on Model Quality
The Tradeoff
| Vocabulary Size | Sequence Length | Coverage | Parameter Cost |
|-----------------|-----------------|-----------|----------------|
| 8,000 | Long | Low | Small |
| 32,000 | Medium | Good | Medium |
| 50,257 | Short | Excellent | Large |
| 100,000+ | Shorter | Excellent | Very Large |
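The embedding matrix has shape (vocab_size, d_model). For GPT-3 (d_model=12288, vocab=50,257), that is about 617M parameters in embeddings alone, which is only about 0.35% of the 175B total. The share is much larger in small models: GPT-2 small (d_model=768) spends about 38.6M of its 124M parameters (roughly 31%) on the same vocabulary. Larger vocabularies consume more memory for embeddings and also enlarge the output projection over the vocabulary.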
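The embedding cost is a single multiplication, so it is easy to check for any configuration. In the sketch below, the total parameter counts (124M for GPT-2 small, 175B for GPT-3) are the commonly reported approximate figures and are assumptions of this example.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a (vocab_size, d_model) embedding matrix."""
    return vocab_size * d_model

# (vocab_size, d_model, approximate total parameters)
configs = {
    "GPT-2 small": (50_257, 768, 124e6),
    "GPT-3": (50_257, 12_288, 175e9),
}

for name, (vocab_size, d_model, total) in configs.items():
    emb = embedding_params(vocab_size, d_model)
    share = emb / total
    print(f"{name}: {emb:,} embedding parameters ({share:.2%} of total)")
```

The same vocabulary is a large fraction of a small model and a rounding error in a very large one, which is part of why larger models can afford larger vocabularies.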
Typical Vocabulary Sizes
| Model | Tokenizer | Vocab Size |
|-------|-----------|------------|
| BERT | WordPiece | 30,522 |
| GPT-2 | BPE | 50,257 |
| GPT-3 | BPE | 50,257 |
| GPT-4 (cl100k) | BPE | 100,277 |
| T5 | SentencePiece | 32,100 |
| LLaMA | SentencePiece | 32,000 |
| LLaMA 3 | BPE (tiktoken) | 128,256 |
| Gemma | SentencePiece | 256,128 |
Why GPT-4 Doubled the Vocabulary
The cl100k_base tokenizer (GPT-4, GPT-3.5-turbo) has 100,277 tokens vs 50,257 for GPT-2. The larger vocabulary:
- Reduces sequence length (more efficient attention)
- Better handles code (common identifiers are single tokens)
- Better handles non-English languages
- Treats numbers more consistently: digit runs are split into chunks of at most three digits
6. Fertility: Tokens Per Word Across Languages
Fertility measures how many tokens are needed to encode a word on average. Lower is better — it means shorter sequences and more efficient computation.
    from transformers import AutoTokenizer

    def measure_fertility(text_samples: dict[str, str], tokenizer_name: str) -> dict:
        """
        Measure tokens-per-word ratio across languages.
        Lower = more efficient = better for that language.
        """
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        results = {}
        for lang, text in text_samples.items():
            # Whitespace splitting is only a rough word count (see the note on Chinese)
            words = len(text.split())
            tokens = tokenizer.encode(text, add_special_tokens=False)
            fertility = len(tokens) / max(words, 1)
            results[lang] = {
                "words": words,
                "tokens": len(tokens),
                "fertility": round(fertility, 2),
            }
        return results

    # GPT-2 tokenizer is heavily biased toward English
    text_samples = {
        "English": "The quick brown fox jumps over the lazy dog",
        "French": "Le renard brun rapide saute par-dessus le chien paresseux",
        "German": "Der schnelle braune Fuchs springt über den faulen Hund",
        "Arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول",
        "Chinese": "快速的棕色狐狸跳过懒狗",
        "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है",
        "Python": "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    }
    results = measure_fertility(text_samples, "gpt2")

    # Rough expectations for the GPT-2 tokenizer (estimates, not measured here):
    # English: ~1.3 tokens/word (near-optimal for GPT-2)
    # French: ~1.5-1.8 tokens/word
    # German: ~1.4-1.6 tokens/word
    # Arabic: ~4-6 tokens/word (severe fragmentation)
    # Hindi: ~4-8 tokens/word (Devanagari severely underrepresented)
    # Chinese: written without spaces, so text.split() sees a single "word" and
    #          the ratio is inflated; compare tokens per character instead
    #          (often more than one token per character)

Practical Consequence
A model with an English-optimized tokenizer "wastes" tokens on non-English text:
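- An Arabic document can take several times as many tokens as the equivalent English text (around 4-5x under the rough estimates above)
- Correspondingly less content fits in the context window
- Computation per document grows at least in proportion, and faster in the attention layers, whose cost is quadratic in sequence length
- The result is higher per-token cost and effectively worse performance for Arabic speakers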
Multilingual models like mBERT, mT5, and BLOOM use vocabulary sizes of 100K-250K to improve fertility across languages.
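The effect of fertility on usable context can be estimated with simple arithmetic. The sketch below assumes fertilities of 1.3 (English) and 5.0 (Arabic), taken from the rough GPT-2 estimates earlier in this section, and a 1,024-token context window; all three numbers are illustrative inputs, not measurements.

```python
def words_that_fit(context_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / fertility)

CONTEXT_TOKENS = 1024                                  # GPT-2's context length
assumed_fertility = {"English": 1.3, "Arabic": 5.0}    # illustrative estimates

for lang, fertility in assumed_fertility.items():
    fit = words_that_fit(CONTEXT_TOKENS, fertility)
    print(f"{lang}: about {fit} words fit in {CONTEXT_TOKENS} tokens")

# Same content, different sequence lengths
length_ratio = assumed_fertility["Arabic"] / assumed_fertility["English"]
print(f"Sequence length ratio: {length_ratio:.1f}x")
# Attention cost grows with the square of the sequence length
print(f"Attention cost ratio: {length_ratio ** 2:.1f}x")
```

Under these assumptions the quadratic term makes the compute gap considerably wider than the token-count gap.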
7. How Tokenization Affects Arithmetic
One of the most surprising effects: GPT-2's tokenizer makes arithmetic harder.
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    numbers_to_test = ["123", "1234", "12345", "9999", "10000", "99999"]
    for num in numbers_to_test:
        tokens = tokenizer.tokenize(num)
        print(f"{num:>8} → {tokens}")

    # Output (approximate; exact splits depend on the learned merges):
    #      123 → ['123']          1 token
    #     1234 → ['1234']         1 token
    #    12345 → ['123', '45']    2 tokens (split!)
    #     9999 → ['9999']         1 token
    #    10000 → ['10000']        1 token
    #    99999 → ['999', '99']    2 tokens (split differently!)

GPT-2's tokenizer learned common number patterns from text, so "1234" may be one token while "12345" is split. This inconsistency makes arithmetic unpredictable: the model must learn that the tokens "123" and "45" together denote 12345.
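Later tokenizers make number handling more regular. GPT-4's cl100k_base pre-tokenizer splits digit runs into chunks of at most three digits, so "12345" → ["123", "45"]. LLaMA's SentencePiece tokenizer goes further and splits numbers into individual digits: "12345" → ["1", "2", "3", "4", "5"]. Digit-level splitting costs more tokens per number but gives the model a consistent view of place value, which makes arithmetic more reliable.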
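Two regular digit-splitting policies, fixed-size chunks and single digits, can be compared with a one-line regular expression. This is a simplified stand-in for a pre-tokenization rule, not any tokenizer's actual implementation; the function name and the `group` parameter are hypothetical.

```python
import re

def split_digits(number: str, group: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most `group` digits."""
    return re.findall(rf"\d{{1,{group}}}", number)

for num in ["123", "12345", "1000000"]:
    print(f"{num:>8}  chunks of 3: {split_digits(num)}  single digits: {split_digits(num, group=1)}")
```

Both policies are deterministic, unlike learned merges. Note that left-to-right chunking does not align with place value ("1000000" becomes "100", "000", "0"), whereas single digits always do, at the cost of more tokens.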
8. Special Tokens and Chat Templates
Modern instruction-tuned models use special tokens to structure conversations:
    from transformers import AutoTokenizer

    # LLaMA 3 chat template (each header is followed by a blank line)
    LLAMA3_TEMPLATE = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        "{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

    # Mistral instruct template
    MISTRAL_TEMPLATE = "<s>[INST] {user_message} [/INST]"

    # ChatML format (used by OpenAI, many others)
    CHATML_TEMPLATE = (
        "<|im_start|>system\n"
        "{system_prompt}<|im_end|>\n"
        "<|im_start|>user\n"
        "{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

    def apply_chat_template_example(model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
        """
        Modern HuggingFace tokenizers have apply_chat_template built in.
        This handles the correct format for each model automatically.
        """
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        # apply_chat_template handles model-specific formatting
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,  # adds the assistant turn header
        )
        print(prompt)
        token_ids = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        )
        print(f"Token count: {token_ids.shape[1]}")

9. Tokenization Gotchas for Engineers
1. Leading Space Matters
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    print(tokenizer.tokenize("cat"))   # ['cat']
    print(tokenizer.tokenize(" cat"))  # ['Ġcat'], a different token

Ġ (G with dot above) represents a leading space in GPT-2. Prompting with or without a leading space changes which tokens the model sees and which token is generated first.
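The Ġ character is not arbitrary: it is what GPT-2's byte-to-unicode table assigns to the space byte. The sketch below re-derives that assignment for a single byte, following the same rule as the bytes_to_unicode function shown earlier (printable bytes keep their code point, the rest are numbered in order starting at 256). The helper names are hypothetical.

```python
# Bytes that GPT-2's table leaves unchanged: '!'..'~', '¡'..'¬', '®'..'ÿ'
PRINTABLE = (
    list(range(ord("!"), ord("~") + 1))
    + list(range(0xA1, 0xAD))
    + list(range(0xAE, 0x100))
)

def gpt2_byte_to_char(byte: int) -> str:
    """Character that GPT-2's byte table assigns to one byte value."""
    if byte in PRINTABLE:
        return chr(byte)
    # Remaining bytes are numbered in increasing order and shifted above 255
    offset = sum(1 for b in range(byte) if b not in PRINTABLE)
    return chr(256 + offset)

def visible_form(text: str) -> str:
    """Text as it appears in GPT-2's vocabulary, byte by byte."""
    return "".join(gpt2_byte_to_char(b) for b in text.encode("utf-8"))

print(visible_form(" cat"))
print(repr(gpt2_byte_to_char(ord("\n"))))
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 288, which is Ġ. The same rule maps a newline to Ċ, another character that shows up in GPT-2 token dumps.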
2. Case Sensitivity
BERT-base-uncased lowercases everything. BERT-base-cased does not. Using the wrong variant for case-sensitive tasks (named entity recognition) hurts performance.
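What an uncased tokenizer discards can be sketched with the standard library. The function below approximates the normalization applied by uncased BERT models, which lowercase text and, by default, also strip accents; it is an illustration written for this article, not the library's code.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Lowercase, then remove combining accent marks."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for text in ["Apple released a new phone", "I ate an apple", "Café in Paris"]:
    print(f"{text!r} -> {uncased_normalize(text)!r}")
```

After normalization, "Apple" the company and "apple" the fruit are the same string, which removes a capitalization cue that named entity recognition often relies on.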
3. Truncation vs. Chunking
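When text exceeds the model's maximum sequence length, truncation simply drops everything past the limit. Chunking keeps the whole document by splitting it into overlapping windows: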
    from transformers import BatchEncoding

    def chunk_text_for_processing(
        text: str,
        tokenizer,
        max_length: int = 512,
        stride: int = 128,  # tokens shared by consecutive windows
    ) -> BatchEncoding:
        """
        Sliding window chunking with stride for long documents.
        Used in extractive QA (SQuAD-style) to handle long contexts.
        """
        # With return_overflowing_tokens=True, each row of input_ids is one window
        encoded = tokenizer(
            text,
            return_overflowing_tokens=True,
            max_length=max_length,
            stride=stride,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return encoded

4. Token Count Estimation for APIs
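When calling a hosted LLM API, knowing token counts beforehand helps estimate costs. The example here uses tiktoken, which covers OpenAI models only; other providers such as Anthropic use their own tokenizers and token-counting tools: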
    import tiktoken

    def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
        """Count tokens for OpenAI models using tiktoken"""
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))

    def estimate_api_cost(
        prompt: str,
        completion_tokens: int,
        model: str = "gpt-4o",
        # Placeholder prices in USD per 1K tokens; check the provider's current pricing
        input_cost_per_1k: float = 0.005,
        output_cost_per_1k: float = 0.015,
    ) -> dict:
        input_tokens = count_tokens_openai(prompt, model)
        input_cost = (input_tokens / 1000) * input_cost_per_1k
        output_cost = (completion_tokens / 1000) * output_cost_per_1k
        return {
            "input_tokens": input_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": input_tokens + completion_tokens,
            "estimated_cost_usd": round(input_cost + output_cost, 6),
        }

Summary
Tokenization is not a neutral preprocessing step — it encodes assumptions about language, shapes what the model can learn, and determines efficiency across languages. BPE (GPT), WordPiece (BERT), and SentencePiece (LLaMA, T5) each make different tradeoffs. The vocabulary size determines memory cost and sequence efficiency. Fertility metrics reveal which languages are first-class citizens in a model's world. Engineers working with LLMs need to understand tokenization to debug unexpected behavior, estimate costs accurately, and design prompts effectively.