Tokenization Deep Dive

Tokenization sits between raw text and model computation. It's often treated as a preprocessing detail, but tokenization choices have profound consequences: they determine what the model can represent, how efficiently it processes different languages, and how reliably it handles arithmetic. This article goes deep on how tokenizers work and why the choices matter.

1. Why Not Just Use Characters or Words?

Character-Level Tokenization

Pros: No out-of-vocabulary problem; handles any language or character. Cons: Sequences become very long. "Hello world" is 11 characters, but only 2 words. Longer sequences mean more computation (attention is quadratic in sequence length) and harder learning: the model must learn that "h", "e", "l", "l", "o" together form a greeting.
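
To make the length cost concrete, here is a minimal sketch that tokenizes the example string both ways and counts the query-key pairs that full self-attention would score. Treating cost as the square of sequence length is a simplification that ignores constants and model width.

```python
text = "Hello world"

char_tokens = list(text)    # one token per character, including the space
word_tokens = text.split()  # one token per whitespace-separated word


def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by full self-attention for one sequence."""
    return seq_len * seq_len


print(len(char_tokens), attention_pairs(len(char_tokens)))  # 11 121
print(len(word_tokens), attention_pairs(len(word_tokens)))  # 2 4
```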

Word-Level Tokenization

Pros: Short sequences; intuitive. Cons: Enormous vocabulary (English alone has 170,000+ words). Words like "unhappiness", "unhappy", "happily" appear as unrelated tokens — no morphological sharing. New words (proper nouns, slang, technical terms) become [UNK].
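
The out-of-vocabulary problem is easy to reproduce. The sketch below builds a word-level vocabulary from a toy corpus (the corpus and function names are made up for illustration) and shows an unseen word collapsing to [UNK].

```python
def build_word_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an ID to every distinct lowercase word; ID 0 is reserved for [UNK]."""
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its ID, falling back to [UNK] for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


train_corpus = ["she is happy", "he is unhappy"]  # toy training data
word_vocab = build_word_vocab(train_corpus)

print(encode_words("she is unhappy", word_vocab))          # [1, 2, 5]
print(encode_words("she is happily unhappy", word_vocab))  # [1, 2, 0, 5]
```

"happily" maps to ID 0 even though "happy" and "unhappy" are both in the vocabulary, because a word-level vocabulary shares nothing between related forms.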

Subword Tokenization: The Sweet Spot

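Subword methods split words into smaller reusable pieces: "unhappiness" → ["un", "happi", "ness"]. The pieces are learned from corpus statistics, so they often, though not always, line up with morphemes. This balances sequence length against vocabulary coverage and lets related words share pieces.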
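
As a toy illustration of piece sharing, the sketch below segments words greedily, longest match first, against a small hand-written piece set. The piece set and the `segment` helper are assumptions for this example; real tokenizers learn their pieces from data, as the following sections describe.

```python
def segment(word: str, pieces: set[str]) -> list[str]:
    """Greedy longest-match-first split of a word into known pieces.

    Falls back to a single character when no known piece matches.
    """
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one is in the piece set.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # no match: emit one character
            i += 1
    return out


toy_pieces = {"un", "happi", "happy", "ness", "ly"}  # hypothetical vocabulary

for w in ["unhappiness", "unhappy", "happily"]:
    print(w, segment(w, toy_pieces))
# unhappiness ['un', 'happi', 'ness']
# unhappy ['un', 'happy']
# happily ['happi', 'ly']
```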

2. Byte-Pair Encoding (BPE) — GPT Family

The Algorithm

BPE starts with a character vocabulary and iteratively merges the most frequent adjacent pair:

Python

from collections import defaultdict
import re

def get_vocab(corpus: list[str]) -> dict:
    """Initialize vocab with character-level split + end-of-word marker"""
    vocab = defaultdict(int)
    for sentence in corpus:
        for word in sentence.split():
            # Store each word as space-separated characters plus an end-of-word marker
            vocab[' '.join(word) + ' </w>'] += 1
    return dict(vocab)

def get_stats(vocab: dict) -> dict:
    """Count frequency of adjacent pairs"""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return dict(pairs)

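def merge_vocab(pair: tuple, vocab: dict) -> dict:
    """Merge the given pair wherever it appears as two whole adjacent symbols"""
    new_vocab = {}
    bigram = re.escape(' '.join(pair))
    # Lookarounds pin the match to symbol boundaries, so merging ('a', 'b')
    # cannot fuse the tail of a longer symbol such as 'xa' with a following 'b'
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    replacement = ''.join(pair)
    for word in vocab:
        new_word = pattern.sub(lambda _: replacement, word)
        new_vocab[new_word] = vocab[word]
    return new_vocab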

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple]:
    vocab = get_vocab(corpus)
    merges = []

    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        merges.append(best_pair)
        print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)}")

    return merges

# Example
corpus = [
    "low lower lowest",
    "new newer newest",
    "low new lower newer lowest newest",
]
merges = train_bpe(corpus, num_merges=10)
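
Training only produces the merge list; encoding new text means replaying those merges. The sketch below does this for a single word, assuming a short hand-written merge list rather than the output of the training run above. Production tokenizers typically pick the lowest-ranked applicable pair at each step, which this simple in-order replay approximates.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment one word by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the pair when both halves sit next to each other.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, written by hand for illustration
example_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(apply_merges("lower", example_merges))   # ['low', 'er</w>']
print(apply_merges("slower", example_merges))  # ['s', 'low', 'er</w>']
```

"slower" never appeared in the toy corpus, yet it still decomposes into known pieces plus one leftover character.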

GPT-2's Byte-Level BPE

GPT-2 took BPE one step further: it operates on raw bytes, not Unicode characters. Every possible byte (0-255) is in the base vocabulary. This means:

No unknown tokens ever (any byte sequence is representable)
Handles all languages, emoji, code, binary data
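No Unicode normalization or language-specific preprocessing is required, although GPT-2 still applies a regex pre-tokenization step that splits text into chunks before any merges run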

Python

# GPT-2 uses a mapping from raw bytes to unicode "safe" characters
# This allows BPE to operate on text without special handling

def bytes_to_unicode():
    """
    Returns a dict mapping every byte value (0-255) to a printable unicode character.
    GPT-2 maps raw bytes to printable unicode characters for BPE.
    """
    bs = list(range(ord('!'), ord('~') + 1)) + \
         list(range(ord('¡'), ord('¬') + 1)) + \
         list(range(ord('®'), ord('ÿ') + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# The result: every byte maps to a printable character
# BPE then operates on these "safe" characters
byte_map = bytes_to_unicode()
print(f"Total byte mappings: {len(byte_map)}")  # 256
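
One consequence of working on bytes is that non-ASCII characters start out as several base symbols before any merge applies. A minimal sketch using only Python's built-in UTF-8 encoder:

```python
samples = ["a", "\u00e9", "中", "🙂"]  # "\u00e9" is the precomposed character é
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]

# Any string round-trips through its byte sequence, so no input is ever "unknown".
text = "caf\u00e9 中文 🙂"
raw = text.encode("utf-8")
print(all(0 <= b <= 255 for b in raw))  # True
print(raw.decode("utf-8") == text)      # True
```

This is also why scripts that are rare in the training corpus fragment badly: with few learned merges, each character can cost two to four tokens.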

3. WordPiece — BERT

How WordPiece Differs from BPE

WordPiece also builds subword units iteratively, but instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data:

score(A, B) = freq(AB) / (freq(A) * freq(B))

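The logarithm of this score equals pointwise mutual information up to an additive constant, so ranking pairs by the score ranks them by PMI. WordPiece prefers pairs that appear together more often than their individual frequencies would predict.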
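
The difference from BPE shows up as soon as a rare symbol almost always occurs in one pairing. The counts below are invented for illustration; the sketch applies the score formula above and compares the pair each method would merge first.

```python
# Hypothetical corpus counts
unit_freq = {"t": 200, "h": 100, "q": 10, "u": 40}
pair_freq = {("t", "h"): 50, ("q", "u"): 10}


def wordpiece_score(pair: tuple[str, str]) -> float:
    """freq(AB) / (freq(A) * freq(B)) for one candidate pair."""
    a, b = pair
    return pair_freq[pair] / (unit_freq[a] * unit_freq[b])


bpe_choice = max(pair_freq, key=pair_freq.get)         # most frequent pair
wordpiece_choice = max(pair_freq, key=wordpiece_score)  # highest-scoring pair

print(bpe_choice)        # ('t', 'h')
print(wordpiece_choice)  # ('q', 'u')
```

"th" is more frequent, but "q" is almost always followed by "u", so the normalized score ranks that pair higher.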

Key difference: WordPiece uses ## to mark continuation tokens. Illustrative splits (the exact pieces depend on the learned vocabulary):

"playing" → ["play", "##ing"]
"unbelievable" → ["un", "##believ", "##able"]

Python

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

texts = [
    "The transformer architecture revolutionized NLP.",
    "Tokenization affects model performance significantly.",
    "Antidisestablishmentarianism is a long word.",
]

for text in texts:
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.encode(text)
    print(f"Text: {text[:50]}")
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Token IDs: {ids[:10]}...")
    print()
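
Going the other way, the ## markers are what make WordPiece output reversible at the word level. The sketch below is a simplified joiner written for this article, not the library's decoder; it does not restore punctuation spacing or the original casing.

```python
def join_wordpieces(tokens: list[str]) -> str:
    """Rebuild text from WordPiece tokens: '##' pieces attach to the previous token."""
    words: list[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation: glue onto the current word
        else:
            words.append(tok)     # no marker: start a new word
    return " ".join(words)


print(join_wordpieces(["play", "##ing"]))                            # playing
print(join_wordpieces(["un", "##believ", "##able", "play", "##ing"]))  # unbelievable playing
```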

The `[CLS]` and `[SEP]` Convention

BERT adds special tokens:

[CLS] at the start — its representation is used for classification
[SEP] between and after sequences
[MASK] for masked language modeling targets

In bert-base-uncased these are vocabulary entries 101, 102, and 103 respectively; entry 0 is [PAD].
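
For a sentence pair, the special tokens and the segment IDs line up as follows. This is a hand-rolled sketch of the layout (the helper name is made up); in practice the tokenizer builds it for you.

```python
def build_bert_pair(tokens_a: list[str], tokens_b: list[str]) -> tuple[list[str], list[int]]:
    """Lay out [CLS] A [SEP] B [SEP] with segment ID 0 for the first part and 1 for the second."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


pair_tokens, pair_segments = build_bert_pair(["is", "it", "raining"], ["yes"])
print(pair_tokens)    # ['[CLS]', 'is', 'it', 'raining', '[SEP]', 'yes', '[SEP]']
print(pair_segments)  # [0, 0, 0, 0, 0, 1, 1]
```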

4. SentencePiece — T5, LLaMA, Gemma

Unigram Language Model Tokenization

SentencePiece (Kudo & Richardson, 2018) provides two algorithms: BPE (as above) and Unigram LM. Unigram LM:

Starts with a large candidate vocabulary
Removes the tokens whose removal least increases the loss of a unigram language model over the corpus
Repeats until the target vocabulary size is reached

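The unigram approach assigns a probability to every candidate segmentation. Decoding normally picks the most probable one, but segmentations can also be sampled during training (subword regularization), so the same word may be split differently across training examples.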
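
Finding the most probable segmentation under a unigram model is a shortest-path problem, usually solved with Viterbi search. The sketch below assumes a tiny hand-written probability table; a trained model would supply these values.

```python
import math


def viterbi_segment(text: str, probs: dict[str, float]) -> tuple[list[str], float]:
    """Return the highest-probability segmentation of text and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of any split of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start] > -math.inf:
                score = best[start] + math.log(probs[piece])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the string
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]


# Hypothetical piece probabilities
toy_probs = {"un": 0.1, "happi": 0.05, "ness": 0.1, "happ": 0.02, "iness": 0.02}
toy_probs.update({ch: 0.01 for ch in "unhapies"})  # single-character fallbacks

best_pieces, best_logp = viterbi_segment("unhappiness", toy_probs)
print(best_pieces)  # ['un', 'happi', 'ness']
```

The alternative split ["un", "happ", "iness"] is also valid under this table but scores lower (0.1 × 0.02 × 0.02 versus 0.1 × 0.05 × 0.1).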

Why SentencePiece for Multilingual Models

SentencePiece operates directly on raw Unicode without language-specific preprocessing (no word splitting required). This makes it naturally language-agnostic.

Python

import sentencepiece as spm

def train_sentencepiece_tokenizer(
    corpus_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = "bpe",  # or "unigram"
):
    spm.SentencePieceTrainer.train(
        input=corpus_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=0.9995,  # suggested for large character sets (e.g. Chinese, Japanese); 1.0 for small ones
        pad_id=0,
        unk_id=1,
        bos_id=2,  # beginning of sentence
        eos_id=3,  # end of sentence
        # LLaMA-style: byte fallback for unknown chars
        byte_fallback=True,
        normalization_rule_name="identity",  # no NFKC normalization
    )

def load_and_tokenize(model_file: str, texts: list[str]):
    sp = spm.SentencePieceProcessor()
    sp.load(model_file)

    for text in texts:
        pieces = sp.encode(text, out_type=str)
        ids = sp.encode(text, out_type=int)
        print(f"Text: {text[:60]}")
        print(f"Pieces: {pieces}")
        print(f"IDs: {ids}")
        print(f"Decoded: {sp.decode(ids)}")
        print()

The Space/Underscore Token Mystery

In SentencePiece, spaces are represented as ▁ (U+2581, LOWER ONE EIGHTH BLOCK). This is why LLaMA tokenizer outputs look like:

["▁The", "▁cat", "▁sat", "▁on", "▁the", "▁mat"]

The ▁ prefix means "this token starts a new word." This encodes word boundaries without needing explicit whitespace tokens. When you decode, ▁ is replaced with a space.
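
The marker convention can be imitated in a few lines. This is a simplified sketch of the idea, assuming the default behavior of prefixing the text with a marker; it is not SentencePiece's own implementation.

```python
META = "\u2581"  # ▁, the SentencePiece whitespace marker


def mark_spaces(text: str) -> str:
    """Prefix the text with the marker and replace every space with it."""
    return META + text.replace(" ", META)


def pieces_to_text(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading space."""
    joined = "".join(pieces).replace(META, " ")
    return joined[1:] if joined.startswith(" ") else joined


marked = mark_spaces("The cat sat")
print(marked)  # ▁The▁cat▁sat
print(pieces_to_text([META + "The", META + "cat", META + "sat"]))  # The cat sat
```

Because whitespace is carried inside the pieces, decoding is plain concatenation and the round trip is lossless.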

5. Vocabulary Size: Impact on Model Quality

The Tradeoff

| Vocabulary Size | Sequence Length | Coverage | Parameter Cost |
|-----------------|-----------------|----------|----------------|
| 8,000 | Long | Low | Small |
| 32,000 | Medium | Good | Medium |
| 50,257 | Short | Excellent | Large |
| 100,000+ | Shorter | Excellent | Very Large |

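The embedding matrix has shape (vocab_size, d_model). For GPT-3 (d_model=12288, vocab=50,257), that's about 617M parameters in embeddings alone. Against 175B total parameters this is only about 0.35%, but the share grows quickly as models shrink: GPT-2 small (d_model=768, roughly 124M parameters) spends about 38.6M parameters, around 31% of the model, on the same vocabulary. Larger vocabularies consume more memory for embeddings and enlarge the output softmax.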
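
The arithmetic is worth checking by hand. The sketch below multiplies out the embedding shapes and divides by commonly cited total parameter counts (175B for GPT-3 and roughly 124M for GPT-2 small, both treated here as approximate).

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameter count of a (vocab_size, d_model) embedding matrix."""
    return vocab_size * d_model


gpt3_embed = embedding_params(50_257, 12_288)
gpt2_small_embed = embedding_params(50_257, 768)

# Approximate total parameter counts, used only to estimate the share
gpt3_share = gpt3_embed / 175e9
gpt2_small_share = gpt2_small_embed / 124e6

print(f"{gpt3_embed:,} ({gpt3_share:.2%})")              # 617,558,016 (0.35%)
print(f"{gpt2_small_embed:,} ({gpt2_small_share:.0%})")  # 38,597,376 (31%)
```

The same vocabulary is a rounding error in a very large model and nearly a third of a small one, which is why vocabulary size matters most when models are small.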

Typical Vocabulary Sizes

| Model | Tokenizer | Vocab Size |
|-------|-----------|------------|
| BERT | WordPiece | 30,522 |
| GPT-2 | BPE | 50,257 |
| GPT-3 | BPE | 50,257 |
| GPT-4 (cl100k) | BPE | 100,277 |
| T5 | SentencePiece | 32,100 |
| LLaMA | SentencePiece | 32,000 |
| LLaMA 3 | BPE (tiktoken) | 128,256 |
| Gemma | SentencePiece | 256,128 |

Why GPT-4 Doubled the Vocabulary

The cl100k_base tokenizer (GPT-4, GPT-3.5-turbo) has 100,277 tokens vs 50,257 for GPT-2. The larger vocabulary:

Reduces sequence length (more efficient attention)
Better handles code (common identifiers are single tokens)
Better handles non-English languages
Treats numbers differently: digit runs are split into chunks of at most three digits, so numbers tokenize more predictably

6. Fertility: Tokens Per Word Across Languages

Fertility measures how many tokens are needed to encode a word on average. Lower is better — it means shorter sequences and more efficient computation.

Python

from transformers import AutoTokenizer

def measure_fertility(text_samples: dict[str, str], tokenizer_name: str):
    """
    Measure tokens-per-word ratio across languages.
    Lower = more efficient = better for that language.
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    results = {}
    for lang, text in text_samples.items():
        words = len(text.split())
        tokens = tokenizer.encode(text, add_special_tokens=False)
        fertility = len(tokens) / max(words, 1)
        results[lang] = {
            "words": words,
            "tokens": len(tokens),
            "fertility": round(fertility, 2),
        }
    return results

# GPT-2 tokenizer is heavily biased toward English
text_samples = {
    "English":    "The quick brown fox jumps over the lazy dog",
    "French":     "Le renard brun rapide saute par-dessus le chien paresseux",
    "German":     "Der schnelle braune Fuchs springt über den faulen Hund",
    "Arabic":     "الثعلب البني السريع يقفز فوق الكلب الكسول",
    "Chinese":    "快速的棕色狐狸跳过懒狗",
    "Hindi":      "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है",
    "Python":     "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
}

# Rough expectations with the GPT-2 tokenizer (indicative, not measured here;
# newer vocabularies such as cl100k_base fragment non-English text less):
# English: ~1.3 tokens/word (near-optimal for GPT-2)
# French: ~1.5-1.8 tokens/word
# German: ~1.4-1.6 tokens/word
# Arabic: ~4-6 tokens/word (severe fragmentation)
# Chinese: text.split() sees the unspaced sentence as a single "word", so
#          compare tokens per character instead (characters map to multiple tokens)
# Hindi: ~4-8 tokens/word (Devanagari severely underrepresented)

Practical Consequence

A model with an English-optimized tokenizer "wastes" tokens on non-English text:

An Arabic document can take roughly 5x as many tokens as the equivalent English text
Roughly 5x less content then fits in the context window
Compute per document rises at least 5x, and more where attention cost grows quadratically
The result is higher cost and effectively worse performance for Arabic speakers

Multilingual models like mBERT, mT5, and BLOOM use vocabulary sizes of 100K-250K to improve fertility across languages.
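
A quick way to feel the effect is to convert fertility into context-window capacity. The window size below is a hypothetical 8,192 tokens, and the fertility values are the rough figures quoted earlier rather than measurements.

```python
def words_that_fit(context_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in a context window at a given fertility."""
    return int(context_tokens / fertility)


CONTEXT_TOKENS = 8_192  # hypothetical window size

english_words = words_that_fit(CONTEXT_TOKENS, fertility=1.3)
arabic_words = words_that_fit(CONTEXT_TOKENS, fertility=5.0)

print(english_words)  # 6301
print(arabic_words)   # 1638
```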

7. How Tokenization Affects Arithmetic

One of the most surprising effects: GPT-2's tokenizer makes arithmetic harder.

Python

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

numbers_to_test = ["123", "1234", "12345", "9999", "10000", "99999"]
for num in numbers_to_test:
    tokens = tokenizer.tokenize(num)
    print(f"{num:>8} → {tokens}")

# Illustrative output (exact splits depend on the learned merges; run the snippet to check):
# 123    → ['123']          — 1 token
# 1234   → ['1234']         — 1 token
# 12345  → ['123', '45']    — 2 tokens (split!)
# 9999   → ['9999']         — 1 token
# 10000  → ['10000']        — 1 token
# 99999  → ['999', '99']    — 2 tokens (split differently!)

GPT-2's tokenizer learned common number patterns from text, so "1234" is one token but "12345" might be split. This inconsistency makes arithmetic unpredictable — the model must learn that "123" + "45" in token space = "12345".

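GPT-4's cl100k_base tokenizer limits digit runs to chunks of at most three digits during pre-tokenization: "12345" → ["123", "45"]. Splits are then predictable from the digits alone rather than from which numbers happened to be frequent in the training corpus. Some models go further: LLaMA splits every number into individual digits ("12345" → ["1", "2", "3", "4", "5"]), which is less efficient for simple number display but makes arithmetic more regular.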
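
Two deterministic number-splitting policies are easy to compare side by side. The sketch below implements both with the standard library: fixed left-to-right chunks of up to three digits, and one token per digit. The function names are made up for this example.

```python
import re


def chunk_digits(number: str, width: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most `width` digits."""
    return re.findall(rf"\d{{1,{width}}}", number)


def split_digits(number: str) -> list[str]:
    """One token per digit."""
    return list(number)


for n in ["999", "1000", "12345", "1234567"]:
    print(n, chunk_digits(n), split_digits(n))
# 999 ['999'] ['9', '9', '9']
# 1000 ['100', '0'] ['1', '0', '0', '0']
# 12345 ['123', '45'] ['1', '2', '3', '4', '5']
# 1234567 ['123', '456', '7'] ['1', '2', '3', '4', '5', '6', '7']
```

Left-to-right chunking is consistent but not aligned to place value: the chunk "100" in "1000" stands for one thousand, not one hundred. Per-digit splitting avoids that at the cost of longer sequences.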

8. Special Tokens and Chat Templates

Modern instruction-tuned models use special tokens to structure conversations:

Python

# LLaMA 3 chat template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

# Mistral instruct template
MISTRAL_TEMPLATE = "<s>[INST] {user_message} [/INST]"

# ChatML format (used by OpenAI, many others)
CHATML_TEMPLATE = """<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
"""

from transformers import AutoTokenizer

def apply_chat_template_example(model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
    """
    Modern HuggingFace tokenizers have apply_chat_template built in.
    This handles the correct format for each model automatically.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]

    # apply_chat_template handles model-specific formatting
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True  # adds the assistant turn header
    )
    print(prompt)

    token_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    )
    print(f"Token count: {token_ids.shape[1]}")
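
When no tokenizer is at hand, the ChatML layout is simple enough to render directly. The sketch below is a hand-written renderer for that one format (the function name is an assumption); for real models, prefer the template shipped with the tokenizer, since the markers must be encoded as special tokens rather than plain text.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages in ChatML, optionally opening the assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
rendered = render_chatml(chat)
print(rendered)
```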

9. Tokenization Gotchas for Engineers

1. Leading Space Matters

Python

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("cat"))    # ['cat']
print(tokenizer.tokenize(" cat"))   # ['Ġcat'] — different token!

Ġ (G with dot above) represents a leading space in GPT-2. Prompting with or without a leading space changes which token is generated first.
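
The leading-space behavior comes from pre-tokenization, which attaches each space to the word that follows it. The sketch below uses a deliberately simplified pattern to show the effect; GPT-2's actual pattern is more elaborate (it also handles contractions and uses Unicode categories).

```python
import re

# Simplified stand-in for GPT-2's pre-tokenization pattern
SIMPLE_SPLIT = re.compile(r" ?\w+| ?[^\w\s]+|\s+")
SPACE_MARK = "\u0120"  # Ġ


def pretokenize(text: str) -> list[str]:
    """Split text into chunks, keeping each leading space attached and shown as Ġ."""
    return [chunk.replace(" ", SPACE_MARK) for chunk in SIMPLE_SPLIT.findall(text)]


print(pretokenize("cat"))           # ['cat']
print(pretokenize(" cat"))          # ['Ġcat']
print(pretokenize("the cat sat."))  # ['the', 'Ġcat', 'Ġsat', '.']
```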

2. Case Sensitivity

BERT-base-uncased lowercases everything. BERT-base-cased does not. Using the wrong variant for case-sensitive tasks (named entity recognition) hurts performance.

3. Truncation vs. Chunking

When your text exceeds the maximum sequence length:

Python

def chunk_text_for_processing(
    text: str,
    tokenizer,
    max_length: int = 512,
    stride: int = 128  # overlap for context continuity
):
    """
    Sliding window chunking with stride for long documents.
    Used in extractive QA (SQuAD-style) to handle long contexts.
    With a fast tokenizer, returns a BatchEncoding with one row per chunk.
    """
    tokens = tokenizer(
        text,
        return_overflowing_tokens=True,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    return tokens
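
The windowing itself is independent of any library. The sketch below applies the same idea to a plain list of token IDs, treating stride as the number of tokens shared between consecutive windows; it ignores special tokens and padding, which the tokenizer call above handles for you.

```python
def sliding_windows(token_ids: list[int], max_length: int, stride: int) -> list[list[int]]:
    """Split token_ids into windows of at most max_length that overlap by stride tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


example_ids = list(range(10))
for window in sliding_windows(example_ids, max_length=4, stride=2):
    print(window)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```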

4. Token Count Estimation for APIs

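When calling OpenAI or Anthropic APIs, knowing token counts beforehand helps estimate costs. tiktoken covers OpenAI models only; other providers use their own tokenizers, so counts do not transfer between them: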

Python

import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for OpenAI models using tiktoken"""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_api_cost(
    prompt: str,
    completion_tokens: int,
    model: str = "gpt-4o",
    input_cost_per_1k: float = 0.005,   # placeholder price; check current pricing
    output_cost_per_1k: float = 0.015,  # placeholder price; check current pricing
) -> dict:
    input_tokens = count_tokens_openai(prompt, model)
    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (completion_tokens / 1000) * output_cost_per_1k

    return {
        "input_tokens": input_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": input_tokens + completion_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6),
    }

Summary

Tokenization is not a neutral preprocessing step — it encodes assumptions about language, shapes what the model can learn, and determines efficiency across languages. BPE (GPT), WordPiece (BERT), and SentencePiece (LLaMA, T5) each make different tradeoffs. The vocabulary size determines memory cost and sequence efficiency. Fertility metrics reveal which languages are first-class citizens in a model's world. Engineers working with LLMs need to understand tokenization to debug unexpected behavior, estimate costs accurately, and design prompts effectively.
