Tokenization: From Text to Tokens

Why Tokenization?

Transformers operate on discrete integer sequences, not raw text. Tokenization maps text → a sequence of integer IDs from a fixed vocabulary. The vocabulary size (typically 32k–100k tokens) is a hyperparameter that trades off:

Larger vocabulary: Fewer tokens per sentence (faster inference, more context), more embedding parameters, better handling of rare words
Smaller vocabulary: More tokens per sentence, fewer parameters, more OOV (out-of-vocabulary) risk

Byte Pair Encoding (BPE)

BPE is the algorithm behind GPT-2/3/4, LLaMA, and Mistral tokenizers.

Training algorithm:

Start with a character-level vocabulary (all unique characters in the corpus)
Count all adjacent pair frequencies
Merge the most frequent pair into a new token
Repeat until vocabulary reaches target size

Python

def train_bpe(corpus: str, vocab_size: int) -> list[tuple[str, str]]:
    """Simplified BPE training — returns list of merge rules."""
    # Initialize: each character is a token, words split into chars + </w>
    vocab = {}
    for word in corpus.split():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = vocab.get(chars, 0) + 1

    merges = []

    for _ in range(vocab_size):
        # Count adjacent pairs
        pair_counts = {}
        for word_tokens, count in vocab.items():
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + count

        if not pair_counts:
            break

        # Find and merge most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)

        # Apply merge to all words
        new_vocab = {}
        merged = "".join(best_pair)
        for word_tokens, count in vocab.items():
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == best_pair:
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            new_vocab[tuple(new_tokens)] = count
        vocab = new_vocab

    return merges

Using HuggingFace Tokenizers

Python

from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

text = "Warfarin inhibits VKOR enzyme, reducing vitamin K-dependent clotting."

# Encode: text → token IDs
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode back
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# See the actual token strings
token_strings = tokenizer.convert_ids_to_tokens(tokens)
print(f"Tokens: {token_strings}")
# ['▁War', 'farin', '▁inhibits', '▁V', 'KOR', '▁enzyme', ...]

# Special tokens
print(f"BOS: {tokenizer.bos_token_id}")  # Beginning of sequence
print(f"EOS: {tokenizer.eos_token_id}")  # End of sequence
print(f"PAD: {tokenizer.pad_token_id}")  # Padding

WordPiece (BERT)

WordPiece is similar to BPE but uses a different merge criterion — it maximizes the likelihood of the training data rather than frequency of pairs:

Python

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece uses ## prefix for continuation subwords
tokens = tokenizer.tokenize("unbelievable")
print(tokens)  # ['un', '##believe', '##able']

# Unknown words fall back to characters
tokens = tokenizer.tokenize("xylyzptlk")
print(tokens)  # ['xy', '##ly', '##z', '##pt', '##l', '##k']

Key BERT tokens: [CLS] (prepended to every sequence), [SEP] (separates segments), [MASK] (for masked language modeling), [UNK] (unknown token).

SentencePiece (T5, LLaMA)

SentencePiece tokenizes raw bytes directly — no language-specific pre-tokenization needed. This makes it language-agnostic:

Python

import sentencepiece as spm

# Train (on your own data)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="mymodel",
    vocab_size=32000,
    character_coverage=0.9995,  # covers 99.95% of characters in training data
    model_type="bpe",
)

# Use
sp = spm.SentencePieceProcessor()
sp.load("mymodel.model")

text = "warfarin therapy"
ids = sp.encode(text)
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁war', 'far', 'in', '▁therapy']

The ▁ (underscore) prefix marks tokens that begin a new word (preceded by a space in the original text).

Tokenizer Gotchas

Numbers are tokenized inefficiently:

Python

tokenizer.tokenize("12345")  # Often ['1', '2', '3', '4', '5'] — 5 tokens
tokenizer.tokenize("2024")   # Might be ['20', '24'] or ['2024'] depending on vocabulary

This is why arithmetic is hard for LLMs — "1234 + 5678" might be many tokens, and the model must do digit-by-digit arithmetic across token boundaries.

Whitespace matters:

Python

tokenizer.encode("hello")   # Different from
tokenizer.encode(" hello")  # Leading space → different token!

Most modern tokenizers (GPT, LLaMA) treat " hello" and "hello" as different tokens. Careful with prompt formatting.

Case sensitivity:

Python

# GPT-4o and LLaMA are case-sensitive at the token level
tokenizer.encode("Warfarin")   # != tokenizer.encode("warfarin")

Language imbalance: Tokenizers trained primarily on English are inefficient for other languages. "Hello" might be 1 token; its Turkish equivalent might be 3–4 tokens, meaning Turkish users get less context per request.

Counting Tokens for Cost Estimation

Python

import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def estimate_cost(prompt: str, expected_output_tokens: int = 500) -> dict:
    input_tokens = count_tokens(prompt)
    total_tokens = input_tokens + expected_output_tokens

    # GPT-4o pricing (approximate)
    input_cost = input_tokens / 1_000_000 * 2.50   # $2.50 per 1M input tokens
    output_cost = expected_output_tokens / 1_000_000 * 10.00  # $10 per 1M output tokens

    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "estimated_cost_usd": input_cost + output_cost,
    }

prompt = "Explain warfarin's mechanism of action in detail."
print(estimate_cost(prompt))

Vocabulary Design Decisions

| Decision | Trade-off | |---|---| | Larger vocabulary (100k+) | Fewer tokens per text, more embedding params, better rare-word coverage | | Smaller vocabulary (32k) | More tokens per text, fewer params, faster training, more OOV handling | | Byte fallback | Handles any Unicode — nothing is truly OOV | | Language-specific tokens | More efficient for target language, less portable | | Number tokenization | Token-per-digit vs merged numbers — affects arithmetic ability |

GPT-4 uses ~100k tokens (tiktoken cl100k_base). LLaMA 3 uses 128k tokens. BERT uses 30k tokens with WordPiece.

The vocabulary size directly affects the embedding table size: vocab_size × d_model parameters. For LLaMA-3-8B (vocab_size=128k, d_model=4096), the embedding table alone is 128k × 4096 × 4 bytes = ~2GB.

Tokenization: From Text to Tokens

Why Tokenization?

Byte Pair Encoding (BPE)

Using HuggingFace Tokenizers

WordPiece (BERT)

SentencePiece (T5, LLaMA)

Tokenizer Gotchas

Counting Tokens for Cost Estimation

Vocabulary Design Decisions

Enjoyed this article?

Leave a comment