Tokenization: From Text to Tokens
How tokenizers convert raw text into token IDs that transformers consume. Covers BPE, WordPiece, SentencePiece, vocabulary design, and tokenizer gotchas.
Why Tokenization?
Transformers operate on discrete integer sequences, not raw text. Tokenization maps text → a sequence of integer IDs from a fixed vocabulary. The vocabulary size (typically 32k–100k tokens) is a hyperparameter that trades off:
- Larger vocabulary: Fewer tokens per sentence (faster inference, more context), more embedding parameters, better handling of rare words
- Smaller vocabulary: More tokens per sentence, fewer parameters, more OOV (out-of-vocabulary) risk
Byte Pair Encoding (BPE)
BPE is the algorithm behind GPT-2/3/4, LLaMA, and Mistral tokenizers.
Training algorithm:
- Start with a character-level vocabulary (all unique characters in the corpus)
- Count all adjacent pair frequencies
- Merge the most frequent pair into a new token
- Repeat until vocabulary reaches target size
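A single pass of the pair-counting step can be sketched on a toy corpus. The word frequencies below are made up for illustration, and `count_pairs` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

def count_pairs(word_freqs: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighting each pair by its word's frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

# Toy corpus (assumed counts): words pre-split into characters plus </w>
toy_corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

pairs = count_pairs(toy_corpus)
best_pair = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
print(best_pair, pairs[best_pair])
```

Several pairs can tie for the highest count, so a real implementation needs a deterministic tie-breaking rule; this sketch simply keeps the first one encountered.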
def train_bpe(corpus: str, vocab_size: int) -> list[tuple[str, str]]:
    """Simplified BPE training: returns the merge rules in the order they were learned."""
    # Initialize: each word is a tuple of characters plus an end-of-word marker </w>
    vocab: dict[tuple[str, ...], int] = {}
    for word in corpus.split():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = vocab.get(chars, 0) + 1

    # Each merge adds one token, so stop once base symbols + merges reach vocab_size
    alphabet = {symbol for word_tokens in vocab for symbol in word_tokens}
    num_merges = max(0, vocab_size - len(alphabet))

    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency
        pair_counts = {}
        for word_tokens, count in vocab.items():
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + count
        if not pair_counts:
            break

        # Find the most frequent pair and record it as a merge rule
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)

        # Apply the merge to every word
        new_vocab = {}
        merged = "".join(best_pair)
        for word_tokens, count in vocab.items():
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i + 1]) == best_pair:
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            new_vocab[tuple(new_tokens)] = count
        vocab = new_vocab
    return merges

Using HuggingFace Tokenizers
from transformers import AutoTokenizer

# Load a pretrained tokenizer (this repository is gated: access must be granted on the Hub first)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "Warfarin inhibits VKOR enzyme, reducing vitamin K-dependent clotting."

# Encode: text → token IDs (special tokens such as BOS may be added automatically)
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode back
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# See the actual token strings
token_strings = tokenizer.convert_ids_to_tokens(tokens)
print(f"Tokens: {token_strings}")
# The exact split depends on the vocabulary. Llama 3's byte-level BPE marks a leading
# space with 'Ġ'; SentencePiece-based tokenizers such as Llama 2's use '▁' instead.

# Special tokens
print(f"BOS: {tokenizer.bos_token_id}")  # Beginning of sequence
print(f"EOS: {tokenizer.eos_token_id}")  # End of sequence
print(f"PAD: {tokenizer.pad_token_id}")  # Padding (None if the tokenizer defines no pad token)

WordPiece (BERT)
WordPiece is similar to BPE but uses a different merge criterion: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data. With the pretrained BERT tokenizer:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece uses a ## prefix for continuation subwords
tokens = tokenizer.tokenize("unbelievable")
print(tokens)  # 'un' followed by '##'-prefixed pieces; the exact split depends on the vocabulary

# Rare strings are split into many small pieces; a word that cannot be
# segmented with the vocabulary at all becomes [UNK]
tokens = tokenizer.tokenize("xylyzptlk")
print(tokens)  # e.g. ['xy', '##ly', '##z', '##pt', '##l', '##k']

Key BERT tokens: [CLS] (prepended to every sequence), [SEP] (separates segments), [MASK] (for masked language modeling), [UNK] (unknown token).
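The difference between the two merge criteria can be sketched with a commonly cited approximation of the WordPiece score: pair frequency divided by the product of the two parts' frequencies. This formula is an assumption about how the likelihood criterion is usually simplified, not Google's original implementation, and the word counts are invented for illustration.

```python
from collections import Counter

def wordpiece_scores(word_freqs: dict[str, int]) -> tuple[Counter, dict]:
    """Return (pair frequencies, approximate WordPiece scores) for character-level pieces."""
    unit_freq = Counter()
    pair_freq = Counter()
    for word, count in word_freqs.items():
        # Non-initial pieces carry the ## continuation prefix
        pieces = [word[0]] + ["##" + ch for ch in word[1:]]
        for piece in pieces:
            unit_freq[piece] += count
        for left, right in zip(pieces, pieces[1:]):
            pair_freq[(left, right)] += count
    # Assumed score: freq(pair) / (freq(left) * freq(right))
    scores = {
        pair: freq / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return pair_freq, scores

# Toy word counts (assumed)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
pair_freq, scores = wordpiece_scores(word_freqs)

bpe_choice = max(pair_freq, key=pair_freq.get)   # most frequent pair
wordpiece_choice = max(scores, key=scores.get)   # highest-scoring pair
print("BPE would merge:", bpe_choice)
print("WordPiece would merge:", wordpiece_choice)
```

On this toy data the two criteria disagree: the frequency rule favors a pair built from very common pieces, while the score favors a rarer pair whose parts almost always occur together.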
SentencePiece (T5, LLaMA)
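SentencePiece works directly on raw text and treats whitespace as an ordinary symbol, so it needs no language-specific pre-tokenization. This makes it language-agnostic (an optional byte fallback covers characters missing from the vocabulary):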
import sentencepiece as spm

# Train (on your own data)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="mymodel",
    vocab_size=32000,
    character_coverage=0.9995,  # covers 99.95% of characters in training data
    model_type="bpe",
)

# Use
sp = spm.SentencePieceProcessor()
sp.load("mymodel.model")
text = "warfarin therapy"
ids = sp.encode(text)
pieces = sp.encode(text, out_type=str)
print(pieces)  # e.g. ['▁war', 'far', 'in', '▁therapy']

The ▁ prefix (U+2581, a block character that resembles an underscore but is not one) marks tokens that begin a new word (preceded by a space in the original text).
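Because the word-boundary marker is stored inside the pieces, decoding needs no language-specific rules. A minimal sketch in plain Python, without the sentencepiece library; `detokenize` is a hypothetical helper, and it assumes the leading marker comes from a dummy prefix rather than a real leading space:

```python
WORD_BOUNDARY = "\u2581"  # the '▁' marker, not an ASCII underscore

def detokenize(pieces: list[str]) -> str:
    """Join pieces, turn each boundary marker back into a space, drop the leading one."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    return text[1:] if text.startswith(" ") else text

pieces = [WORD_BOUNDARY + "war", "far", "in", WORD_BOUNDARY + "therapy"]
print(detokenize(pieces))
```

In practice `sp.decode(...)` does this for you; the sketch only shows why the round trip is lossless.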
Tokenizer Gotchas
Numbers are tokenized inefficiently:

tokenizer.tokenize("12345")  # Often ['1', '2', '3', '4', '5'] (5 tokens)
tokenizer.tokenize("2024")   # Might be ['20', '24'] or ['2024'] depending on vocabulary

This is one reason arithmetic is hard for LLMs: "1234 + 5678" might be many tokens, and the chunks do not line up with place values, so the model must carry digits across token boundaries.
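To see why chunk boundaries hurt, here is a minimal sketch that assumes a tokenizer pre-splits digit runs left to right into groups of at most three. That rule is an assumption for illustration; real tokenizers differ, and `chunk_digits` is a hypothetical helper.

```python
import re

def chunk_digits(number: str, width: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most `width` digits."""
    return re.findall(rf"\d{{1,{width}}}", number)

for value in ["1234", "5678", str(1234 + 5678)]:
    print(value, "->", chunk_digits(value))

# Per-digit tokenization is the width=1 special case
print(chunk_digits("12345", width=1))
```

With left-to-right chunking the final chunk holds the ones digit alone, so the units, tens, and hundreds of each operand land in different positions within different tokens.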
Whitespace matters:

tokenizer.encode("hello")   # Different from
tokenizer.encode(" hello")  # Leading space → different token!

Most modern tokenizers (GPT, LLaMA) treat " hello" and "hello" as different tokens. Be careful with prompt formatting: a stray leading or trailing space changes the token sequence the model sees.
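The effect can be reproduced without downloading a model. The sketch below uses a made-up toy vocabulary and greedy longest-match encoding; both are assumptions for illustration, since real BPE tokenizers apply learned merge rules and have very different IDs.

```python
# Hypothetical toy vocabulary: the IDs are invented for illustration only
TOY_VOCAB = {" hello": 0, "hello": 1, "Hello": 2, " ": 3,
             "h": 4, "e": 5, "l": 6, "o": 7, "H": 8}

def greedy_encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Encode by repeatedly taking the longest vocabulary entry that matches."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(greedy_encode("hello", TOY_VOCAB))
print(greedy_encode(" hello", TOY_VOCAB))
print(greedy_encode("Hello", TOY_VOCAB))
print(greedy_encode("hello hello", TOY_VOCAB))
```

The same word maps to a different ID depending on whether it starts the string, follows a space, or is capitalized, which is the behavior to keep in mind when concatenating prompt fragments.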
Case sensitivity:

# GPT-4o and LLaMA are case-sensitive at the token level
tokenizer.encode("Warfarin")  # != tokenizer.encode("warfarin")

Language imbalance: Tokenizers trained primarily on English are inefficient for other languages. "Hello" might be 1 token; its Turkish equivalent might be 3–4 tokens, meaning Turkish users fit less text into the same context window.
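Part of the imbalance is visible before any merges are learned. For a byte-level tokenizer, the UTF-8 byte count is an upper bound on the token count, and non-ASCII characters need two or more bytes each. A minimal sketch (the sample strings are illustrative choices, and `utf8_length` is a hypothetical helper):

```python
def utf8_length(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = ["Hello", "Merhaba", "g\u00fcnayd\u0131n", "\u3053\u3093\u306b\u3061\u306f"]
for sample in samples:
    chars, n_bytes = utf8_length(sample)
    print(f"{sample!r}: {chars} chars, {n_bytes} bytes")
```

How many tokens a string actually costs depends on which byte sequences the training corpus made frequent enough to merge, so the byte count is only the worst case.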
Counting Tokens for Cost Estimation
import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def estimate_cost(prompt: str, expected_output_tokens: int = 500) -> dict:
    input_tokens = count_tokens(prompt)
    total_tokens = input_tokens + expected_output_tokens
    # GPT-4o pricing (approximate; check current rates before relying on these numbers)
    input_cost = input_tokens / 1_000_000 * 2.50  # $2.50 per 1M input tokens
    output_cost = expected_output_tokens / 1_000_000 * 10.00  # $10 per 1M output tokens
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "total_tokens": total_tokens,
        "estimated_cost_usd": input_cost + output_cost,
    }

prompt = "Explain warfarin's mechanism of action in detail."
print(estimate_cost(prompt))

Vocabulary Design Decisions
| Decision | Trade-off |
|---|---|
| Larger vocabulary (100k+) | Fewer tokens per text, more embedding params, better rare-word coverage |
| Smaller vocabulary (32k) | More tokens per text, fewer params, faster training, more OOV handling |
| Byte fallback | Handles any Unicode: nothing is truly OOV |
| Language-specific tokens | More efficient for target language, less portable |
| Number tokenization | Token-per-digit vs merged numbers: affects arithmetic ability |
GPT-4 uses ~100k tokens (tiktoken cl100k_base); GPT-4o uses ~200k (o200k_base). LLaMA 3 uses 128k tokens. BERT uses 30k tokens with WordPiece.
The vocabulary size directly affects the embedding table size: vocab_size × d_model parameters. For LLaMA-3-8B (vocab_size=128k, d_model=4096), the embedding table alone holds about 525M parameters: 128k × 4096 × 4 bytes ≈ 2 GB in fp32, or about 1 GB in bf16.
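The arithmetic can be checked with a few lines. This sketch assumes the exact vocabulary size behind "128k" is 128,256 and counts only the input embedding table; `embedding_table_bytes` is a hypothetical helper.

```python
def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int) -> int:
    """Memory for a vocab_size x d_model embedding table at a given precision."""
    return vocab_size * d_model * bytes_per_param

VOCAB_SIZE = 128_256  # assumed exact size behind the "128k" figure
D_MODEL = 4096
params = VOCAB_SIZE * D_MODEL

for name, width in [("fp32", 4), ("bf16", 2)]:
    size = embedding_table_bytes(VOCAB_SIZE, D_MODEL, width)
    print(f"{name}: {params:,} params, {size / 1e9:.2f} GB")
```

If the output projection is not tied to the input embeddings, it adds a second matrix of the same shape.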