Tokenization: From Text to Tokens

How tokenizers convert raw text into token IDs that transformers consume. Covers BPE, WordPiece, SentencePiece, vocabulary design, and tokenizer gotchas.

Asma Hafeez Khan, May 16, 2026, 5 min read
Tags: Transformers, Tokenization, BPE, NLP

Why Tokenization?

Transformers operate on discrete integer sequences, not raw text. Tokenization maps text to a sequence of integer IDs from a fixed vocabulary. The vocabulary size (typically 32k–128k tokens) is a hyperparameter that trades off:

  • Larger vocabulary: Fewer tokens per sentence (faster inference, more context), more embedding parameters, better handling of rare words
  • Smaller vocabulary: More tokens per sentence, fewer parameters, more OOV (out-of-vocabulary) risk

Byte Pair Encoding (BPE)

BPE is the algorithm behind GPT-2/3/4, LLaMA, and Mistral tokenizers.

Training algorithm:

  1. Start with a character-level vocabulary (all unique characters in the corpus)
  2. Count all adjacent pair frequencies
  3. Merge the most frequent pair into a new token
  4. Repeat until vocabulary reaches target size
Python
def train_bpe(corpus: str, vocab_size: int) -> list[tuple[str, str]]:
    """Simplified BPE training — returns list of merge rules."""
    # Initialize: each character is a token, words split into chars + </w>
    vocab = {}
    for word in corpus.split():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = vocab.get(chars, 0) + 1

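    # The base vocabulary is every distinct symbol; each merge adds one new token,
    # so the number of merges is the target size minus the base size
    base_symbols = {symbol for word_tokens in vocab for symbol in word_tokens}
    num_merges = max(0, vocab_size - len(base_symbols))
    merges = []

    for _ in range(num_merges):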
        # Count adjacent pairs
        pair_counts = {}
        for word_tokens, count in vocab.items():
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + count

        if not pair_counts:
            break

        # Find and merge most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)

        # Apply merge to all words
        new_vocab = {}
        merged = "".join(best_pair)
        for word_tokens, count in vocab.items():
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if i < len(word_tokens) - 1 and (word_tokens[i], word_tokens[i+1]) == best_pair:
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            new_vocab[tuple(new_tokens)] = count
        vocab = new_vocab

    return merges
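
Training only produces the merge list; encoding new text means replaying those merges in the order they were learned. The sketch below is one minimal way to do that for a single word. The function name `apply_bpe` and the toy merge list are assumptions for illustration, not part of any library, and production tokenizers use faster data structures.

```python
def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying learned merges in priority order."""
    # Lower rank = learned earlier = applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(word) + ["</w>"]

    while len(tokens) > 1:
        # Collect every adjacent pair that has a known merge rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # Merge the best-ranked pair (leftmost occurrence on ties).
        _, i = min(candidates)
        tokens[i : i + 2] = [tokens[i] + tokens[i + 1]]

    return tokens


# Hypothetical merge list, shaped like the output of train_bpe on a tiny corpus.
toy_merges = [("l", "o"), ("lo", "w"), ("low", "</w>")]
print(apply_bpe("low", toy_merges))    # ['low</w>']
print(apply_bpe("lower", toy_merges))  # ['low', 'e', 'r', '</w>']
```

A word seen during training collapses to few tokens, while an unseen word falls back to smaller pieces that the merge list can still build.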

Using HuggingFace Tokenizers

Python
from transformers import AutoTokenizer

# Load a pretrained tokenizer
# Gated repository: requires accepting the license and logging in to the Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Warfarin inhibits VKOR enzyme, reducing vitamin K-dependent clotting."

# Encode: text → token IDs
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode back
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# See the actual token strings
token_strings = tokenizer.convert_ids_to_tokens(tokens)
print(f"Tokens: {token_strings}")
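# Word-initial tokens carry a space marker: 'Ġ' for byte-level BPE (GPT-2, Llama 3),
# '▁' for SentencePiece (Llama 2, T5). The exact split depends on the vocabulary.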

# Special tokens
print(f"BOS: {tokenizer.bos_token_id}")  # Beginning of sequence
print(f"EOS: {tokenizer.eos_token_id}")  # End of sequence
print(f"PAD: {tokenizer.pad_token_id}")  # Padding (None if the tokenizer defines no pad token)

WordPiece (BERT)

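WordPiece is similar to BPE but uses a different merge criterion. Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data, commonly approximated by the score count(ab) / (count(a) × count(b)):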

Python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

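# WordPiece marks continuation subwords with a ## prefix
tokens = tokenizer.tokenize("unbelievable")
print(tokens)  # illustrative: ['un', '##believ', '##able'] (a word in the vocabulary stays whole)

# Rare strings are split greedily into the longest matching pieces;
# a word that no sequence of pieces can cover becomes [UNK]
tokens = tokenizer.tokenize("xylyzptlk")
print(tokens)  # illustrative: ['xy', '##ly', '##z', '##pt', '##l', '##k']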
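
To make the difference in merge criterion concrete, the sketch below scores pairs on a toy corpus both ways. It assumes the commonly described WordPiece score count(ab) / (count(a) × count(b)); the function name `pair_stats` and the toy word counts are hypothetical, and the original WordPiece trainer may differ in detail.

```python
from collections import Counter

def pair_stats(words: dict[tuple[str, ...], int]):
    """Return raw pair counts (BPE criterion) and normalized scores (WordPiece-style)."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for tokens, freq in words.items():
        for tok in tokens:
            symbol_counts[tok] += freq
        for pair in zip(tokens, tokens[1:]):
            pair_counts[pair] += freq
    # Normalizing by the parts' counts favors pairs whose parts rarely occur apart.
    scores = {
        (a, b): n / (symbol_counts[a] * symbol_counts[b])
        for (a, b), n in pair_counts.items()
    }
    return pair_counts, scores

# Toy corpus (hypothetical): words already split into pieces, with frequencies.
toy_words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
pair_counts, scores = pair_stats(toy_words)
print(max(pair_counts, key=pair_counts.get))  # ('##u', '##g')  most frequent pair
print(max(scores, key=scores.get))            # ('##g', '##s')  highest normalized score
```

The most frequent pair wins under BPE, while the normalized score prefers `##g` + `##s` because `##s` never appears without `##g` in this corpus.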

Key BERT tokens: [CLS] (prepended to every sequence), [SEP] (separates segments), [MASK] (for masked language modeling), [UNK] (unknown token).
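
At inference time, a WordPiece tokenizer splits each word greedily, always taking the longest vocabulary entry that matches at the current position, and emits [UNK] for the whole word if it gets stuck. The following is a minimal sketch of that loop; `wordpiece_tokenize` and `toy_vocab` are illustrative names, and the vocabulary is invented for the example.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, far smaller than a real one.
toy_vocab = {"un", "##believ", "##able", "##a", "##b"}
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```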


SentencePiece (T5, LLaMA)

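SentencePiece treats the input as a raw character stream, whitespace included, so no language-specific pre-tokenization (such as splitting on spaces) is needed. This makes it language-agnostic; with byte fallback enabled, characters missing from the vocabulary are encoded as UTF-8 bytes: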

Python
import sentencepiece as spm

# Train (on your own data)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="mymodel",
    vocab_size=32000,
    character_coverage=0.9995,  # covers 99.95% of characters in training data
    model_type="bpe",
)

# Use
sp = spm.SentencePieceProcessor()
sp.load("mymodel.model")

text = "warfarin therapy"
ids = sp.encode(text)
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁war', 'far', 'in', '▁therapy']

The ▁ prefix (U+2581, which looks like an underscore) marks tokens that begin a new word, i.e. tokens preceded by a space in the original text.
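
Because the word-boundary marker (U+2581) stands in for the space itself, the original text can be recovered by concatenating the pieces and turning the marker back into spaces. A minimal sketch, with `pieces_to_text` as a hypothetical helper name; in practice the library's own `decode` method does this for you.

```python
def pieces_to_text(pieces: list[str]) -> str:
    """Rebuild text from SentencePiece-style pieces by mapping U+2581 back to spaces."""
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")


pieces = ["\u2581war", "far", "in", "\u2581therapy"]
print(pieces_to_text(pieces))  # warfarin therapy
```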


Tokenizer Gotchas

Numbers are tokenized inefficiently:

Python
tokenizer.tokenize("12345")  # Often ['1', '2', '3', '4', '5'] → 5 tokens
tokenizer.tokenize("2024")   # Might be ['20', '24'] or ['2024'] depending on vocabulary

This is one reason arithmetic is hard for LLMs: "1234 + 5678" may be split into several tokens whose boundaries do not line up with place value, so the model has to carry digits across token boundaries.
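
One design some tokenizers use is to cap digit runs at a fixed length during pre-tokenization. The sketch below imitates that with a regular expression, assuming left-to-right chunks of at most three digits; `chunk_digits` is a hypothetical helper, not a real tokenizer API.

```python
import re

def chunk_digits(number: str, max_len: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most max_len digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number)


print(chunk_digits("12345"))     # ['123', '45']
print(chunk_digits("1234"))      # ['123', '4']
print(chunk_digits("12345", 1))  # ['1', '2', '3', '4', '5']
```

Note that "1234" becomes `['123', '4']`, not `['1', '234']`: left-to-right chunks do not match thousands grouping, which is part of what makes multi-digit arithmetic awkward at the token level.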

Whitespace matters:

Python
tokenizer.encode("hello")   # Different from
tokenizer.encode(" hello")  # Leading space → different token!

Most modern tokenizers (GPT, LLaMA) treat " hello" and "hello" as different tokens, so be careful with whitespace when formatting prompts.
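
Since the space usually belongs to the token that follows it, one common convention is to end the prompt without trailing whitespace and let the continuation start with the space. The helper below is a small sketch of that convention; the name `move_space_to_completion` is made up for this example.

```python
def move_space_to_completion(prompt: str, completion: str) -> tuple[str, str]:
    """Strip trailing spaces from the prompt and make the completion start with one."""
    return prompt.rstrip(" "), " " + completion.lstrip(" ")


prompt, completion = move_space_to_completion("The drug is ", "warfarin")
print(repr(prompt), repr(completion))  # 'The drug is' ' warfarin'
```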

Case sensitivity:

Python
# GPT-4o and LLaMA are case-sensitive at the token level
tokenizer.encode("Warfarin")   # != tokenizer.encode("warfarin")

Language imbalance: Tokenizers trained primarily on English are inefficient for other languages. "Hello" might be 1 token; its Turkish equivalent might be 3–4 tokens, meaning Turkish users get less context per request.
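
A simple way to quantify this imbalance is tokens per word, measured on comparable text in each language. The sketch below accepts any tokenizer as a callable. To stay self-contained it uses a stand-in that emits one token per UTF-8 byte, which is an assumption that only mimics the worst case of byte fallback; swap in a real tokenizer's encode function for meaningful numbers.

```python
from typing import Callable, Sequence

def tokens_per_word(tokenize: Callable[[str], Sequence], text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words) if words else 0.0


def byte_tokens(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


print(tokens_per_word(byte_tokens, "hello world"))    # 5.5
print(tokens_per_word(byte_tokens, "merhaba dünya"))  # 7.0
```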


Counting Tokens for Cost Estimation

Python
import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def estimate_cost(prompt: str, expected_output_tokens: int = 500) -> dict:
    input_tokens = count_tokens(prompt)
    total_tokens = input_tokens + expected_output_tokens

    # GPT-4o pricing (approximate; check the provider's current rates)
    input_cost = input_tokens / 1_000_000 * 2.50   # $2.50 per 1M input tokens
    output_cost = expected_output_tokens / 1_000_000 * 10.00  # $10 per 1M output tokens

    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "total_tokens": total_tokens,
        "estimated_cost_usd": input_cost + output_cost,
    }

prompt = "Explain warfarin's mechanism of action in detail."
print(estimate_cost(prompt))

Vocabulary Design Decisions

| Decision | Trade-off |
|---|---|
| Larger vocabulary (100k+) | Fewer tokens per text, more embedding params, better rare-word coverage |
| Smaller vocabulary (32k) | More tokens per text, fewer params, faster training, more OOV handling |
| Byte fallback | Handles any Unicode, so nothing is truly OOV |
| Language-specific tokens | More efficient for target language, less portable |
| Number tokenization | Token-per-digit vs merged numbers; affects arithmetic ability |

GPT-4 uses ~100k tokens (tiktoken cl100k_base). LLaMA 3 uses 128k tokens. BERT uses 30k tokens with WordPiece.

The vocabulary size directly sets the embedding table size: vocab_size × d_model parameters. For LLaMA-3-8B (vocab_size ≈ 128k, d_model = 4096), that is roughly 0.5B parameters, or about 2.1 GB in fp32 (128k × 4096 × 4 bytes) and about 1 GB in bf16.
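
The arithmetic can be checked in a few lines. This sketch uses the rounded figures from the paragraph above (128,000 tokens, d_model of 4096); the helper name `embedding_table_bytes` is illustrative, and the 2-byte case assumes a 16-bit format such as bf16.

```python
def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int) -> int:
    """Memory for one embedding table: one d_model-sized vector per vocabulary entry."""
    return vocab_size * d_model * bytes_per_param


vocab_size, d_model = 128_000, 4096  # rounded values from the text
params = vocab_size * d_model
print(f"{params:,} parameters")  # 524,288,000 parameters
print(f"fp32: {embedding_table_bytes(vocab_size, d_model, 4) / 1e9:.2f} GB")  # fp32: 2.10 GB
print(f"bf16: {embedding_table_bytes(vocab_size, d_model, 2) / 1e9:.2f} GB")  # bf16: 1.05 GB
```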
