Embeddings: Token and Positional Representations

From Token IDs to Vectors

After tokenization, each token is an integer ID. Attention and feed-forward (FFN) layers operate on continuous vectors, so the model first maps each ID to a vector. The embedding layer that does this is a learned lookup table:

Token ID → Embedding vector ∈ R^d_model

The embedding table has shape (vocab_size, d_model). Looking up token 42 retrieves row 42 — a d_model-dimensional vector that the model has learned to associate with that token.
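
To make the lookup concrete, here is a minimal sketch that checks three equivalent ways of getting the same vectors: calling an nn.Embedding, indexing rows of its weight matrix, and multiplying one-hot vectors by that matrix. The sizes (vocab_size = 100, d_model = 8) are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 100, 8  # illustrative sizes, not from any real model
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[42, 7, 42]])  # (batch=1, seq_len=3)
looked_up = embedding(token_ids)         # (1, 3, d_model)

# Same result as plain row indexing into the weight table
indexed = embedding.weight[token_ids]

# Same result as one-hot vectors times the table (much slower in practice)
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (1, 3, vocab_size)
via_matmul = one_hot @ embedding.weight                         # (1, 3, d_model)

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

Both occurrences of token 42 map to the same row, which is why the embedding alone carries no information about where a token sits in the sequence.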

Python

import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        # Returns: (batch, seq_len, d_model)
        # Scale by sqrt(d_model) — from the original paper
        return self.embedding(token_ids) * (self.d_model ** 0.5)

The paper gives no rationale for the sqrt(d_model) scaling. A common reading is that it keeps the token embeddings from being small relative to the positional encodings added next, whose entries lie in [-1, 1].

Why Position Matters

Self-attention on its own is permutation-equivariant: reordering the input tokens only reorders the outputs, so each token gets the same representation in "dog bites man" as in "man bites dog". The model must receive explicit position signals to know token order.

Position information is added to the token embeddings before the first attention layer:

Input representation = Token Embedding + Positional Encoding
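
The sketch below illustrates the point with a single unmasked attention head and no positional signal: reordering the input rows just reorders the output rows. The weights and inputs are random placeholders, and the sizes are arbitrary; a causal mask would change the picture and is not covered here.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 3, 4  # illustrative sizes

# Hypothetical single-head attention weights, randomly initialised
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention, no mask, no positions."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (d_model ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(seq_len, d_model)  # stand-ins for "dog", "bites", "man"
perm = torch.tensor([2, 1, 0])     # reorder to "man", "bites", "dog"

out = self_attention(x)
out_permuted = self_attention(x[perm])

# Reordering the inputs only reorders the outputs
print(torch.allclose(out[perm], out_permuted, atol=1e-5))
```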

Sinusoidal Positional Encoding (Original Transformer)

The original "Attention Is All You Need" paper (Vaswani et al., 2017) uses fixed sinusoidal encodings:

Python

import torch
import math

def sinusoidal_positional_encoding(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings — not learned."""
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

    # Dimension-dependent frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

    return pe  # (max_seq_len, d_model)

# Visualize: each row is a position's encoding
pe = sinusoidal_positional_encoding(100, 512)
print(f"Shape: {pe.shape}")  # (100, 512)
print(f"Position 0: {pe[0, :4]}")   # [0, 1, 0, 1, ...]
print(f"Position 10: {pe[10, :4]}") # Different pattern

Properties of sinusoidal encodings:

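No parameters to learn, and the encoding can be computed for any position, including positions beyond the training length (though quality at unseen lengths is not guaranteed)
For a fixed offset k, the encoding at position pos + k is a linear function of the encoding at pos, and the dot product of two position vectors depends only on their offset, so relative position is easy for the model to recover
Different frequencies at different dimensions let the model resolve both local (high-frequency) and long-range (low-frequency) position relationships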
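
The relative-position property can be checked numerically. This is a minimal sketch in plain Python, using the same frequency formula as the function above; d_model = 64 and the chosen positions are arbitrary. Since sin(a)sin(b) + cos(a)cos(b) = cos(a - b), the dot product of two encodings should depend only on the offset between the positions.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding for one position (sin on even dims, cos on odd dims)."""
    pe = []
    for i in range(0, d_model, 2):
        freq = math.exp(-math.log(10000.0) * i / d_model)
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

d_model = 64  # illustrative size

# Two pairs of positions with the same offset of 5
near = dot(positional_encoding(3, d_model), positional_encoding(8, d_model))
far = dot(positional_encoding(40, d_model), positional_encoding(45, d_model))

# A pair with a different offset of 17
other = dot(positional_encoding(3, d_model), positional_encoding(20, d_model))

print(math.isclose(near, far, rel_tol=1e-9))     # same offset
print(math.isclose(near, other, rel_tol=1e-6))   # different offset
```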

Learned Positional Embeddings (BERT, GPT)

BERT and GPT-2 instead use a learned embedding table for positions, which is just another nn.Embedding:

Python

class TransformerInput(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        batch_size, seq_len = token_ids.shape

        # Create position indices [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        positions = positions.expand(batch_size, -1)

        # Combine token and position embeddings.
        # The sqrt(d_model) scaling follows the original Transformer; BERT and GPT-2 omit it.
        tok_emb = self.token_embedding(token_ids) * (self.d_model ** 0.5)
        pos_emb = self.position_embedding(positions)

        return self.dropout(tok_emb + pos_emb)

Limitation: Learned position embeddings don't generalize beyond max_seq_len, because the table has no rows for later positions. GPT-2 was trained with 1024 positions, so it cannot be applied directly to sequences of length 2048.
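
A small sketch of what this limit looks like in practice, using a deliberately tiny table (max_seq_len = 8 is an illustrative value, not GPT-2's). On CPU, asking nn.Embedding for a position outside the table raises an IndexError.

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 8, 16  # tiny illustrative sizes
position_embedding = nn.Embedding(max_seq_len, d_model)

# Positions 0..7 are inside the table
ok = position_embedding(torch.arange(max_seq_len))
print(ok.shape)

# Position 8 has no row, so the lookup fails
try:
    position_embedding(torch.tensor([max_seq_len]))
    lookup_failed = False
except IndexError as err:
    lookup_failed = True
    print(f"IndexError: {err}")
```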

Rotary Position Embeddings (RoPE)

RoPE (Su et al., 2021) is used in LLaMA, Mistral, and many other modern open-source models. Instead of adding a position vector to the embeddings before attention, RoPE rotates the query and key vectors by a position-dependent angle inside the attention computation:

Python

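import torch

def apply_rotary_embeddings(
    x: torch.Tensor,  # (batch, seq_len, num_heads, head_dim)
    cos: torch.Tensor,  # (max_seq_len, head_dim), precomputed cosines
    sin: torch.Tensor,  # (max_seq_len, head_dim), precomputed sines
) -> torch.Tensor:
    """Apply rotary position embedding to q or k vectors."""
    seq_len = x.shape[1]
    # Reshape to (1, seq_len, 1, head_dim) so the tables broadcast over batch and heads
    cos = cos[:seq_len].unsqueeze(0).unsqueeze(2)
    sin = sin[:seq_len].unsqueeze(0).unsqueeze(2)
    # Pair dimension i with dimension i + head_dim/2, matching the
    # cat([freqs, freqs]) layout produced by precompute_rope_freqs
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * cos + rotated * sin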

def precompute_rope_freqs(head_dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute cosine and sine for each position."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    freqs = torch.outer(positions, theta)  # (max_seq_len, head_dim/2)
    freqs = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, head_dim)
    return freqs.cos(), freqs.sin()

RoPE advantages:

Position information is relative: the dot product between rotated Q and K vectors depends only on the offset between their positions, not on the absolute positions
Can be extended beyond training length with techniques like YaRN or dynamic NTK scaling
No extra parameters
Often reported to outperform both sinusoidal and learned absolute positions on long sequences
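
The relative-position claim can be checked on a single pair of dimensions. This is a minimal sketch in plain Python that treats one dimension pair as a complex number, so that rotating by an angle is multiplication by a unit complex exponential. The frequency theta and the query and key values are made-up numbers for illustration.

```python
import cmath

def rotate(pair: complex, position: int, theta: float) -> complex:
    """Rotate one dimension pair, viewed as a complex number, by position * theta."""
    return pair * cmath.exp(1j * position * theta)

def score(q: complex, k: complex) -> float:
    """Dot product of the two 2-D vectors represented by q and k."""
    return (q * k.conjugate()).real

theta = 0.1                     # hypothetical frequency for this dimension pair
q, k = 1.0 + 2.0j, 0.5 - 1.5j   # arbitrary query and key values

# Same relative offset (3) at different absolute positions
s_near = score(rotate(q, 5, theta), rotate(k, 2, theta))
s_far = score(rotate(q, 105, theta), rotate(k, 102, theta))

# Different relative offset (10)
s_other = score(rotate(q, 12, theta), rotate(k, 2, theta))

print(abs(s_near - s_far) < 1e-9)     # same offset gives the same score
print(abs(s_near - s_other) > 1e-3)   # a different offset changes the score
```

A full head applies the same rotation to every dimension pair with its own frequency, so the whole Q·K score inherits this property.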

Embedding Dimensions and Parameters

Python

# Parameter count for embeddings
vocab_size = 32_000
d_model = 4_096
max_seq_len = 4_096

token_embedding_params = vocab_size * d_model         # ~131M parameters
position_embedding_params = max_seq_len * d_model     # ~16.8M parameters (learned)
# RoPE: 0 extra parameters

total_embedding_params = token_embedding_params + position_embedding_params
print(f"Token embedding: {token_embedding_params / 1e6:.0f}M params")
print(f"Position embedding: {position_embedding_params / 1e6:.0f}M params")
print(f"Total: {total_embedding_params / 1e6:.0f}M params")

For LLaMA-3-8B (vocab_size=128,256, d_model=4096), the token embedding table alone is about 525M parameters, roughly 6.5% of the model's approximately 8B parameters. This is why large vocabularies have real memory costs.
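
The arithmetic behind that figure, worked through with the exact Llama 3 vocabulary size of 128,256 rather than a rounded 128k. The 2 bytes per parameter is an assumption (16-bit weights such as fp16 or bf16); other precisions scale the memory figure accordingly.

```python
vocab_size = 128_256   # Llama 3 tokenizer size
d_model = 4_096
bytes_per_param = 2    # assumes 16-bit weights (fp16 or bf16)

embedding_params = vocab_size * d_model
embedding_bytes = embedding_params * bytes_per_param

print(f"Token embedding: {embedding_params / 1e6:.0f}M params")   # 525M params
print(f"Memory at 16-bit: {embedding_bytes / 1e9:.2f} GB")        # 1.05 GB
```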

Embedding Initialization and Tying Weights

Weight tying (used in GPT-2 and many others) shares weights between the token embedding table and the final output projection layer (the layer that converts the last hidden state back to logits over the vocabulary):

Python

import torch
import torch.nn as nn

class TransformerWithTiedWeights(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # ... transformer layers would be defined here ...
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights: both modules share one Parameter of shape (vocab_size, d_model)
        self.output_projection.weight = self.embedding.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)
        # ... transformer layers would be applied here ...
        logits = self.output_projection(x)  # uses same weights as embedding
        return logits

Weight tying removes vocab_size × d_model parameters and is often reported to improve language-modeling quality, since the shared matrix must work for both input lookup and output prediction.
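
A minimal sketch of the parameter saving, using two standalone modules rather than a full model. The sizes are illustrative, and wrapping the modules in an nn.ModuleDict is just a convenient way to count parameters here; parameters() yields a shared Parameter only once.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1_000, 64  # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
output_projection = nn.Linear(d_model, vocab_size, bias=False)

# Before tying: two separate (vocab_size, d_model) matrices
untied = embedding.weight.numel() + output_projection.weight.numel()

# Tie: both modules now hold the same Parameter object
output_projection.weight = embedding.weight
model = nn.ModuleDict({"embedding": embedding, "output_projection": output_projection})

# After tying: the shared matrix is counted once
tied = sum(p.numel() for p in model.parameters())

print(untied - tied == vocab_size * d_model)
print(output_projection.weight is embedding.weight)
```

Because the two modules share one tensor, a gradient step on the output projection also moves the input embeddings.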

What Embeddings Learn

After training, token embedding vectors encode semantic and syntactic relationships:

Similar tokens cluster in embedding space (e.g., "cat", "dog", "rabbit" are near each other)
Analogical relationships can form approximately linear structures (the classic example is king - man + woman ≈ queen)
Syntactic categories (nouns, verbs) tend to occupy different regions

These relationships emerge purely from the training objective (next-token prediction) — the model learns them because they're useful for predicting the next token.
