Embeddings: Token and Positional Representations
How transformers convert token IDs into dense vectors. Token embeddings, positional encodings (sinusoidal and learned), and how they combine to form the model's input.
From Token IDs to Vectors
After tokenization, each token is an integer ID. Transformers need continuous vector representations to compute attention and FFN operations. The embedding layer is a learned lookup table:
Token ID → Embedding vector ∈ R^d_model
The embedding table has shape (vocab_size, d_model). Looking up token 42 retrieves row 42, a d_model-dimensional vector that the model has learned to associate with that token.
import torch
import torch.nn as nn
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        # Returns: (batch, seq_len, d_model)
        # Scale by sqrt(d_model), as in the original paper
        return self.embedding(token_ids) * (self.d_model ** 0.5)

The sqrt(d_model) scaling keeps the embedding magnitudes in a reasonable range relative to the positional encoding magnitudes added next.
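One way to see why the embedding layer is "just a lookup table" is that selecting row i gives the same result as multiplying a one-hot vector by the table. The sketch below uses plain Python lists and made-up values so it runs without PyTorch; `table`, `lookup`, and `one_hot_matmul` are illustrative names, not part of any library.

```python
# Toy embedding table: vocab_size = 4 rows, d_model = 3 columns (made-up values).
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(token_id: int) -> list[float]:
    """Embedding lookup: return the row for this token ID."""
    return table[token_id]


def one_hot_matmul(token_id: int) -> list[float]:
    """Same result computed as one_hot(token_id) @ table."""
    vocab_size, d_model = len(table), len(table[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * table[i][j] for i in range(vocab_size))
        for j in range(d_model)
    ]


print(lookup(2))
print(one_hot_matmul(2))
```

The lookup form is what frameworks actually execute, since it avoids materializing a vocab_size-wide one-hot vector per token.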
Why Position Matters
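Self-attention on its own has no notion of order. It is permutation-equivariant: permuting the input tokens simply permutes the outputs in the same way, so without positional information "dog bites man" and "man bites dog" are the same bag of tokens to the model. The model must receive explicit position signals so it knows token order.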
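The order-blindness described above can be checked on a toy example. This is a minimal sketch, assuming a single attention head with identity Q/K/V projections and hand-picked 3-dimensional token vectors; real layers use learned projections, but the same property holds for them.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs: list[list[float]]) -> list[list[float]]:
    """Single-head attention with identity projections and no position signal."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out


# Made-up token vectors, one per word.
dog, bites, man = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]

out_a = self_attention([dog, bites, man])  # "dog bites man"
out_b = self_attention([man, bites, dog])  # "man bites dog"

# The output for "dog" is the same in both sentences, only its slot moved.
dog_matches = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(out_a[0], out_b[2]))
print(dog_matches)
```

Because each token's output is unchanged by the reordering, nothing downstream can recover who bit whom unless position is injected somewhere.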
Position information is added to the token embeddings before the first attention layer:
Input representation = Token Embedding + Positional Encoding
Sinusoidal Positional Encoding (Original Transformer)
The original "Attention is All You Need" paper uses fixed sinusoidal encodings:
import torch
import math
def sinusoidal_positional_encoding(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings (not learned)."""
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
    # Dimension-dependent frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    return pe  # (max_seq_len, d_model)
# Visualize: each row is a position's encoding
pe = sinusoidal_positional_encoding(100, 512)
print(f"Shape: {pe.shape}")  # torch.Size([100, 512])
print(f"Position 0: {pe[0, :4]}")  # tensor([0., 1., 0., 1.])
print(f"Position 10: {pe[10, :4]}")  # Different pattern
Properties of sinusoidal encodings:
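- No parameters to learn. The encoding can be computed for any position, including positions beyond the training length, although model quality on much longer sequences usually still degrades in practice
- For a fixed offset k, PE(pos + k) is a linear function of PE(pos), and the dot product of two position vectors depends only on their offset, so relative position is easy for the model to recover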
- Different frequencies at different dimensions allow the model to attend to both local and global position relationships
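The relative-position property in the list above can be checked numerically. The sketch below recomputes the same sin/cos encoding for a single position in plain Python (`sinusoidal_pe` and `dot` are illustrative helpers) and compares two pairs of positions that share an offset of 5.

```python
import math


def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """Encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(0, d_model, 2):
        freq = math.exp(-math.log(10000.0) * i / d_model)
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


d_model = 64
# Same offset (5), different absolute positions.
near = dot(sinusoidal_pe(3, d_model), sinusoidal_pe(8, d_model))
far = dot(sinusoidal_pe(40, d_model), sinusoidal_pe(45, d_model))
print(abs(near - far) < 1e-9)
```

The match follows from sin(a)sin(b) + cos(a)cos(b) = cos(a - b): each sin/cos pair contributes a term that depends only on the difference between the two positions.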
Learned Positional Embeddings (BERT, GPT)
BERT and GPT-2 use a learned embedding table for positions, which is just another nn.Embedding:
class TransformerInput(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        batch_size, seq_len = token_ids.shape
        # Create position indices [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        positions = positions.expand(batch_size, -1)
        # Combine token and position embeddings.
        # The sqrt(d_model) scaling follows the original Transformer;
        # BERT and GPT-2 add the two embeddings without it.
        tok_emb = self.token_embedding(token_ids) * (self.d_model ** 0.5)
        pos_emb = self.position_embedding(positions)
        return self.dropout(tok_emb + pos_emb)

Limitation: learned position embeddings don't generalize beyond max_seq_len, because positions past the end of the table have no trained row. GPT-2 was trained with 1024 positions, so it cannot be directly applied to sequences of length 2048.
Rotary Position Embeddings (RoPE)
RoPE (Su et al., 2021) is used in LLaMA, Mistral, and most modern open-source models. Instead of adding position to embeddings before attention, RoPE rotates the query and key vectors within the attention computation:
def apply_rotary_embeddings(
    x: torch.Tensor,    # (batch, seq_len, num_heads, head_dim)
    cos: torch.Tensor,  # (max_seq_len, head_dim), precomputed cosines
    sin: torch.Tensor,  # (max_seq_len, head_dim), precomputed sines
) -> torch.Tensor:
    """Apply rotary position embedding to q or k vectors."""
    seq_len, head_dim = x.shape[1], x.shape[-1]
    # Slice to the actual length and add a heads axis so cos/sin broadcast over x
    cos = cos[:seq_len].unsqueeze(1)  # (seq_len, 1, head_dim)
    sin = sin[:seq_len].unsqueeze(1)
    # Pair dimension i with i + head_dim/2, matching the [freqs, freqs] layout below
    x1 = x[..., : head_dim // 2]   # First half
    x2 = x[..., head_dim // 2 :]   # Second half
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * cos + rotated * sin

def precompute_rope_freqs(head_dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute cosine and sine for each position."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    freqs = torch.outer(positions, theta)      # (max_seq_len, head_dim/2)
    freqs = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, head_dim)
    return freqs.cos(), freqs.sin()

RoPE advantages:
- Position information is relative: the dot product between Q and K vectors depends on their relative position, not on their absolute positions
- Can be extended beyond training length with techniques like YaRN or dynamic NTK scaling
- No extra parameters
- Often reported to outperform both sinusoidal and learned absolute positions on long sequences
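The "relative, not absolute" property in the list above can be checked with a few lines of plain Python. This is a minimal sketch, assuming a 4-dimensional head and made-up query/key values; `rope_rotate` pairs consecutive dimensions, which is one common RoPE layout and differs from a half-split layout only by a fixed permutation of dimensions.

```python
import math


def rope_rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angle pos * theta_i."""
    head_dim = len(vec)
    out = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors for one head.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.6, 0.9]


def score(m: int, n: int) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return dot(rope_rotate(q, m), rope_rotate(k, n))


same_offset = abs(score(2, 5) - score(10, 13)) < 1e-9   # both offsets are 3
different_offset = abs(score(2, 5) - score(2, 9)) > 1e-6  # offsets 3 vs 7
print(same_offset, different_offset)
```

Rotating the query by angle m and the key by angle n leaves a net rotation of n - m inside the dot product, which is why shifting both positions by the same amount does not change the score.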
Embedding Dimensions and Parameters
# Parameter count for embeddings
vocab_size = 32_000
d_model = 4_096
max_seq_len = 4_096
token_embedding_params = vocab_size * d_model  # ~131M parameters
position_embedding_params = max_seq_len * d_model  # ~16.8M parameters (learned)
# RoPE: 0 extra parameters
total_embedding_params = token_embedding_params + position_embedding_params
print(f"Token embedding: {token_embedding_params / 1e6:.0f}M params")
print(f"Position embedding: {position_embedding_params / 1e6:.1f}M params")

For LLaMA-3-8B (vocab_size=128k, d_model=4096), the token embedding table alone is about 524M parameters, roughly 6.5% of the total model. This is why large vocabularies have real memory costs.
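The LLaMA-3-8B figures above can be reproduced with a short calculation. The vocabulary size is the rounded 128k from the text; the total parameter count (8.03e9) and 16-bit weights are assumptions made for this sketch, not values taken from this article.

```python
vocab_size = 128_000    # rounded, as in the text
d_model = 4_096
total_params = 8.03e9   # assumed total parameter count for the 8B model
bytes_per_param = 2     # assumes 16-bit weights (fp16 or bf16)

embedding_params = vocab_size * d_model
embedding_share = embedding_params / total_params
embedding_gib = embedding_params * bytes_per_param / 2**30

print(f"Token embedding: {embedding_params / 1e6:.0f}M params")
print(f"Share of model:  {embedding_share:.1%}")
print(f"Memory (16-bit): {embedding_gib:.2f} GiB")
```

If the output projection is not tied to the embedding table, the same cost is paid a second time at the output layer.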
Embedding Initialization and Tying Weights
Weight tying (used in GPT-2 and many others) shares weights between the token embedding table and the final output projection layer (the layer that converts the last hidden state back to logits over the vocabulary):
class TransformerWithTiedWeights(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):  # other hyperparameters omitted
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # ... transformer layers ...
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: both modules share the same underlying tensor.
        # nn.Linear stores its weight as (vocab_size, d_model), the same shape as the embedding table.
        self.output_projection.weight = self.embedding.weight

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        # ... transformer layers ...
        logits = self.output_projection(x)  # uses same weights as embedding
        return logits

Weight tying reduces parameters by vocab_size × d_model and is often reported to help training, since the model must learn representations that work both for input lookup and output prediction.
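To make the sharing concrete without PyTorch, the sketch below uses one plain Python table for both directions: rows are read for input lookup, and the same rows are dotted with a hidden state to produce logits. All names and values are illustrative.

```python
# Toy table with vocab_size = 3 and d_model = 2 (made-up values).
shared_table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
]


def embed(token_id: int) -> list[float]:
    """Input side: look up a row of the shared table."""
    return shared_table[token_id]


def output_logits(hidden: list[float]) -> list[float]:
    """Output side: dot the hidden state with every row of the same table."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared_table]


def params_saved_by_tying(vocab_size: int, d_model: int) -> int:
    """An untied model needs a second (vocab_size, d_model) matrix."""
    return vocab_size * d_model


# Changing one row is visible through both the input and the output path.
shared_table[1] = [0.0, 2.0]
print(embed(1))
print(output_logits([0.0, 1.0]))
print(params_saved_by_tying(32_000, 4_096))
```

With the 32k-token, 4096-dimensional configuration from the previous section, tying removes one full copy of the roughly 131M-parameter table.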
What Embeddings Learn
After training, token embedding vectors encode semantic and syntactic relationships:
- Similar tokens cluster in embedding space (e.g., "cat", "dog", "rabbit" are near each other)
- Analogical relationships form linear structures (king - man + woman ≈ queen)
- Syntactic categories (nouns, verbs) tend to occupy different regions
These relationships emerge purely from the training objective (next-token prediction): the model learns them because they're useful for predicting the next token.
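Clustering and analogy structure are usually probed with cosine similarity and vector arithmetic. The sketch below shows the mechanics on hand-made 2-D vectors (axis 0 loosely "royalty", axis 1 loosely "gender"). These are not real learned embeddings, and `toy`, `cosine`, and `analogy` are illustrative names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors for illustration only, not taken from a trained model.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "cat": [-1.0, 0.2],
    "dog": [-1.0, 0.3],
}


def analogy(a: str, b: str, c: str) -> str:
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, toy[w]))


print(analogy("king", "man", "woman"))
print(cosine(toy["cat"], toy["dog"]) > cosine(toy["cat"], toy["king"]))
```

With real embeddings the same two functions apply unchanged; only the dictionary would be replaced by rows of the trained embedding table.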