Positional Encoding
Why transformers need position info; sinusoidal encoding with sin/cos; learned vs fixed; RoPE (rotary positional encoding); ALiBi; code examples.
The Problem: Transformers Are Permutation-Invariant
Unlike RNNs, a transformer processes all tokens in parallel via attention. But the attention formula softmax(Q × K^T / sqrt(d_k)) × V is permutation-equivariant (often loosely called permutation-invariant): if you shuffle the input tokens, the output tokens shuffle identically. The model has no native sense of order.
Positional encoding injects position information into the token representations so that "dog bites man" and "man bites dog" produce different outputs.
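The shuffle behaviour described above can be checked numerically. The following is a minimal NumPy sketch, not a full transformer layer: the helper names `softmax` and `self_attention`, the toy sizes, and the random weights are all assumptions made for illustration, and the "positional" vectors are random stand-ins rather than any particular encoding.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 5, 8                               # toy sizes (assumption)
x = rng.normal(size=(T, d))               # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])          # a fixed, non-trivial shuffle

# Without positions: shuffling the inputs just shuffles the outputs.
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)
print(np.allclose(out_shuffled, out[perm]))          # True

# With a position-dependent vector added, token order starts to matter.
pos_vecs = rng.normal(size=(T, d))        # random stand-in for a positional encoding
out_pos = self_attention(x + pos_vecs, w_q, w_k, w_v)
out_pos_shuffled = self_attention(x[perm] + pos_vecs, w_q, w_k, w_v)
print(np.allclose(out_pos_shuffled, out_pos[perm]))  # False
```

The second check is the "dog bites man" case in miniature: the same tokens in a different order now yield different outputs, because each token is combined with the vector of the slot it occupies.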
Option 1: Sinusoidal Encoding (Original Transformer)
The original 2017 paper "Attention Is All You Need" used fixed sinusoidal functions:
PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )
PE(pos, 2i + 1) = cos( pos / 10000^(2i/d_model) )
Where:
- pos is the token position (0, 1, 2, ...)
- i is the dimension index (0 to d_model/2 - 1)
- Even dimensions use sine, odd dimensions use cosine
Properties:
- Each position gets a unique d_model-dimensional vector
- Nearby positions have similar encodings (smooth)
- The encoding generalises beyond training length (in theory)
- No parameters to learn
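These properties can be spot-checked for a single configuration. The sketch below is an illustration only: `sinusoid_pe` is a hypothetical helper that evaluates the two formulas for one position, and `d_model = 64`, the 200-position table, and the reference position 50 are arbitrary choices.

```python
import numpy as np

def sinusoid_pe(pos: int, d_model: int) -> np.ndarray:
    """Encoding of a single position, following the two formulas above."""
    i = np.arange(d_model // 2)                    # pair index i
    angles = pos / 10000.0 ** (2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)                      # even dimensions
    pe[1::2] = np.cos(angles)                      # odd dimensions
    return pe

d_model = 64
table = np.stack([sinusoid_pe(p, d_model) for p in range(200)])

# Uniqueness: every one of the 200 rows is distinct.
print(len(np.unique(table, axis=0)))               # 200

# Smoothness: the dot product with position 50 shrinks as the offset grows.
ref = table[50]
for offset in (1, 5, 50):
    print(offset, round(float(ref @ table[50 + offset]), 2))

# No parameters and no table limit: any position can be evaluated directly.
print(sinusoid_pe(100_000, d_model).shape)         # (64,)
```

The dot product between two encodings equals the sum of cos(k × frequency) over all frequency pairs, where k is the offset, so it depends only on the offset and is largest (d_model/2) at offset 0.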
```python
import numpy as np
import matplotlib.pyplot as plt


def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """
    Returns positional encoding matrix of shape (max_len, d_model).
    Assumes d_model is even.
    """
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    # dim_indices holds the values 2i = 0, 2, 4, ...
    dim_indices = np.arange(0, d_model, 2)[np.newaxis, :]    # (1, d_model/2)
    # Frequency denominator: 10000^(2i/d_model)
    div_term = np.power(10000.0, dim_indices / d_model)      # (1, d_model/2)
    PE[:, 0::2] = np.sin(positions / div_term)               # even dims
    PE[:, 1::2] = np.cos(positions / div_term)               # odd dims
    return PE


# Visualise
pe = sinusoidal_encoding(max_len=50, d_model=128)
print("Shape:", pe.shape)  # (50, 128)

plt.figure(figsize=(10, 4))
plt.imshow(pe, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
plt.xlabel("Dimension")
plt.ylabel("Position")
plt.title("Sinusoidal Positional Encoding")
plt.colorbar()
plt.tight_layout()
plt.savefig("sinusoidal_pe.png", dpi=150)
plt.show()
```
PyTorch Module
```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding added to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
        # exp(-2i * ln(10000) / d_model) == 1 / 10000^(2i/d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch
        # A buffer moves with the module (.to, .cuda) but is not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Add positional encoding to x of shape (B, T, d_model)."""
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


# Test
pe_module = SinusoidalPositionalEncoding(d_model=256)
x = torch.randn(4, 20, 256)
out = pe_module(x)
print("Output shape:", out.shape)  # (4, 20, 256)
```
Option 2: Learned Positional Embeddings
Instead of a fixed formula, each position gets a learnable embedding vector. This is how BERT and GPT-2 handle positions.
```python
import torch
import torch.nn as nn


class LearnedPositionalEncoding(nn.Module):
    """Trainable position embedding table."""

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        # Shape: (max_len, d_model), one vector per position
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        positions = torch.arange(T, device=x.device)    # (T,)
        return x + self.position_embedding(positions)   # broadcast over batch


# BERT-style embedding layer
class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        B, T = token_ids.shape
        pos = torch.arange(T, device=token_ids.device)
        emb = self.token_emb(token_ids) + self.pos_emb(pos)
        return self.dropout(self.norm(emb))
```
Learned vs fixed comparison:
| Property | Sinusoidal | Learned |
|----------|-----------|---------|
| Parameters | 0 | max_len × d_model |
| Extrapolates beyond max_len | Yes (theoretically) | No |
| Fits training data better | No | Yes (memorises position) |
| Used by | Original Transformer | BERT, GPT-2, GPT-3 |
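The "Parameters" and "Extrapolates" rows can be illustrated with a toy lookup. This is a sketch under stated assumptions: `learned_table` holds random numbers standing in for trained values, `fixed_pe` is a hypothetical helper for the sinusoidal formula, and the sizes are arbitrary.

```python
import numpy as np

max_len, d_model = 16, 8           # toy sizes (assumption)
rng = np.random.default_rng(0)

# A learned table has exactly max_len rows (random stand-ins for trained values).
learned_table = rng.normal(size=(max_len, d_model))
print("Learned parameters:", learned_table.size)    # 128 = max_len * d_model

def fixed_pe(pos: int) -> np.ndarray:
    """Sinusoidal encoding for one position; defined for any non-negative pos."""
    i = np.arange(d_model // 2)
    angles = pos / 10000.0 ** (2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2], pe[1::2] = np.sin(angles), np.cos(angles)
    return pe

pos = max_len + 4                  # a position the table has no row for
try:
    learned_table[pos]
    learned_ok = True
except IndexError:
    learned_ok = False             # lookup fails: the table simply ends

print("Learned lookup beyond max_len works:", learned_ok)                   # False
print("Fixed formula beyond max_len works:", fixed_pe(pos).shape == (d_model,))  # True
```

That the formula returns a vector for an unseen position only shows that the encoding is defined there; whether a trained model handles such positions well is a separate question, which is why the table says "theoretically".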
Option 3: Rotary Positional Encoding (RoPE)
RoPE (Su et al. 2021) is the dominant approach in modern LLMs (LLaMA, Mistral, Qwen, Gemma). Instead of adding a positional vector to the embedding, RoPE rotates the query and key vectors by an angle proportional to their position.
The key insight: for dot-product attention q · k, if we rotate q by angle m*θ and k by angle n*θ, then:
RoPE(q, m) · RoPE(k, n) = f(q, k, m - n)
The dot product depends only on the relative position m - n, not on the absolute positions. This naturally gives attention a relative positional bias.
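The relative-position property can be checked numerically for a single pair of dimensions. This is a minimal sketch rather than a full RoPE implementation: `rope_2d` is a hypothetical helper that applies one 2-D rotation, and the frequency `theta = 0.1` and the positions are arbitrary.

```python
import numpy as np

def rope_2d(vec: np.ndarray, pos: int, theta: float) -> np.ndarray:
    """Rotate a 2-D vector by the angle pos * theta."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.1                          # one illustrative frequency (assumption)

# Two (m, n) pairs with the same offset m - n = 3.
score_a = rope_2d(q, 5, theta) @ rope_2d(k, 2, theta)
score_b = rope_2d(q, 105, theta) @ rope_2d(k, 102, theta)
# One pair with a different offset m - n = 7.
score_c = rope_2d(q, 9, theta) @ rope_2d(k, 2, theta)

print(np.isclose(score_a, score_b))  # True: same offset, same score
print(np.isclose(score_a, score_c))  # False: a different offset changes the score
```

The reason is that the two rotations combine into a single rotation by (n - m) × θ inside the dot product, so shifting both positions by the same amount cancels out. Full RoPE applies this to every pair of dimensions, each with its own frequency.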
```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map the two halves (x1, x2) to (-x2, x1), a 90-degree rotation of each pair."""
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rope(q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """
    Apply rotary positional encoding to Q and K.
    q, k: (B, h, T, d_k)
    cos, sin: (T, d_k), precomputed cos/sin of the rotation angles
    """
    cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, T, d_k)
    sin = sin.unsqueeze(0).unsqueeze(0)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot


def build_rope_cache(max_len: int, d_k: int, base: float = 10000.0):
    """Precompute cos and sin tables of shape (max_len, d_k)."""
    theta = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(max_len).float()
    freqs = torch.outer(positions, theta)       # (max_len, d_k/2)
    # Duplicate so dimension j and j + d_k/2 share an angle, matching rotate_half
    freqs = torch.cat([freqs, freqs], dim=-1)   # (max_len, d_k)
    return freqs.cos(), freqs.sin()


# Quick test
B, h, T, d_k = 2, 8, 16, 64
cos_cache, sin_cache = build_rope_cache(T, d_k)
q = torch.randn(B, h, T, d_k)
k = torch.randn(B, h, T, d_k)
q_rot, k_rot = apply_rope(q, k, cos_cache, sin_cache)
print("Rotated Q shape:", q_rot.shape)  # (2, 8, 16, 64)
```
Option 4: ALiBi (Attention with Linear Biases)
ALiBi (Press et al. 2022) adds a negative bias to attention logits proportional to the distance between tokens:
attn_logit(i, j) = q_i · k_j / sqrt(d_k) - slope_h × |i - j|
Each head h gets a different slope slope_h. Closer tokens get less penalty; farther tokens get more. No positional vectors are added to embeddings at all.
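To make the formula concrete, the sketch below applies it for one head with every content score set to zero, so the distance penalty is the only thing shaping the attention weights. The slope value `0.5`, the sequence length, and the `softmax` helper are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 6
slope = 0.5                                    # illustrative slope for one head
positions = np.arange(T)
distance = np.abs(positions[:, None] - positions[None, :])   # |i - j|, shape (T, T)

content_logits = np.zeros((T, T))              # pretend q_i . k_j / sqrt(d_k) is 0 everywhere
logits = content_logits - slope * distance     # the ALiBi formula for this head
weights = softmax(logits)

np.set_printoptions(precision=3, suppress=True)
print(weights[3])   # attention of token 3 over all tokens; peaks at j = 3
```

With real content scores the penalty acts as a soft preference for nearby tokens rather than a hard window: a strong enough query-key match can still outweigh it. A larger slope makes the preference sharper.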
```python
import torch


def build_alibi_slopes(n_heads: int) -> torch.Tensor:
    """Compute per-head ALiBi slopes."""
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ...
    # (this simple form is intended for n_heads that is a power of two)
    m = torch.arange(1, n_heads + 1, dtype=torch.float32)
    slopes = 2 ** (-8.0 * m / n_heads)
    return slopes  # (n_heads,)


def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """
    Returns ALiBi bias tensor of shape (n_heads, seq_len, seq_len).
    bias[h, i, j] = -slope_h * |i - j|
    """
    slopes = build_alibi_slopes(n_heads)  # (h,)
    positions = torch.arange(seq_len, dtype=torch.float32)
    distances = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()  # (T, T)
    # (h, 1, 1) × (1, T, T) -> (h, T, T)
    bias = -slopes.view(-1, 1, 1) * distances.unsqueeze(0)
    return bias


# Test
bias = alibi_bias(seq_len=10, n_heads=8)
print("ALiBi bias shape:", bias.shape)   # (8, 10, 10)
print("Head 0, row 5:", bias[0, 5, :])   # Most negative values far from position 5
```
Comparison Summary
| Method | Added to | Relative? | Extrapolates? | Used by |
|--------|----------|-----------|---------------|---------|
| Sinusoidal | Embeddings | No | Partially | Transformer (2017) |
| Learned | Embeddings | No | No | BERT, GPT-2 |
| RoPE | Q and K only | Yes | Yes (with tricks) | LLaMA, Mistral, Qwen |
| ALiBi | Attention logits | Yes | Yes | BLOOM, MPT |
YaRN: Extending Context with RoPE
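Models trained with RoPE at a 4k context often struggle beyond that window. YaRN (Peng et al. 2023) rescales RoPE frequencies to extend context. The `build_yarn_rope_cache` function in this section implements only the simplest ingredient, a rescaled base (often called NTK-aware scaling); full YaRN additionally treats different frequency bands differently and adjusts the attention temperature.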
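What rescaling the base does to the individual rotation frequencies can be seen with a few lines of NumPy. This sketch covers only the base-rescaling formula used in this section, not full YaRN; `rope_theta` is a hypothetical helper, and the values of `d_k`, `base`, and `scale` mirror the defaults of `build_yarn_rope_cache`.

```python
import numpy as np

d_k, base, scale = 128, 10000.0, 16.0    # same values as the function's defaults

def rope_theta(base: float, d_k: int) -> np.ndarray:
    """Per-pair rotation frequencies theta_i = base^(-2i/d_k)."""
    return 1.0 / base ** (np.arange(0, d_k, 2) / d_k)

theta_old = rope_theta(base, d_k)
theta_new = rope_theta(base * scale ** (d_k / (d_k - 2)), d_k)

ratio = theta_old / theta_new            # how much each pair is slowed down
print("Fastest pair slowed by:", ratio[0])     # 1.0, unchanged
print("Slowest pair slowed by:", ratio[-1])    # approximately 16.0, i.e. scale
```

The exponent d_k / (d_k - 2) is chosen so that the slowest pair is slowed by exactly the context scale while the fastest pair is left alone. High-frequency pairs, which encode fine local order, stay close to what the model saw in training, and low-frequency pairs are stretched to cover the longer window.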
```python
import torch


def build_yarn_rope_cache(
    max_len: int,
    d_k: int,
    original_max_len: int = 4096,
    scale: float = 16.0,
    base: float = 10000.0,
):
    """
    Simplified context extension: rescale the RoPE base (NTK-aware scaling).
    This is the base-rescaling step only, not the full YaRN method.
    scale = new_context / original_context
    original_max_len is kept for reference and is not used in this version.
    """
    # Increase the base so low frequencies rotate more slowly;
    # the slowest pair ends up slowed by exactly `scale`
    new_base = base * (scale ** (d_k / (d_k - 2)))
    theta = 1.0 / (new_base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(max_len).float()
    freqs = torch.outer(positions, theta)       # (max_len, d_k/2)
    freqs = torch.cat([freqs, freqs], dim=-1)   # (max_len, d_k)
    return freqs.cos(), freqs.sin()


cos_yarn, sin_yarn = build_yarn_rope_cache(max_len=65536, d_k=128)
print("YaRN cache shape:", cos_yarn.shape)  # (65536, 128)
```
Key Takeaways
- Transformers need explicit position info because attention is permutation-invariant.
- Sinusoidal encodings are fixed, parameter-free, and theoretically extrapolatable.
- Learned encodings fit training data better but fail at unseen lengths.
- RoPE rotates Q and K so that their dot product reflects relative distance; it is the current standard in LLMs.
- ALiBi biases attention logits by distance, which is simple and effective for length generalisation.
- YaRN and similar methods allow extending the effective context window of RoPE models without full retraining.