Rotary Positional Encoding (RoPE)
How RoPE encodes position by rotating query and key vectors, why relative distance falls out naturally, and why it's become the standard in LLaMA and Mistral.
Core Idea
RoPE (Rotary Position Embedding, Su et al., 2021) encodes position by rotating the query and key vectors before computing attention scores. The rotation angle depends on position, so relative distance between positions appears naturally in the dot product.
Standard attention:
score(i, j) = q_i · k_j
RoPE attention:
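score(i, j) = (R_θ_i · q_i) · (R_θ_j · k_j)
= q_i · (R_(θ_j - θ_i) · k_j)
where R_θ_pos is the rotation matrix for position pos, with rotation angle θ_pos = pos · θ for frequency θ. The second line follows because rotation matrices are orthogonal: transposing R_θ_i onto the key side gives R_θ_iᵀ · R_θ_j = R_(θ_j - θ_i), a rotation by the angle difference alone.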
Key property: for fixed query and key content, the dot product depends on position only through the offset (i - j), not on i and j separately. This means attention scores encode relative distance, not absolute position.
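The relative-distance property can be checked numerically on a single dimension pair. The following is a minimal sketch using only the standard library; the helper names `rotate` and `rope_score`, the example vectors, and the frequency `theta=0.1` are illustrative assumptions, not part of any library.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, i, j, theta=0.1):
    """Dot product of q rotated to position i and k rotated to position j.

    `theta` is an arbitrary illustrative frequency for this one pair.
    """
    qi = rotate(q, i * theta)
    kj = rotate(k, j * theta)
    return qi[0] * kj[0] + qi[1] * kj[1]

q, k = (1.0, 2.0), (0.5, -1.5)

near = rope_score(q, k, i=3, j=7)       # offset j - i = 4
far = rope_score(q, k, i=103, j=107)    # same offset, shifted by 100 positions
other = rope_score(q, k, i=3, j=8)      # different offset (5)

print(abs(near - far) < 1e-9)   # same offset gives the same score
print(abs(near - other) > 1e-6) # different offset gives a different score
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, while changing the offset changes it.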
The Rotation Formulation
RoPE works on pairs of dimensions. For a d-dimensional vector, it groups dimensions into d/2 pairs and rotates each pair by an angle θ_k · pos:
For dimension pair (2k, 2k+1) at position pos:
[q_{2k}'] = [cos(pos·θ_k) -sin(pos·θ_k)] [q_{2k} ]
[q_{2k+1}'] [sin(pos·θ_k) cos(pos·θ_k)] [q_{2k+1}]
where θ_k = 1 / 10000^(2k/d) (same frequencies as sinusoidal)
Each dimension pair rotates at a different frequency. Low-indexed pairs rotate fast (capturing local position); high-indexed pairs rotate slowly (capturing global position).
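To make the frequency spread concrete, here is a small sketch that evaluates θ_k for a toy dimension d = 8 and converts each frequency to a wavelength (the number of positions for one full rotation). The helper name `rope_thetas` and the choice d = 8 are assumptions made for illustration.

```python
import math

def rope_thetas(d, base=10000.0):
    """Per-pair rotation frequencies theta_k = 1 / base^(2k/d), k = 0 .. d/2 - 1."""
    return [1.0 / base ** (2 * k / d) for k in range(d // 2)]

thetas = rope_thetas(8)
# Positions needed for one full 2*pi rotation of each pair
wavelengths = [2 * math.pi / t for t in thetas]

for k, (t, w) in enumerate(zip(thetas, wavelengths)):
    print(f"pair {k}: theta = {t:.4f}, wavelength = {w:.1f} positions")
# thetas are approximately [1.0, 0.1, 0.01, 0.001]
```

The first pair completes a rotation roughly every 6 positions, while the last pair needs several thousand, which is why the low-indexed pairs are sensitive to nearby tokens and the high-indexed pairs to coarse, long-range position.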
Code
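import math
import torch

def precompute_rope_freqs(d: int, max_len: int, base: float = 10000.0):
    # d is the per-head dimension and must be even
    theta = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d//2,)
    positions = torch.arange(max_len).float()                    # (max_len,)
    freqs = torch.outer(positions, theta)                        # (max_len, d//2)
    freqs_cos = torch.cos(freqs)                                 # (max_len, d//2)
    freqs_sin = torch.sin(freqs)                                 # (max_len, d//2)
    return freqs_cos, freqs_sin

def apply_rope(x: torch.Tensor, freqs_cos, freqs_sin) -> torch.Tensor:
    # x: (batch, seq_len, num_heads, head_dim)
    seq_len = x.shape[1]
    # Slice the tables to seq_len and reshape to (1, seq_len, 1, head_dim//2)
    # so they broadcast over the batch and head dimensions
    cos = freqs_cos[:seq_len].unsqueeze(0).unsqueeze(2)
    sin = freqs_sin[:seq_len].unsqueeze(0).unsqueeze(2)
    x1 = x[..., 0::2]  # even dims
    x2 = x[..., 1::2]  # odd dims
    # Rotate: apply [cos, -sin; sin, cos] rotation to each (even, odd) pair
    rotated_x1 = x1 * cos - x2 * sin
    rotated_x2 = x1 * sin + x2 * cos
    # Interleave back to (batch, seq_len, num_heads, head_dim)
    out = torch.stack([rotated_x1, rotated_x2], dim=-1)
    return out.flatten(-2)

# In attention (q, k, head_dim and max_len are assumed to exist already;
# q and k have shape (batch, seq_len, num_heads, head_dim)):
freqs_cos, freqs_sin = precompute_rope_freqs(head_dim, max_len)
q = apply_rope(q, freqs_cos, freqs_sin)
k = apply_rope(k, freqs_cos, freqs_sin)
# Move heads ahead of seq_len so the matmul runs over positions
q, k = q.transpose(1, 2), k.transpose(1, 2)  # (batch, num_heads, seq_len, head_dim)
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
Why RoPE Is Better Than Learned Absolute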
Learned absolute:
q_i · k_j depends on absolute positions i and j separately
Model must learn for each (i, j) pair what the distance means
Cannot extrapolate past max_len
RoPE:
The positional part of q_i · k_j depends only on the offset (i - j)
Relative distance is structurally encoded in the dot product
Better generalisation within context window
Extended by fine-tuning with longer sequences (YaRN, LongRoPE)
Context Length Extension with RoPE
One important property: RoPE's base parameter controls the period of position frequencies. By increasing the base (e.g., from 10000 to 500000), models can be fine-tuned to attend to longer contexts:
Original LLaMA 2: base=10000, max_len=4096
Code Llama: base=1000000, fine-tuned on 16K+ contexts
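YaRN (Peng et al., 2023): scale the frequencies with a factor s
θ_k → θ_k / s (slower rotation → longer effective range)
The scaling is applied non-uniformly: low-frequency pairs are interpolated, while high-frequency pairs are left largely unchanged
Allows extending 4K context → 128K with minimal fine-tuning
Where RoPE Is Used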
- LLaMA 1, LLaMA 2, LLaMA 3
- Mistral 7B, Mixtral 8x7B
- GPT-NeoX
- Falcon (some variants)
- Code Llama
- Most open-source LLMs released after 2022
Interview Answer
"RoPE encodes position by rotating query and key vectors by angles proportional to their position before the dot product. The rotation angle for dimension pair k is pos × θ_k where θ_k = 1/10000^(2k/d). The key property is that the dot product between rotated q_i and k_j depends only on the relative distance (i-j), not the absolute positions. This gives RoPE the structural advantage of relative positional encoding without changing the attention formula. It also extrapolates better than learned absolute embeddings and is now standard in LLaMA, Mistral, and most modern open-source LLMs."