Learnixo

Transformer Architecture Q&A · Lesson 14 of 23

RoPE: Rotary Position Embedding

Core Idea

RoPE (Rotary Position Embedding, Su et al., 2021) encodes position by rotating the query and key vectors before computing attention scores. The rotation angle depends on position, so relative distance between positions appears naturally in the dot product.

Standard attention:
  score(i, j) = q_i · k_j

RoPE attention:
  score(i, j) = (R_θ_i · q_i) · (R_θ_j · k_j)
              = q_i · (R_θ_j - R_θ_i · k_j)

where R_θ_pos is a rotation matrix parameterised by position and frequency θ.

Key property: the dot product depends only on (i - j), not on i and j separately.

This means attention scores encode relative distance, not absolute position.


The Rotation Formulation

RoPE works on pairs of dimensions. For a d-dimensional vector, it groups dimensions into d/2 pairs and rotates each pair by an angle θ_k · pos:

For dimension pair (2k, 2k+1) at position pos:

[q_{2k}']    = [cos(pos·θ_k)  -sin(pos·θ_k)] [q_{2k}  ]
[q_{2k+1}']    [sin(pos·θ_k)   cos(pos·θ_k)] [q_{2k+1}]

where θ_k = 1 / 10000^(2k/d)  (same frequencies as sinusoidal)

Each dimension pair rotates at a different frequency. Low-indexed pairs rotate fast (capturing local position); high-indexed pairs rotate slowly (capturing global position).


Code

Python
import torch
import torch.nn.functional as F

def precompute_rope_freqs(d: int, max_len: int, base: float = 10000.0):
    # d must be even
    theta = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    positions = torch.arange(max_len).float()
    freqs = torch.outer(positions, theta)          # (max_len, d//2)
    freqs_cos = torch.cos(freqs)                   # (max_len, d//2)
    freqs_sin = torch.sin(freqs)                   # (max_len, d//2)
    return freqs_cos, freqs_sin

def apply_rope(x: torch.Tensor, freqs_cos, freqs_sin) -> torch.Tensor:
    # x: (batch, seq_len, num_heads, head_dim)
    x1 = x[..., 0::2]   # even dims
    x2 = x[..., 1::2]   # odd dims

    # Rotate: apply [cos, -sin; sin, cos] rotation
    rotated_x1 = x1 * freqs_cos - x2 * freqs_sin
    rotated_x2 = x1 * freqs_sin + x2 * freqs_cos

    # Interleave back
    out = torch.stack([rotated_x1, rotated_x2], dim=-1)
    return out.flatten(-2)

# In attention:
q = apply_rope(q, freqs_cos, freqs_sin)
k = apply_rope(k, freqs_cos, freqs_sin)
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)

Why RoPE Is Better Than Learned Absolute

Learned absolute:
  q_i · k_j depends on absolute positions i and j separately
  Model must learn for each (i, j) pair what the distance means
  Cannot extrapolate past max_len

RoPE:
  q_i · k_j depends only on (i - j)
  Relative distance is structurally encoded in the dot product
  Better generalisation within context window
  Extended by fine-tuning with longer sequences (YaRN, LongRoPE)

Context Length Extension with RoPE

One important property: RoPE's base parameter controls the period of position frequencies. By increasing the base (e.g., from 10000 to 500000), models can be fine-tuned to attend to longer contexts:

Original LLaMA 2: base=10000, max_len=4096
Code Llama:       base=1000000, fine-tuned on 16K+ contexts

YaRN (Peng et al., 2023): scale the frequencies with a factor s
  θ_k → θ_k / s  (slower rotation → longer effective range)
  Allows extending 4K context → 128K with minimal fine-tuning

Where RoPE Is Used

  • LLaMA 1, LLaMA 2, LLaMA 3
  • Mistral 7B, Mixtral 8x7B
  • GPT-NeoX
  • Falcon (some variants)
  • Code Llama
  • Most open-source LLMs released after 2022

Interview Answer

"RoPE encodes position by rotating query and key vectors by angles proportional to their position before the dot product. The rotation angle for dimension pair k is pos × θ_k where θ_k = 1/10000^(2k/d). The key property is that the dot product between rotated q_i and k_j depends only on the relative distance (i-j), not the absolute positions. This gives RoPE the structural advantage of relative positional encoding without changing the attention formula. It also extrapolates better than learned absolute embeddings and is now standard in LLaMA, Mistral, and most modern open-source LLMs."