RoPE and ALiBi: Relative Position Encodings
How Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi) encode relative position, enabling length generalization beyond training context.
Why Relative Position Encodings
Learned absolute position embeddings fail when sequences exceed training length — there are no embeddings for positions beyond max_seq_len. Sinusoidal encodings generalize better but still use absolute position values that are out-of-distribution at inference.
Relative position encodings are more robust: they encode distance between tokens rather than where a token is. A model that learned "nearby tokens should attend more strongly" generalizes this at any length.
RoPE: Rotary Position Embeddings
RoPE (Su et al., 2021) applies a rotation to the query and key vectors before computing attention. The rotation angle depends on position — so when Q at position m and K at position n interact via dot product, the result automatically encodes the relative position (m - n).
The Math
For each pair of dimensions (2i, 2i+1) in the query/key vector, RoPE rotates the pair by the angle m × θ_i, where m is the token position:
Q'_{2i} = Q_{2i} × cos(m × θ_i) - Q_{2i+1} × sin(m × θ_i)
Q'_{2i+1} = Q_{2i} × sin(m × θ_i) + Q_{2i+1} × cos(m × θ_i)
where θ_i = 1 / base^(2i/d) for i = 0, 1, ..., d/2 - 1, d is the head dimension, and base = 10000 (or 500000 in LLaMA-3).
When we compute Q'_m · K'_n, the rotation angles for m and n combine into a rotation by (m-n) × θ_i — encoding relative position.
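The relative-position property can be checked numerically on a single dimension pair. The sketch below uses plain Python rather than PyTorch; the helper names `rotate_pair` and `rope_score` and the sample vectors are made up for illustration.

```python
import math


def rotate_pair(x, y, angle):
    """Rotate the 2-D vector (x, y) counter-clockwise by `angle` radians."""
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def rope_score(q, k, m, n, theta):
    """Dot product of q rotated to position m and k rotated to position n."""
    qx, qy = rotate_pair(q[0], q[1], m * theta)
    kx, ky = rotate_pair(k[0], k[1], n * theta)
    return qx * kx + qy * ky


q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1
a = rope_score(q, k, m=5, n=2, theta=theta)      # offset 3
b = rope_score(q, k, m=105, n=102, theta=theta)  # same offset, shifted by 100
c = rope_score(q, k, m=5, n=4, theta=theta)      # different offset
print(abs(a - b) < 1e-9, abs(a - c) < 1e-9)  # True False
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it. A full RoPE head repeats this for every dimension pair, each with its own θ_i.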
```python
import torch


def precompute_rope_frequencies(
    head_dim: int,
    max_seq_len: int,
    base: float = 10000.0,
    device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor]:
    """Precompute cos and sin tables for RoPE."""
    # Frequency for each pair of dimensions: θ_i = 1 / base^(2i / head_dim)
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    # Position indices [0, 1, 2, ..., max_seq_len-1]
    positions = torch.arange(max_seq_len, device=device).float()
    # Outer product: each position × each frequency
    freqs = torch.outer(positions, theta)  # (max_seq_len, head_dim/2)
    # Duplicate so that dimension j and dimension j + head_dim/2 share an angle
    freqs = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, head_dim)
    return freqs.cos(), freqs.sin()


def apply_rope(
    x: torch.Tensor,    # (..., head_dim); cos and sin must broadcast against it
    cos: torch.Tensor,  # (seq_len, head_dim), reshaped by the caller to broadcast
    sin: torch.Tensor,
) -> torch.Tensor:
    """Apply rotary position embeddings to a query or key tensor."""
    # "Rotate-half" layout: dimension j is paired with dimension j + head_dim/2.
    # This is a fixed permutation of the adjacent-pair form in the equations above
    # and matches the [freqs, freqs] layout of the cos/sin tables.
    half = x.shape[-1] // 2
    x1 = x[..., :half]  # First half of each head
    x2 = x[..., half:]  # Second half of each head
    x_rotated = torch.cat([-x2, x1], dim=-1)
    # Apply rotation: x * cos + rotate(x) * sin
    return x * cos + x_rotated * sin


# Example usage
head_dim = 128
max_seq_len = 8192
cos, sin = precompute_rope_frequencies(head_dim, max_seq_len, base=500000.0)

# In attention: apply to Q and K before the dot product
batch, seq_len, n_heads = 2, 100, 32
q = torch.randn(batch, seq_len, n_heads, head_dim)
k = torch.randn(batch, seq_len, n_heads, head_dim)
# q and k are (batch, seq, heads, head_dim), so add a heads axis: (seq, 1, head_dim)
q_rope = apply_rope(q, cos[:seq_len].unsqueeze(1), sin[:seq_len].unsqueeze(1))
k_rope = apply_rope(k, cos[:seq_len].unsqueeze(1), sin[:seq_len].unsqueeze(1))
```
RoPE Context Length Extension
The base frequency determines how quickly the rotation completes one full cycle. A higher base means slower rotation — effectively lower frequencies that can represent longer ranges:
```python
def ntk_scaling(
    original_base: float,
    scale_factor: float,
    head_dim: int,
) -> float:
    """
    NTK-aware scaling: adjust base to support scale_factor × longer context.
    Used to extend RoPE beyond training length.
    """
    return original_base * (scale_factor ** (head_dim / (head_dim - 2)))


# LLaMA-2: base=10000, max_seq=4096
# To support 32k context (8× extension):
new_base = ntk_scaling(10000.0, 8.0, 128)
print(f"NTK base for 32k context: {new_base:.0f}")  # ~82,685
# LLaMA-3 uses base=500000 natively (accounts for extended training)
```
YaRN (Yet another RoPE extensioN method): More sophisticated than NTK-aware scaling. It applies different scaling factors to different frequency components: high frequencies (local attention) are stretched less than low frequencies (global attention). It has been used in long-context fine-tunes such as Nous Research's Yarn-Mistral models.
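The per-frequency scaling described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not a full YaRN implementation: it omits YaRN's attention-temperature adjustment, the function name `yarn_style_frequencies` is hypothetical, and the ramp thresholds `alpha=1` and `beta=32` are assumed defaults that should be checked against the YaRN paper for a given model.

```python
import math


def yarn_style_frequencies(head_dim, base, scale, original_max_len,
                           alpha=1.0, beta=32.0):
    """Blend original and interpolated RoPE frequencies per dimension pair."""
    blended = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)
        wavelength = 2 * math.pi / theta
        # Number of full turns this pair completes within the training context
        rotations = original_max_len / wavelength
        # ramp = 1: keep the frequency as-is; ramp = 0: divide it by `scale`
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        blended.append((1 - ramp) * theta / scale + ramp * theta)
    return blended


freqs = yarn_style_frequencies(head_dim=128, base=10000.0, scale=8.0,
                               original_max_len=4096)
original = [10000.0 ** (-2 * i / 128) for i in range(64)]
print(freqs[0] / original[0])    # highest frequency is left unchanged: 1.0
print(freqs[-1] / original[-1])  # lowest frequency is fully stretched: 0.125
```

Pairs that complete many rotations within the training context keep their original frequency, pairs that never complete a full rotation are slowed by the full scale factor, and pairs in between are blended linearly.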
ALiBi: Attention with Linear Biases
ALiBi (Press et al., 2022) takes a completely different approach: instead of modifying the query and key vectors, it adds a fixed, non-learned penalty proportional to distance directly to the attention scores before softmax.
attention_score(i, j) = (Q_i · K_j) / sqrt(d_k) - slope × |i - j|
Closer tokens get a smaller penalty; farther tokens get a larger one. The bias is never positive (it is zero only at distance 0), so attention naturally focuses more on nearby tokens.
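To make the bias concrete, here is a minimal plain-Python sketch that builds the causal bias matrix for a four-token sequence. The function name `alibi_bias_matrix` and the slope of 0.5 are illustrative choices; masked future positions are shown as `None`.

```python
def alibi_bias_matrix(seq_len, slope):
    """Causal ALiBi bias: -slope * (i - j) for j <= i, None for future positions."""
    return [
        [slope * (j - i) if j <= i else None for j in range(seq_len)]
        for i in range(seq_len)
    ]


bias = alibi_bias_matrix(seq_len=4, slope=0.5)
for row in bias:
    print(row)
# [0.0, None, None, None]
# [-0.5, 0.0, None, None]
# [-1.0, -0.5, 0.0, None]
# [-1.5, -1.0, -0.5, 0.0]
```

Each row is a query position and each column a key position. The penalty grows by one slope unit for every step back in the sequence.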
```python
import math

import torch
import torch.nn.functional as F


def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    """
    Compute per-head slopes for ALiBi.
    Different heads use different slopes → different attention ranges.
    """
    # For n heads (n a power of 2), head k gets slope 2^(-8k/n), k = 1..n
    # e.g. n = 8 gives 1/2, 1/4, ..., 1/256
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio ** i for i in range(n)]

    if math.log2(num_heads).is_integer():
        slopes = get_slopes_power_of_2(num_heads)
    else:
        # For non-power-of-2 num_heads: closest power of 2 + interpolation
        n = 2 ** math.floor(math.log2(num_heads))
        slopes = get_slopes_power_of_2(n)
        slopes_rest = get_slopes_power_of_2(2 * n)[0::2][: num_heads - n]
        slopes = slopes + slopes_rest
    return torch.tensor(slopes, dtype=torch.float32)


def alibi_attention(
    q: torch.Tensor,       # (batch, heads, seq_len, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    slopes: torch.Tensor,  # (num_heads,)
) -> torch.Tensor:
    """Attention with ALiBi biases."""
    batch, heads, seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    # Standard attention scores
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, heads, seq, seq)
    # Causal mask
    mask = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    # ALiBi bias: -slope × distance
    # For causal attention, |i - j| is just (i - j) for i >= j
    positions = torch.arange(seq_len, device=q.device)
    distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # (seq, seq), entry [i, j] = j - i
    distance = torch.clamp(distance, max=0).abs()  # Future positions get distance 0 (masked anyway)
    # Apply per-head slopes: (heads, 1, 1) × (1, seq, seq)
    alibi_bias = -slopes.to(q.device).view(heads, 1, 1) * distance.unsqueeze(0)  # (heads, seq, seq)
    scores = scores + alibi_bias.unsqueeze(0)  # Broadcast over batch
    # Softmax and output
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)


# Example
num_heads = 8
slopes = get_alibi_slopes(num_heads)
print("ALiBi slopes:", slopes.tolist())
# Different heads attend at different ranges: first head attends locally, last head more globally
```
RoPE vs ALiBi Comparison
| Property | RoPE | ALiBi |
|---|---|---|
| Where applied | Rotates Q and K vectors | Additive bias on attention scores |
| Learnable parameters | No | No |
| Length extrapolation | Needs scaling (NTK/YaRN) | Extrapolates well beyond training length |
| Quality at training length | Excellent | Excellent |
| Quality beyond training length | Good with scaling | Good (bias keeps decaying) |
| Used by | LLaMA, Mistral, Qwen, Gemma | MPT, BLOOM |
| Multi-query compatibility | Yes | Yes |
ALiBi's extrapolation advantage: Because ALiBi uses a simple linear penalty, it naturally discourages attention to very distant tokens — at any sequence length, not just those seen in training. The model learns "nearby tokens matter more" and this principle holds at any scale.
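A small plain-Python experiment illustrates this length independence. It makes one strong simplifying assumption: all content scores (Q · K) are equal, so only the ALiBi bias shapes the softmax. The function name `mass_on_nearest` and the slope and window values are illustrative.

```python
import math


def mass_on_nearest(seq_len, slope, window):
    """Fraction of softmax weight the last query puts on its `window` nearest keys,
    assuming equal content scores so that only the ALiBi bias matters."""
    weights = [math.exp(-slope * distance) for distance in range(seq_len)]
    return sum(weights[:window]) / sum(weights)


short_ctx = mass_on_nearest(seq_len=128, slope=0.25, window=16)
long_ctx = mass_on_nearest(seq_len=4096, slope=0.25, window=16)
print(round(short_ctx, 4), round(long_ctx, 4))  # 0.9817 0.9817
```

With a slope of 0.25, about 98% of the attention mass stays on the 16 nearest tokens whether the sequence has 128 or 4096 tokens. Heads with very small slopes are less clear-cut under this toy model, because their penalty has barely decayed by the end of a short sequence.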
RoPE's quality advantage: At the training length, RoPE generally outperforms ALiBi — the rotation-based encoding carries richer positional information than a scalar bias.
Practical: Implementing RoPE in a Full Attention Layer
```python
import torch
import torch.nn.functional as F


class RoPEAttention(torch.nn.Module):
    def __init__(self, dim: int, num_heads: int, max_seq_len: int = 8192, rope_base: float = 500000.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.wo = torch.nn.Linear(dim, dim, bias=False)
        # Precompute frequencies (cached, not parameters)
        cos, sin = precompute_rope_frequencies(self.head_dim, max_seq_len, rope_base)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape
        # Project and reshape to (batch, heads, seq, head_dim)
        q = self.wq(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Apply RoPE to Q and K
        cos = self.cos[:seq_len].unsqueeze(0).unsqueeze(0)  # (1, 1, seq, head_dim)
        sin = self.sin[:seq_len].unsqueeze(0).unsqueeze(0)
        q = apply_rope(q, cos, sin)
        k = apply_rope(k, cos, sin)
        # Fused attention kernel when available (requires PyTorch 2.0 or later)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, dim)
        return self.wo(out)
```
The LLaMA family (LLaMA-1 through LLaMA-3), Mistral, Qwen-2, and most modern open-source models all use this RoPE-based pattern with minor variations in base frequency and scaling strategy.