RoPE and ALiBi: Relative Position Encodings

Why Relative Position Encodings

Learned absolute position embeddings fail when sequences exceed the training length: there are no embeddings for positions beyond max_seq_len. Sinusoidal encodings can be computed for any position, but positions beyond the training range still produce absolute values the model never saw during training, so they are out-of-distribution at inference.

Relative position encodings are more robust: they encode distance between tokens rather than where a token is. A model that learned "nearby tokens should attend more strongly" generalizes this at any length.

RoPE: Rotary Position Embeddings

RoPE (Su et al., 2021) applies a rotation to the query and key vectors before computing attention. The rotation angle depends on position — so when Q at position m and K at position n interact via dot product, the result automatically encodes the relative position (m - n).

The Math

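RoPE splits each query/key vector into d/2 pairs of dimensions. For pair i (dimensions 2i and 2i+1, with i = 0, ..., d/2 - 1) at position m, it rotates the pair by the angle m × θ_i:

Q'_{2i} = Q_{2i} × cos(m × θ_i) - Q_{2i+1} × sin(m × θ_i)
Q'_{2i+1} = Q_{2i} × sin(m × θ_i) + Q_{2i+1} × cos(m × θ_i)

where θ_i = 1 / base^(2i/d), d is the head dimension, and base = 10000 (or 500000 in LLaMA-3).

When we take the dot product of a rotated query at position m with a rotated key at position n, the two rotations combine into a single rotation by (m - n) × θ_i, so the attention score depends only on the relative position.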
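A quick numeric check can make this concrete. The sketch below uses a single 2-D pair and plain Python; the helper names `rotate`, `dot`, and `rope_score`, and the value of `theta`, are illustrative assumptions rather than part of any library.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1       # illustrative frequency for one dimension pair
q = (0.3, -1.2)   # arbitrary query pair
k = (0.7, 0.5)    # arbitrary key pair

def rope_score(m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))

# Same offset (m - n = 2) at very different absolute positions
score_near = rope_score(3, 1)
score_far = rope_score(1003, 1001)

# A different offset (m - n = 3) for comparison
score_other = rope_score(3, 0)

print(score_near, score_far, score_other)
```

The first two scores agree up to floating-point error, while the third differs, which is the relative-position property in miniature.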

Python

import torch
import math

def precompute_rope_frequencies(
    head_dim: int,
    max_seq_len: int,
    base: float = 10000.0,
    device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor]:
    """Precompute cos and sin for RoPE."""
    # Frequency for each pair of dimensions
    # θ_i = 1 / base^(2i / head_dim)
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))

    # Position indices [0, 1, 2, ..., max_seq_len-1]
    positions = torch.arange(max_seq_len, device=device).float()

    # Outer product: each position × each frequency
    freqs = torch.outer(positions, theta)  # (max_seq_len, head_dim/2)

    # Concatenate for paired dimensions
    freqs = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, head_dim)

    return freqs.cos(), freqs.sin()

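def apply_rope(
    x: torch.Tensor,  # (batch, seq_len, num_heads, head_dim) or (batch, num_heads, seq_len, head_dim)
    cos: torch.Tensor,  # must broadcast against x
    sin: torch.Tensor,  # must broadcast against x
) -> torch.Tensor:
    """Apply rotary position embeddings to a query or key tensor.

    Uses the "rotate half" layout: dimension i is paired with dimension
    i + head_dim/2, matching the cat([freqs, freqs]) layout of cos/sin.
    This differs from interleaved pairs (2i, 2i+1) only by a fixed
    permutation of dimensions, so the relative-position property is unchanged.
    """
    half = x.shape[-1] // 2
    x1 = x[..., :half]  # First half of each head vector
    x2 = x[..., half:]  # Second half

    # rotate(x) = [-x2, x1], so each pair (x1_i, x2_i) is rotated by m × θ_i
    x_rotated = torch.cat([-x2, x1], dim=-1)

    # Apply rotation: x * cos + rotate(x) * sin
    return x * cos + x_rotated * sin

# Example usage
head_dim = 128
max_seq_len = 8192

cos, sin = precompute_rope_frequencies(head_dim, max_seq_len, base=500000.0)

# In attention: apply to Q and K before dot product
batch, seq_len, n_heads = 2, 100, 32
q = torch.randn(batch, seq_len, n_heads, head_dim)
k = torch.randn(batch, seq_len, n_heads, head_dim)

# Add a heads axis: (seq_len, 1, head_dim) broadcasts against (batch, seq_len, n_heads, head_dim)
cos_b = cos[:seq_len].unsqueeze(1)
sin_b = sin[:seq_len].unsqueeze(1)

q_rope = apply_rope(q, cos_b, sin_b)
k_rope = apply_rope(k, cos_b, sin_b)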

RoPE Context Length Extension

The base frequency determines how quickly the rotation completes one full cycle. A higher base means slower rotation — effectively lower frequencies that can represent longer ranges:

Python

def ntk_scaling(
    original_base: float,
    scale_factor: float,
    head_dim: int,
) -> float:
    """
    NTK-aware scaling: adjust base to support scale_factor × longer context.
    Used to extend RoPE beyond training length.
    """
    return original_base * (scale_factor ** (head_dim / (head_dim - 2)))

# LLaMA-2: base=10000, max_seq=4096
# To support 32k context (8× extension):
new_base = ntk_scaling(10000.0, 8.0, 128)
print(f"NTK base for 32k context: {new_base:.0f}")  # roughly 82,700

# LLaMA-3 uses base=500000 natively (accounts for extended training)
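
To see what a larger base does, it helps to look at the wavelength of each dimension pair, meaning the number of positions needed for one full rotation, 2π / θ_i. The sketch below is a stdlib-only illustration; `rope_wavelengths` is a hypothetical helper name, and head_dim = 128 is assumed to match the examples above.

```python
import math

def rope_wavelengths(head_dim: int, base: float) -> list[float]:
    """Positions per full rotation for each dimension pair: 2π / θ_i."""
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]

for base in (10_000.0, 500_000.0):
    w = rope_wavelengths(128, base)
    print(f"base={base:>9.0f}  shortest={w[0]:.2f}  longest={w[-1]:,.0f}")
```

The shortest wavelength is 2π positions regardless of the base. The longest grows from roughly 54,000 positions at base 10000 to roughly 2.6 million at base 500000, which is why a larger base leaves more room for long contexts.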

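YaRN (Yet another RoPE extensioN method): More sophisticated than NTK-aware scaling. It applies different scaling factors to different frequency components: high frequencies (which resolve local token order) are stretched little or not at all, while low frequencies (which span long ranges) are interpolated more strongly. The YaRN authors used it to fine-tune LLaMA-2 and Mistral models to longer context windows.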
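The per-frequency idea can be sketched as follows. This is a simplified illustration of the blended interpolation described in the YaRN paper, not a reference implementation: the function name `yarn_style_frequencies` is hypothetical, the thresholds `alpha=1` and `beta=32` are assumed defaults, and YaRN's additional attention-temperature adjustment is left out.

```python
import math

def yarn_style_frequencies(
    head_dim: int,
    base: float,
    scale_factor: float,
    original_max_len: int,
    alpha: float = 1.0,   # assumed threshold: below this, interpolate fully
    beta: float = 32.0,   # assumed threshold: above this, leave unchanged
) -> list[float]:
    """Blend original and position-interpolated frequencies per dimension pair."""
    scaled = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)
        # Full rotations this pair completes within the original context
        rotations = original_max_len * theta / (2 * math.pi)
        # ramp = 0 -> fully interpolated (theta / scale_factor), ramp = 1 -> unchanged
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        scaled.append((1 - ramp) * theta / scale_factor + ramp * theta)
    return scaled

freqs = yarn_style_frequencies(128, 10_000.0, 8.0, 4096)
original = [10_000.0 ** (-2 * i / 128) for i in range(64)]
print(freqs[0] / original[0])    # highest frequency: ratio 1.0 (untouched)
print(freqs[-1] / original[-1])  # lowest frequency: ratio 0.125 (divided by 8)
```

Pairs in between receive a linear blend of the two, so local detail is preserved while long-range components are compressed into the trained range.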

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2022) takes a completely different approach: instead of modifying embeddings, add a fixed negative bias proportional to distance directly to the attention scores before softmax.

attention_score(i, j) = (Q_i · K_j) / sqrt(d_k) - slope × |i - j|

Closer tokens get a smaller penalty; farther tokens get a larger one. The bias is never positive (zero at distance 0, increasingly negative with distance), so attention naturally focuses more on nearby tokens.
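As a small worked example, the sketch below builds the causal bias matrix for a 4-token sequence with an illustrative slope of 0.5. The helper name `alibi_bias_matrix` is hypothetical. Future positions are left at 0.0 here on the assumption that a causal mask removes them anyway.

```python
def alibi_bias_matrix(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to score(i, j): slope * (j - i) for j <= i, else 0.0."""
    return [
        [slope * (j - i) if j <= i else 0.0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in alibi_bias_matrix(4, slope=0.5):
    print(row)
# [0.0, 0.0, 0.0, 0.0]
# [-0.5, 0.0, 0.0, 0.0]
# [-1.0, -0.5, 0.0, 0.0]
# [-1.5, -1.0, -0.5, 0.0]
```

Each row is a query position and each column a key position, so the penalty grows linearly as the key moves further into the past.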

Python

import torch
import torch.nn.functional as F
import math

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    """
    Compute per-head slopes for ALiBi.
    Different heads use different slopes → different attention ranges.
    """
    # For n heads (n a power of 2), the slopes form a geometric sequence:
    # m_k = 2^(-8k/n) for k = 1, 2, ..., n  (e.g. 1/2, 1/4, ..., 1/256 for n = 8)
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio ** i for i in range(n)]

    if math.log2(num_heads).is_integer():
        slopes = get_slopes_power_of_2(num_heads)
    else:
        # For non-power-of-2 num_heads: closest power of 2 + interpolation
        n = 2 ** math.floor(math.log2(num_heads))
        slopes = get_slopes_power_of_2(n)
        slopes_rest = get_slopes_power_of_2(2 * n)[0::2][: num_heads - n]
        slopes = slopes + slopes_rest

    return torch.tensor(slopes, dtype=torch.float32)

def alibi_attention(
    q: torch.Tensor,  # (batch, heads, seq_len, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    slopes: torch.Tensor,  # (num_heads,)
) -> torch.Tensor:
    """Attention with ALiBi biases."""
    batch, heads, seq_len, head_dim = q.shape
    scale = head_dim ** -0.5

    # Standard attention scores
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, heads, seq, seq)

    # Causal mask
    mask = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))

    # ALiBi bias: -slope × distance
    # Distance matrix: |position_i - position_j| for causal attention is just (i - j) for i >= j
    positions = torch.arange(seq_len, device=q.device)
    distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # (seq, seq)
    distance = torch.clamp(distance, max=0).abs()  # Causal: future positions have distance 0 (masked anyway)

    # Apply per-head slopes: (heads, 1, 1) × (1, seq, seq)
    alibi_bias = -slopes.view(heads, 1, 1) * distance.unsqueeze(0)  # (heads, seq, seq)
    scores = scores + alibi_bias.unsqueeze(0)  # Broadcast over batch

    # Softmax and output
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Example
num_heads = 8
slopes = get_alibi_slopes(num_heads)
print("ALiBi slopes:", slopes.tolist())
# Different heads attend at different ranges: first head attends locally, last head more globally

RoPE vs ALiBi Comparison

| Property | RoPE | ALiBi |
|---|---|---|
| Where applied | Rotates Q and K vectors | Additive bias on attention scores |
| Learnable parameters | No | No |
| Length extrapolation | Needs scaling (NTK/YaRN) | Extrapolates well beyond training length |
| Quality at training length | Excellent | Excellent |
| Quality beyond training length | Good with scaling | Good (bias keeps decaying) |
| Used by | LLaMA, Mistral, Qwen, Gemma | MPT, BLOOM, Falcon-RW |
| Multi-query compatibility | Yes | Yes |

ALiBi's extrapolation advantage: Because ALiBi uses a simple linear penalty, it naturally discourages attention to very distant tokens — at any sequence length, not just those seen in training. The model learns "nearby tokens matter more" and this principle holds at any scale.
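One way to see this is to ignore content entirely (all Q·K scores equal) and look at how much attention mass the bias alone places on recent tokens. The sketch below is illustrative only: `mass_on_recent` is a hypothetical helper, the two slopes are the steepest and shallowest values for 8 heads, and real attention also depends on the learned content scores.

```python
import math

def mass_on_recent(seq_len: int, slope: float, window: int) -> float:
    """Softmax mass on the `window` nearest tokens when content scores are all equal."""
    # d is the distance from the query back to each earlier token
    weights = [math.exp(-slope * d) for d in range(seq_len)]
    return sum(weights[:window]) / sum(weights)

for slope in (0.5, 1 / 256):
    for seq_len in (1024, 8192):
        share = mass_on_recent(seq_len, slope, window=64)
        print(f"slope={slope:.4f}  seq_len={seq_len}  mass on nearest 64 tokens={share:.3f}")
```

With the steep slope, nearly all of the mass stays on the nearest tokens at both lengths. With the shallow slope, the share changes only slightly between the two lengths, because tokens much further away than 1/slope contribute almost nothing.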

RoPE's quality advantage: At the training length, RoPE generally outperforms ALiBi — the rotation-based encoding carries richer positional information than a scalar bias.

Practical: Implementing RoPE in a Full Attention Layer

Python

class RoPEAttention(torch.nn.Module):
    def __init__(self, dim: int, num_heads: int, max_seq_len: int = 8192, rope_base: float = 500000.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads

        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.wo = torch.nn.Linear(dim, dim, bias=False)

        # Precompute frequencies (cached, not parameters)
        cos, sin = precompute_rope_frequencies(self.head_dim, max_seq_len, rope_base)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape

        q = self.wq(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE to Q and K
        cos = self.cos[:seq_len].unsqueeze(0).unsqueeze(0)  # (1, 1, seq, head_dim)
        sin = self.sin[:seq_len].unsqueeze(0).unsqueeze(0)

        q = apply_rope(q, cos, sin)
        k = apply_rope(k, cos, sin)

        # Flash attention (or standard)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, dim)
        return self.wo(out)

The LLaMA family (LLaMA-1 through LLaMA-3), Mistral, Qwen-2, and most modern open-source models all use this RoPE-based pattern with minor variations in base frequency and scaling strategy.
