RoPE and ALiBi: Relative Position Encodings

How Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi) encode relative position, enabling length generalization beyond training context.

Asma Hafeez Khan · May 16, 2026 · 7 min read
Tags: Transformers, RoPE, ALiBi, Positional Encoding

Why Relative Position Encodings

Learned absolute position embeddings fail when a sequence exceeds the training length: the embedding table has no rows for positions at or beyond max_seq_len. Sinusoidal encodings can be computed for any position, but the model has never seen the values they produce past the training length, so those inputs are out-of-distribution at inference.
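
A minimal sketch of the first failure mode, assuming the learned position table is a torch.nn.Embedding. The name pos_emb and the toy sizes are illustrative and not taken from any particular model. On CPU, looking up a position past the end of the table raises an IndexError:

```python
import torch

max_seq_len, dim = 8, 4  # toy sizes, chosen only for illustration
pos_emb = torch.nn.Embedding(max_seq_len, dim)  # one learned row per position

# Positions 0..max_seq_len-1 each have a row in the table.
in_range = pos_emb(torch.arange(max_seq_len))
print(in_range.shape)  # torch.Size([8, 4])

# Position max_seq_len has no row, so the lookup cannot succeed.
try:
    pos_emb(torch.tensor([max_seq_len]))
    lookup_failed = False
except IndexError:
    lookup_failed = True
print("lookup past max_seq_len failed:", lookup_failed)
```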

Relative position encodings are more robust: they encode distance between tokens rather than where a token is. A model that learned "nearby tokens should attend more strongly" generalizes this at any length.


RoPE: Rotary Position Embeddings

RoPE (Su et al., 2021) applies a rotation to the query and key vectors before computing attention. The rotation angle depends on position — so when Q at position m and K at position n interact via dot product, the result automatically encodes the relative position (m - n).

The Math

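For the i-th pair of dimensions (2i, 2i+1) of a query or key vector at position m, RoPE rotates the pair by the angle m × θ_i:

Q'_{2i} = Q_{2i} × cos(m × θ_i) - Q_{2i+1} × sin(m × θ_i)
Q'_{2i+1} = Q_{2i} × sin(m × θ_i) + Q_{2i+1} × cos(m × θ_i)

where θ_i = 1 / base^(2i/d) for i = 0, ..., d/2 - 1, d is the head dimension, and base = 10000 (or 500000 in LLaMA-3). Keys are rotated the same way, using their own position n.

When we take the dot product of a query rotated to position m with a key rotated to position n, the two rotations of each pair combine into one: the result is the same as rotating the query pair by (m - n) × θ_i and leaving the key unrotated. The score therefore depends on the two positions only through their offset, which is how RoPE encodes relative position.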
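
The relative-position property can be checked numerically on a single dimension pair. This is a small sketch in plain Python; rotate_pair and rope_score are hypothetical helper names and the vectors and θ value are arbitrary:

```python
import math

def rotate_pair(x0: float, x1: float, angle: float) -> tuple[float, float]:
    """Rotate the 2-D point (x0, x1) counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

def rope_score(q, k, m: int, n: int, theta: float) -> float:
    """Dot product of q rotated to position m with k rotated to position n."""
    qm = rotate_pair(q[0], q[1], m * theta)
    kn = rotate_pair(k[0], k[1], n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1  # arbitrary illustrative values

# Same offset m - n = 3 at very different absolute positions.
near = rope_score(q, k, m=5, n=2, theta=theta)
far = rope_score(q, k, m=1005, n=1002, theta=theta)
print(abs(near - far) < 1e-9)  # True

# A different offset (m - n = 1) gives a different score.
other = rope_score(q, k, m=5, n=4, theta=theta)
print(abs(near - other) > 1e-3)  # True
```

Shifting both positions by 1000 leaves the score unchanged, while changing the offset changes it.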

Python
import torch
import math

def precompute_rope_frequencies(
    head_dim: int,
    max_seq_len: int,
    base: float = 10000.0,
    device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor]:
    """Precompute cos and sin for RoPE."""
    # Frequency for each pair of dimensions
    # θ_i = 1 / base^(2i / head_dim)
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))

    # Position indices [0, 1, 2, ..., max_seq_len-1]
    positions = torch.arange(max_seq_len, device=device).float()

    # Outer product: each position × each frequency
    freqs = torch.outer(positions, theta)  # (max_seq_len, head_dim/2)

    # Concatenate for paired dimensions
    freqs = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, head_dim)

    return freqs.cos(), freqs.sin()

def apply_rope(
    x: torch.Tensor,  # (..., head_dim), e.g. (batch, seq_len, num_heads, head_dim)
    cos: torch.Tensor,  # Broadcastable to x, last dim = head_dim
    sin: torch.Tensor,  # Same shape as cos
) -> torch.Tensor:
    """Apply rotary position embeddings to a query or key tensor."""
    # cos/sin are laid out as [freqs, freqs], so dimension i is paired with
    # dimension i + head_dim/2 (the "rotate-half" layout). This is the same
    # rotation as the adjacent-pair equations, up to a fixed permutation of dims.
    half = x.shape[-1] // 2
    x1 = x[..., :half]  # First half of each head
    x2 = x[..., half:]  # Second half of each head

    # rotate(x) = [-x2, x1], matching the [freqs, freqs] layout of cos/sin
    x_rotated = torch.cat([-x2, x1], dim=-1)

    # Apply rotation: x * cos + rotate(x) * sin
    return x * cos + x_rotated * sin

# Example usage
head_dim = 128
max_seq_len = 8192

cos, sin = precompute_rope_frequencies(head_dim, max_seq_len, base=500000.0)

# In attention: apply to Q and K before dot product
batch, seq_len, n_heads = 2, 100, 32
q = torch.randn(batch, seq_len, n_heads, head_dim)
k = torch.randn(batch, seq_len, n_heads, head_dim)

# Add a singleton heads axis: (seq_len, 1, head_dim) broadcasts over
# (batch, seq_len, num_heads, head_dim)
q_rope = apply_rope(q, cos[:seq_len].unsqueeze(1), sin[:seq_len].unsqueeze(1))
k_rope = apply_rope(k, cos[:seq_len].unsqueeze(1), sin[:seq_len].unsqueeze(1))
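
The equations in The Math section rotate adjacent dimension pairs, and that layout can also be written as a complex multiplication. The following is a minimal sketch under that assumption, for a single head with input shape (seq_len, head_dim); rope_interleaved is a hypothetical helper name and is independent of the functions above:

```python
import torch

def rope_interleaved(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate adjacent pairs (2i, 2i+1) of x, shape (seq_len, head_dim), by position."""
    seq_len, head_dim = x.shape
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), theta)  # (seq_len, head_dim/2)
    rotation = torch.polar(torch.ones_like(angles), angles)  # unit complex numbers e^(i*angle)

    # View each adjacent pair as one complex number, rotate it, and view back as reals.
    pairs = torch.view_as_complex(x.float().reshape(seq_len, head_dim // 2, 2))
    return torch.view_as_real(pairs * rotation).reshape(seq_len, head_dim)

torch.manual_seed(0)
seq_len, head_dim = 16, 8
# Use the same content vector at every position so that only position varies.
q = torch.randn(head_dim).repeat(seq_len, 1)
k = torch.randn(head_dim).repeat(seq_len, 1)

scores = rope_interleaved(q) @ rope_interleaved(k).T  # scores[m, n]

# Equal offsets (m - n = 3) give equal scores, and rotation preserves vector norms.
print(torch.allclose(scores[5, 2], scores[13, 10], atol=1e-4))  # True
print(torch.allclose(rope_interleaved(q).norm(dim=-1), q.norm(dim=-1), atol=1e-4))  # True
```

Both layouts implement the same idea. What matters in practice is that the layout used at inference matches the one the model weights were trained with.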

RoPE Context Length Extension

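The base sets the rotation speed of every dimension pair: pair i completes one full cycle every 2π / θ_i = 2π × base^(2i/d) positions. A higher base slows the rotation, giving longer wavelengths that can distinguish positions over a longer range. NTK-aware scaling uses this to extend a trained model's context by raising the base: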

Python
def ntk_scaling(
    original_base: float,
    scale_factor: float,
    head_dim: int,
) -> float:
    """
    NTK-aware scaling: adjust base to support scale_factor × longer context.
    Used to extend RoPE beyond training length.
    """
    return original_base * (scale_factor ** (head_dim / (head_dim - 2)))

# LLaMA-2: base=10000, max_seq=4096
# To support 32k context (8× extension):
new_base = ntk_scaling(10000.0, 8.0, 128)
print(f"NTK base for 32k context: {new_base:.0f}")  # ~82,685

# LLaMA-3 uses base=500000 natively (accounts for extended training)
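
To make the effect of the base concrete, the wavelength of each dimension pair (the number of positions per full rotation, 2π / θ_i) can be computed directly. A small sketch; rope_wavelengths is a hypothetical helper name:

```python
import math

def rope_wavelengths(head_dim: int, base: float) -> list[float]:
    """Positions needed for each dimension pair to complete one full rotation."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

for base in (10_000.0, 500_000.0):
    w = rope_wavelengths(128, base)
    print(f"base={base:>9.0f}  shortest={w[0]:.2f}  longest={w[-1]:,.0f}")
```

With head_dim = 128 the shortest wavelength is 2π (about 6.28 positions) for any base, while the longest grows from roughly 54,000 positions at base 10,000 to roughly 2.6 million at base 500,000.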

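YaRN (Yet another RoPE extensioN method): More sophisticated than NTK-aware scaling. It applies different scaling factors to different frequency components: high frequencies (which resolve local order) are left nearly unchanged, while low frequencies (which span long ranges) are stretched by the full scale factor. It also rescales the attention logits by a temperature that depends on the scale factor. YaRN has been used for long-context fine-tunes such as Yarn-Llama-2 and Yarn-Mistral.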
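
The per-frequency idea can be sketched as a ramp between "stretch fully" and "leave unchanged", based on how many rotations each dimension pair completes over the original context. This is a simplified illustration, not a reference implementation: yarn_style_thetas is a hypothetical name, alpha = 1 and beta = 32 are assumed thresholds, and the sketch covers only the frequency interpolation, leaving out the other parts of the method.

```python
import math

def yarn_style_thetas(
    head_dim: int,
    original_max_len: int,
    scale: float,
    base: float = 10000.0,
    alpha: float = 1.0,   # assumed: at or below this many rotations, stretch fully
    beta: float = 32.0,   # assumed: at or above this many rotations, leave unchanged
) -> list[float]:
    """Per-pair rotation frequencies after a YaRN-style blended rescaling."""
    thetas = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        # Full rotations this pair completes over the original context window.
        rotations = original_max_len * theta / (2 * math.pi)
        # Ramp from 0 (divide frequency by `scale`) to 1 (keep original frequency).
        keep = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        thetas.append((1.0 - keep) * theta / scale + keep * theta)
    return thetas

scaled = yarn_style_thetas(head_dim=128, original_max_len=4096, scale=8.0)
print(scaled[0])   # highest frequency: unchanged
print(scaled[-1])  # lowest frequency: original value divided by 8
```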


ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2022) takes a completely different approach: instead of modifying embeddings, add a fixed negative bias proportional to distance directly to the attention scores before softmax.

attention_score(i, j) = (Q_i · K_j) / sqrt(d_k) - slope × |i - j|

Closer tokens get a smaller penalty and farther tokens a larger one. The bias is zero for a token attending to itself and negative everywhere else, so each head is biased toward recent tokens, with the slope setting how sharply attention falls off.
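
As a worked example of the formula, consider the last query in a 5-token causal sequence with slope 0.5. The sketch below uses plain Python; alibi_bias_row and softmax are hypothetical helper names, and all content scores are taken to be equal so that the bias alone determines the weights:

```python
import math

def alibi_bias_row(query_pos: int, slope: float) -> list[float]:
    """Bias added to the scores of keys 0..query_pos for one causal query."""
    return [slope * (j - query_pos) for j in range(query_pos + 1)]

def softmax(scores: list[float]) -> list[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

bias = alibi_bias_row(query_pos=4, slope=0.5)
print(bias)  # [-2.0, -1.5, -1.0, -0.5, 0.0]

# With equal content scores, attention weights come from the bias alone.
weights = softmax(bias)
print([round(w, 3) for w in weights])
```

The weights rise monotonically toward the most recent token, which receives the largest share.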

Python
import torch
import torch.nn.functional as F
import math

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    """
    Compute per-head slopes for ALiBi.
    Different heads use different slopes → different attention ranges.
    """
    # Slopes form a geometric sequence: head k (k = 1..n) gets slope 2^(-8k/n).
    # For n = 8 heads this gives 1/2, 1/4, ..., 1/256.
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio ** i for i in range(n)]

    if math.log2(num_heads).is_integer():
        slopes = get_slopes_power_of_2(num_heads)
    else:
        # For non-power-of-2 num_heads: closest power of 2 + interpolation
        n = 2 ** math.floor(math.log2(num_heads))
        slopes = get_slopes_power_of_2(n)
        slopes_rest = get_slopes_power_of_2(2 * n)[0::2][: num_heads - n]
        slopes = slopes + slopes_rest

    return torch.tensor(slopes, dtype=torch.float32)

def alibi_attention(
    q: torch.Tensor,  # (batch, heads, seq_len, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    slopes: torch.Tensor,  # (num_heads,)
) -> torch.Tensor:
    """Attention with ALiBi biases."""
    batch, heads, seq_len, head_dim = q.shape
    scale = head_dim ** -0.5

    # Standard attention scores
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, heads, seq, seq)

    # Causal mask
    mask = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))

    # ALiBi bias: -slope × distance
    # Distance matrix: |position_i - position_j| for causal attention is just (i - j) for i >= j
    positions = torch.arange(seq_len, device=q.device)
    distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # (seq, seq)
    distance = torch.clamp(distance, max=0).abs()  # Causal: future positions have distance 0 (masked anyway)

    # Apply per-head slopes: (heads, 1, 1) × (1, seq, seq)
    alibi_bias = -slopes.to(device=q.device, dtype=q.dtype).view(heads, 1, 1) * distance.unsqueeze(0)  # (heads, seq, seq)
    scores = scores + alibi_bias.unsqueeze(0)  # Broadcast over batch

    # Softmax and output
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Example
num_heads = 8
slopes = get_alibi_slopes(num_heads)
print("ALiBi slopes:", slopes.tolist())
# Different heads attend at different ranges: first head attends locally, last head more globally
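
For a power-of-two head count the slopes have a simple closed form, which also gives a rough feel for each head's range. A small sketch in plain Python; the "half-weight distance" is an illustrative measure introduced here, not a quantity from the ALiBi paper:

```python
import math

num_heads = 8
# Closed form for a power-of-two head count: head k gets slope 2^(-8k/n).
slopes = [2 ** (-8 * k / num_heads) for k in range(1, num_heads + 1)]
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

# Distance at which the bias alone halves a key's unnormalized weight:
# exp(-slope * d) = 1/2  =>  d = ln(2) / slope
half_distances = [math.log(2) / slope for slope in slopes]
for head, (slope, dist) in enumerate(zip(slopes, half_distances), start=1):
    print(f"head {head}: slope={slope:<10} half-weight distance ~ {dist:.1f} tokens")
```

The first head halves a key's weight after about 1.4 tokens, the last after about 177, which is why the heads cover very different ranges.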

RoPE vs ALiBi Comparison

| Property | RoPE | ALiBi |
|---|---|---|
| Where applied | Rotates Q and K vectors | Additive bias on attention scores |
| Learnable parameters | No | No |
| Length extrapolation | Needs scaling (NTK/YaRN) | Extrapolates well beyond training length |
| Quality at training length | Excellent | Excellent |
| Quality beyond training length | Good with scaling | Good (bias keeps decaying) |
| Used by | LLaMA, Mistral, Qwen, Gemma | MPT, BLOOM, Falcon-RW |
| Multi-query compatibility | Yes | Yes |

ALiBi's extrapolation advantage: The linear penalty is fixed rather than learned and is defined for any distance, so it discourages attention to very distant tokens at every sequence length, not just those seen in training. The recency bias the model was trained with carries over unchanged to longer inputs.

RoPE's quality advantage: At the training length, RoPE tends to outperform ALiBi, since a rotation in every dimension pair carries richer positional information than a single scalar bias per head.


Practical: Implementing RoPE in a Full Attention Layer

Python
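import torch
import torch.nn.functional as F

# Reuses precompute_rope_frequencies and apply_rope defined earlier
class RoPEAttention(torch.nn.Module):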
    def __init__(self, dim: int, num_heads: int, max_seq_len: int = 8192, rope_base: float = 500000.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads

        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.wo = torch.nn.Linear(dim, dim, bias=False)

        # Precompute frequencies (cached, not parameters)
        cos, sin = precompute_rope_frequencies(self.head_dim, max_seq_len, rope_base)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape

        q = self.wq(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE to Q and K
        cos = self.cos[:seq_len].unsqueeze(0).unsqueeze(0)  # (1, 1, seq, head_dim)
        sin = self.sin[:seq_len].unsqueeze(0).unsqueeze(0)

        q = apply_rope(q, cos, sin)
        k = apply_rope(k, cos, sin)

        # PyTorch dispatches to a fused kernel such as FlashAttention when one is available
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, dim)
        return self.wo(out)

The LLaMA family (LLaMA-1 through LLaMA-3), Mistral, Qwen-2, and many other open-source models use this RoPE-based pattern, with minor variations in base frequency and scaling strategy.
