Positional Encoding

The Problem: Transformers Are Permutation-Invariant

Unlike RNNs, a transformer processes all tokens in parallel via attention. But self-attention, softmax(Q × K^T / sqrt(d_k)) × V, is permutation-equivariant (often loosely called permutation-invariant): if you shuffle the input tokens, the output tokens are shuffled in exactly the same way and are otherwise unchanged. The model has no native sense of order.

Positional encoding injects position information into the token representations so that "dog bites man" and "man bites dog" produce different outputs.
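A minimal NumPy sketch of this property, assuming a single attention head with random projection matrices (the helpers `softmax` and `self_attention` are illustrative names, not library functions). Shuffling the input rows shuffles the output rows the same way and changes nothing else:

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v


rng = np.random.default_rng(0)
T, d = 5, 8                                  # sequence length, model width
x = rng.normal(size=(T, d))                  # token embeddings, no positions
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)                    # a random reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling inputs then attending equals attending then shuffling outputs
equivariant = np.allclose(out[perm], out_shuffled)
print("Permutation-equivariant:", equivariant)
```

Each token's output vector is identical regardless of where the token sits in the sequence, which is why position has to be injected explicitly.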

Option 1: Sinusoidal Encoding (Original Transformer)

The original 2017 paper "Attention Is All You Need" used fixed sinusoidal functions:

PE(pos, 2i)     = sin( pos / 10000^(2i/d_model) )
PE(pos, 2i + 1) = cos( pos / 10000^(2i/d_model) )

Where:

pos is the token position (0, 1, 2, ...)
i is the dimension index (0 to d_model/2 - 1)
Even dimensions use sine, odd dimensions use cosine
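As a small worked example of the formula, the sketch below evaluates it term by term for pos = 1 and d_model = 4, using only the standard library. The helper `pe_vector` is a hypothetical name used for illustration and assumes an even d_model:

```python
import math


def pe_vector(pos: int, d_model: int) -> list:
    """Evaluate the sinusoidal formula for one position (d_model must be even)."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle))   # dimension 2i
        vec.append(math.cos(angle))   # dimension 2i + 1
    return vec


# d_model = 4 gives two frequencies: 1 / 10000^0 = 1 and 1 / 10000^(2/4) = 1/100
vec = pe_vector(pos=1, d_model=4)
print(vec)   # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

The first pair oscillates quickly with position, while the second pair changes about a hundred times more slowly.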

Properties:

Each position gets a unique d_model-dimensional vector
Nearby positions have similar encodings (smooth)
The encoding generalises beyond training length (in theory)
No parameters to learn

Python

import numpy as np
import matplotlib.pyplot as plt


def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """
    Returns positional encoding matrix of shape (max_len, d_model).
    """
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    dim_indices = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model/2)

    # Frequency denominator: 10000^(2i/d_model)
    div_term = np.power(10000.0, dim_indices / d_model)   # (1, d_model/2)

    PE[:, 0::2] = np.sin(positions / div_term)   # even dims
    PE[:, 1::2] = np.cos(positions / div_term)   # odd dims
    return PE


# Visualise
pe = sinusoidal_encoding(max_len=50, d_model=128)
print("Shape:", pe.shape)   # (50, 128)

plt.figure(figsize=(10, 4))
plt.imshow(pe, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
plt.xlabel("Dimension")
plt.ylabel("Position")
plt.title("Sinusoidal Positional Encoding")
plt.colorbar()
plt.tight_layout()
plt.savefig("sinusoidal_pe.png", dpi=150)
plt.show()
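The properties listed earlier can be checked numerically. The sketch below is self-contained (it redefines a compact encoding function under the hypothetical name `sinusoidal_pe`). It checks that every position gets a distinct vector, that the dot product between two encodings depends only on their offset, and that a neighbouring position is more similar than a distant one for the sizes chosen here:

```python
import numpy as np


def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Compact version of the sinusoidal encoding (d_model must be even)."""
    pos = np.arange(max_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


pe = sinusoidal_pe(max_len=64, d_model=32)

# 1. Uniqueness: no two positions share the same vector
n_unique = len(np.unique(pe.round(6), axis=0))

# 2. Dot products depend only on the offset: PE(p) . PE(p + 3) is the same for all p
gram = pe @ pe.T
offset_3 = np.array([gram[p, p + 3] for p in range(61)])
offset_only = np.allclose(offset_3, offset_3[0])

# 3. Smoothness: a neighbour is more similar than a far-away position
near_vs_far = gram[10, 11] > gram[10, 40]

print(n_unique, offset_only, near_vs_far)
```

The second check follows from sin(a)sin(b) + cos(a)cos(b) = cos(a - b): each sine/cosine pair contributes a term that depends only on the distance between the two positions.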

PyTorch Module

Python

import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding added to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()
        # 1 / 10000^(2i/d_model), computed in log space
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)   # (1, max_len, d_model) — batch broadcast

        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, T, d_model) — add positional encoding"""
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


# Test
pe_module = SinusoidalPositionalEncoding(d_model=256)
x = torch.randn(4, 20, 256)
out = pe_module(x)
print("Output shape:", out.shape)   # (4, 20, 256)
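The NumPy function divides positions by 10000^(2i/d_model), whereas the module multiplies them by exp(-2i · ln(10000) / d_model). A short check, written in NumPy as an illustration, confirms that the two forms produce the same frequencies:

```python
import numpy as np

d_model = 128
two_i = np.arange(0, d_model, 2)             # 0, 2, 4, ..., d_model - 2

# Direct form: 1 / 10000^(2i / d_model)
direct = 1.0 / np.power(10000.0, two_i / d_model)

# Log-space form used in the module: exp(-2i * ln(10000) / d_model)
log_space = np.exp(two_i * (-np.log(10000.0) / d_model))

same = np.allclose(direct, log_space)
print("Same frequencies:", same)
```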

Option 2: Learned Positional Embeddings

Instead of a fixed formula, each position gets a learnable embedding vector. This is how BERT and GPT-2 handle positions.

Python

class LearnedPositionalEncoding(nn.Module):
    """Trainable position embedding table."""

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        # Shape: (max_len, d_model) — one vector per position
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        positions = torch.arange(T, device=x.device)   # (T,)
        return x + self.position_embedding(positions)   # broadcast over batch


# BERT-style embedding layer
class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.norm      = nn.LayerNorm(d_model)
        self.dropout   = nn.Dropout(0.1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        B, T = token_ids.shape
        pos = torch.arange(T, device=token_ids.device)
        emb = self.token_emb(token_ids) + self.pos_emb(pos)
        return self.dropout(self.norm(emb))

Learned vs fixed comparison:

| Property | Sinusoidal | Learned |
|----------|-----------|---------|
| Parameters | 0 | max_len × d_model |
| Extrapolates beyond max_len | Yes (theoretically) | No |
| Fits training data better | No | Yes (memorises position) |
| Used by | Original Transformer | BERT, GPT-2, GPT-3 |
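To make the parameter count and the hard length limit concrete, here is a rough sketch that uses a plain NumPy array as a stand-in for a trained embedding table. The sizes 512 and 768 are chosen to resemble a BERT-base-like configuration, and `lookup` is a hypothetical helper:

```python
import numpy as np

max_len, d_model = 512, 768
pos_table = np.zeros((max_len, d_model))     # stand-in for a learned table

# One d_model-sized vector per position
n_params = pos_table.size
print("Position parameters:", n_params)      # 512 * 768 = 393216


def lookup(seq_len: int) -> np.ndarray:
    """Fetch position vectors for positions 0 .. seq_len - 1."""
    return pos_table[np.arange(seq_len)]


in_range_shape = lookup(512).shape           # fits: exactly max_len positions

# Position 512 has no row in the table, so a longer sequence cannot be embedded
try:
    lookup(513)
    failed = False
except IndexError:
    failed = True

print("Lookup beyond max_len failed:", failed)
```

A sinusoidal encoding has no such table, so it can be evaluated at any position, although model quality at unseen lengths is a separate question.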

Option 3: Rotary Positional Encoding (RoPE)

RoPE (Su et al. 2021) is the dominant approach in modern LLMs (LLaMA, Mistral, Qwen, Gemma). Instead of adding a positional vector to the embedding, RoPE rotates the query and key vectors by an angle proportional to their position.

The key insight: for dot-product attention q · k, if we rotate q by angle m*θ and k by angle n*θ, then:

RoPE(q, m) · RoPE(k, n) = f(q, k, m - n)

The dot product depends only on the relative position m - n, not on the absolute positions. Attention scores therefore become a function of relative offset without any explicit bias term.

Python

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map halves (x1, x2) to (-x2, x1): a 90-degree rotation of each pair (x1[i], x2[i])."""
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rope(q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """
    Apply rotary positional encoding to Q and K.

    q, k: (B, h, T, d_k)
    cos, sin: (T, d_k), cosine and sine of the precomputed rotation angles
    """
    cos = cos.unsqueeze(0).unsqueeze(0)   # (1, 1, T, d_k)
    sin = sin.unsqueeze(0).unsqueeze(0)

    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot


def build_rope_cache(max_len: int, d_k: int, base: float = 10000.0):
    """Precompute cos and sin rotation matrices."""
    theta = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(max_len).float()
    freqs = torch.outer(positions, theta)   # (max_len, d_k/2)
    freqs = torch.cat([freqs, freqs], dim=-1)   # (max_len, d_k)
    return freqs.cos(), freqs.sin()


# Quick test
B, h, T, d_k = 2, 8, 16, 64
cos_cache, sin_cache = build_rope_cache(T, d_k)
q = torch.randn(B, h, T, d_k)
k = torch.randn(B, h, T, d_k)
q_rot, k_rot = apply_rope(q, k, cos_cache, sin_cache)
print("Rotated Q shape:", q_rot.shape)   # (2, 8, 16, 64)
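The relative-position property can be verified numerically. The sketch below is a self-contained NumPy re-implementation for a single vector, using the same half-split layout as the code above (`rope` is an illustrative helper name). It shows that the score depends on m - n only, and that rotation preserves vector length:

```python
import numpy as np


def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate one vector x of shape (d,) to position pos (half-split layout)."""
    d = x.shape[-1]
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)        # (d/2,)
    angles = np.concatenate([pos * theta, pos * theta])   # (d,)
    x1, x2 = x[: d // 2], x[d // 2 :]
    rotated_half = np.concatenate([-x2, x1])
    return x * np.cos(angles) + rotated_half * np.sin(angles)


rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

score_a = rope(q, 7) @ rope(k, 3)        # m - n = 4
score_b = rope(q, 104) @ rope(k, 100)    # m - n = 4, shifted by 97 positions
score_c = rope(q, 7) @ rope(k, 2)        # m - n = 5

same_offset_same_score = np.isclose(score_a, score_b)
different_offset_differs = not np.isclose(score_a, score_c)
norm_preserved = np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q))

print(same_offset_same_score, different_offset_differs, norm_preserved)
```

Because each pair of dimensions is rotated by an angle proportional to position, the rotations applied to q and k combine in the dot product into a single rotation by (m - n)θ.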

Option 4: ALiBi (Attention with Linear Biases)

ALiBi (Press et al. 2022) adds a negative bias to attention logits proportional to the distance between tokens:

attn_logit(i, j) = q_i · k_j / sqrt(d_k)  -  slope_h × |i - j|

Each head h gets its own fixed slope slope_h. Nearby tokens incur a smaller penalty, distant tokens a larger one. No positional vectors are added to the embeddings at all.

Python

def build_alibi_slopes(n_heads: int) -> torch.Tensor:
    """Compute per-head ALiBi slopes (this closed form assumes n_heads is a power of two)."""
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    m = torch.arange(1, n_heads + 1, dtype=torch.float32)
    slopes = 2 ** (-8.0 * m / n_heads)
    return slopes   # (n_heads,)


def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """
    Returns ALiBi bias tensor of shape (n_heads, seq_len, seq_len).
    bias[h, i, j] = -slope_h * |i - j|
    """
    slopes = build_alibi_slopes(n_heads)   # (h,)
    positions = torch.arange(seq_len, dtype=torch.float32)
    distances = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()  # (T, T)
    # (h, 1, 1) × (1, T, T) → (h, T, T)
    bias = -slopes.view(-1, 1, 1) * distances.unsqueeze(0)
    return bias


# Test
bias = alibi_bias(seq_len=10, n_heads=8)
print("ALiBi bias shape:", bias.shape)   # (8, 10, 10)
print("Head 0, row 5:", bias[0, 5, :])  # Most negative values farthest from position 5
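To show where the bias enters the computation, here is a hedged NumPy sketch that adds it to the logits before the softmax, together with a causal mask. The function `alibi_attention_weights` is a hypothetical helper, and the queries and keys are set to zero so that the resulting weights reflect the distance penalty alone:

```python
import numpy as np


def alibi_attention_weights(q, k, slopes):
    """q, k: (h, T, d_k); slopes: (h,). Returns causal attention weights (h, T, T)."""
    h, T, d_k = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, T, T)
    pos = np.arange(T)
    dist = np.abs(pos[:, None] - pos[None, :])             # (T, T)
    logits = logits - slopes[:, None, None] * dist         # linear distance penalty
    future = pos[None, :] > pos[:, None]                   # True where key j > query i
    logits = np.where(future, -np.inf, logits)             # causal mask
    logits = logits - logits.max(axis=-1, keepdims=True)   # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


n_heads, T, d_k = 8, 6, 16
slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)   # 1/2, 1/4, ..., 1/256

# Zero content scores, so only the ALiBi bias shapes the distribution
q = np.zeros((n_heads, T, d_k))
k = np.zeros((n_heads, T, d_k))
weights = alibi_attention_weights(q, k, slopes)

print("Steepest head, last query:", weights[0, -1].round(3))
print("Shallowest head, last query:", weights[-1, -1].round(3))
```

Heads with steep slopes concentrate on the most recent tokens, while heads with shallow slopes spread attention almost uniformly over the visible context.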

Comparison Summary

| Method | Added to | Relative? | Extrapolates? | Used by |
|--------|----------|-----------|---------------|---------|
| Sinusoidal | Embeddings | No | Partially | Transformer (2017) |
| Learned | Embeddings | No | No | BERT, GPT-2 |
| RoPE | Q and K only | Yes | Yes (with tricks) | LLaMA, Mistral, Qwen |
| ALiBi | Attention logits | Yes | Yes | BLOOM, MPT |

YaRN: Extending Context with RoPE

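Models trained with RoPE at a 4k context often degrade sharply beyond that window. YaRN (Peng et al. 2023) extends the context by rescaling RoPE frequencies: it interpolates the low-frequency dimensions, leaves the high-frequency dimensions largely untouched, and adds a temperature correction to the attention logits. The snippet below implements only the simpler "NTK-aware" base rescaling that YaRN builds on, not the full method:

Python

def build_ntk_scaled_rope_cache(
    max_len: int,
    d_k: int,
    scale: float = 16.0,
    base: float = 10000.0,
):
    """
    NTK-aware RoPE scaling: enlarge the base so that the lowest frequency is
    slowed by `scale` while the highest frequency is left unchanged.
    scale = new_context / original_context (e.g. 65536 / 4096 = 16).
    Full YaRN also blends per-dimension interpolation and rescales the
    attention logits; neither is shown here.
    """
    # Enlarged base: base * scale^(d_k / (d_k - 2))
    new_base = base * (scale ** (d_k / (d_k - 2)))
    theta = 1.0 / (new_base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(max_len).float()
    freqs = torch.outer(positions, theta)        # (max_len, d_k/2)
    freqs = torch.cat([freqs, freqs], dim=-1)    # (max_len, d_k)
    return freqs.cos(), freqs.sin()


cos_scaled, sin_scaled = build_ntk_scaled_rope_cache(max_len=65536, d_k=128)
print("Scaled RoPE cache shape:", cos_scaled.shape)   # (65536, 128)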
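The effect of the base rescaling new_base = base × scale^(d_k / (d_k - 2)) can be seen by comparing frequencies before and after. This NumPy sketch uses the same default values as the snippet above, with illustrative variable names:

```python
import numpy as np

d_k, base, scale = 128, 10000.0, 16.0
exponents = np.arange(0, d_k, 2) / d_k             # 2i / d_k for each pair

theta = 1.0 / base ** exponents                    # original frequencies
new_base = base * scale ** (d_k / (d_k - 2))
theta_scaled = 1.0 / new_base ** exponents         # frequencies after rescaling

# Slow-down factor per dimension pair: scale^(2i / (d_k - 2))
ratio = theta / theta_scaled
print("Highest-frequency pair slowed by:", ratio[0])
print("Lowest-frequency pair slowed by: ", ratio[-1])
```

The fastest-rotating pair is unchanged (factor 1), so local ordering is still resolved as in training. The slowest pair is stretched by the full `scale` factor, and the pairs in between are slowed progressively.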

Key Takeaways

Transformers need explicit position info because attention on its own is permutation-equivariant: it cannot tell one token order from another.
Sinusoidal encodings are fixed, parameter-free, and can in theory be evaluated at positions beyond the training length.
Learned encodings fit training data better but fail at unseen lengths.
RoPE rotates Q and K so that their dot product reflects relative distance — the current standard in LLMs.
ALiBi biases attention logits by distance — simple and effective for length generalisation.
YaRN and similar methods allow extending the effective context window of RoPE models without full retraining.
