Learnixo
AI Systems · intermediate

Sinusoidal Positional Encoding

How the original Transformer injects position with sine and cosine functions, why that design encodes relative distance, and what its limitations are.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: Transformers, Positional Encoding, Sinusoidal, Architecture, Interview

The Formula

The original Transformer (Vaswani et al., 2017) defines positional encodings as:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
  pos     = token position (0, 1, 2, ...)
  i       = dimension index (0, 1, 2, ..., d_model/2 - 1)
  d_model = embedding dimension (512 in original paper)

Each position gets a unique d_model-dimensional vector. Even dimensions use sine, odd dimensions use cosine. The wavelengths form a geometric progression from 2π (i = 0) up to just under 10000 · 2π = 20000π (i = d_model/2 - 1).
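To make the wavelength range concrete, here is a minimal sketch that evaluates 2π · 10000^(2i/d_model) for every sine/cosine pair. The helper name `wavelengths` and the `base` parameter are illustrative choices, not something defined in the paper.

```python
import math

def wavelengths(d_model: int, base: float = 10000.0) -> list:
    """Wavelength of each sine/cosine pair i: 2*pi * base^(2i / d_model)."""
    return [2 * math.pi * base ** (2 * i / d_model) for i in range(d_model // 2)]

w = wavelengths(512)  # d_model from the original paper
ratio = w[1] / w[0]   # neighbouring wavelengths differ by a constant factor

print(f"pairs:          {len(w)}")
print(f"shortest (i=0): {w[0]:.4f}  (2*pi = {2 * math.pi:.4f})")
print(f"longest:        {w[-1]:.1f}  (20000*pi = {20000 * math.pi:.1f})")
print(f"ratio:          {ratio:.6f}")
```

The longest wavelength stays slightly below 20000π because the largest exponent is (d_model - 2)/d_model rather than 1.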


Intuition: Binary Clocks

Think of it like a binary counter where each bit position flips at a different rate:

Binary: position 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, ...
         bit 0 (LSB): flips every step
         bit 1:       flips every 2 steps
         bit 2:       flips every 4 steps
         bit 3:       flips every 8 steps

Sinusoidal: similar but smooth and continuous
  dim 0,1:  high frequency (fast oscillation — local position)
  dim d-2,d-1: low frequency (slow oscillation — global position)

Low-index dimensions oscillate quickly and encode fine-grained (local) position; high-index dimensions oscillate slowly and encode coarse (global) position.
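The analogy can be checked numerically. The sketch below prints the binary counter next to the sine components for the first few positions, then measures how much each component changes per step. It assumes a tiny d_model of 8 purely for readability; `sine_component` and `mean_step_change` are hypothetical helper names.

```python
import math

D_MODEL = 8       # tiny illustrative size, not the paper's 512
N_POSITIONS = 64  # number of steps to average over

def sine_component(pos: int, i: int, d_model: int = D_MODEL) -> float:
    """PE(pos, 2i) = sin(pos / 10000^(2i / d_model))."""
    return math.sin(pos / 10000 ** (2 * i / d_model))

# Side by side: binary counter bits and the sine components of each position.
for pos in range(8):
    bits = format(pos, "04b")
    sines = "  ".join(f"{sine_component(pos, i):+.3f}" for i in range(D_MODEL // 2))
    print(f"pos {pos}: bits {bits} | sin dims {sines}")

def mean_step_change(i: int) -> float:
    """Average |PE(pos+1, 2i) - PE(pos, 2i)| over the first N_POSITIONS steps."""
    steps = [
        abs(sine_component(p + 1, i) - sine_component(p, i))
        for p in range(N_POSITIONS)
    ]
    return sum(steps) / len(steps)

changes = [mean_step_change(i) for i in range(D_MODEL // 2)]
print("mean change per step, i = 0..3:", [f"{c:.4f}" for c in changes])
```

The per-step change shrinks as i grows, matching the frequency ordering listed above: dim 0,1 move fast, dim d-2,d-1 move slowly.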


Why Sine and Cosine?

The sine/cosine choice has a useful property for relative positions:

PE(pos + k) can be expressed as a LINEAR FUNCTION of PE(pos):

PE(pos + k, 2i)   = PE(pos, 2i)   · cos(k·ω) + PE(pos, 2i+1) · sin(k·ω)
PE(pos + k, 2i+1) = PE(pos, 2i+1) · cos(k·ω) - PE(pos, 2i)   · sin(k·ω)

where ω = 1 / 10000^(2i/d_model)

In other words, each (sine, cosine) pair at position pos+k is the pair at position pos rotated by the angle k·ω. The rotation depends only on the offset k, not on pos, so the attention mechanism can in principle learn to detect relative offsets (k positions ahead or behind) through suitable linear projections.
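The identity above is just the angle-addition formulas for sine and cosine, and it can be verified numerically. The following sketch compares the shifted pair against a direct evaluation at pos + k; the function names are illustrative.

```python
import math

def omega(i: int, d_model: int) -> float:
    """Angular frequency of pair i: 1 / 10000^(2i / d_model)."""
    return 1.0 / 10000 ** (2 * i / d_model)

def pe_pair(pos: int, i: int, d_model: int) -> tuple:
    """(PE(pos, 2i), PE(pos, 2i+1)) = (sin(pos*w), cos(pos*w))."""
    w = omega(i, d_model)
    return math.sin(pos * w), math.cos(pos * w)

def shift_pair(pair: tuple, k: int, i: int, d_model: int) -> tuple:
    """Apply the linear map that takes the pair at pos to the pair at pos + k."""
    s, c = pair
    w = omega(i, d_model)
    cos_k, sin_k = math.cos(k * w), math.sin(k * w)
    return s * cos_k + c * sin_k, c * cos_k - s * sin_k

d_model, k = 512, 5
max_err = 0.0
for pos in (0, 37, 400):
    for i in range(d_model // 2):
        shifted = shift_pair(pe_pair(pos, i, d_model), k, i, d_model)
        direct = pe_pair(pos + k, i, d_model)
        max_err = max(
            max_err, abs(shifted[0] - direct[0]), abs(shifted[1] - direct[1])
        )

print(f"max abs difference: {max_err:.2e}")  # floating-point noise only
```

Note that `shift_pair` never uses pos: the same coefficients work for every starting position, which is what makes the offset k learnable as a fixed linear map.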


Code

Python
import math

import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of encodings; d_model must be even."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)

    # Inverse frequencies 1 / 10000^(2i/d_model), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims

    return pe  # shape: (max_len, d_model)

# Usage: add to token embeddings before the first encoder block
seq_len, d_model = 10, 512
token_embeddings = torch.randn(seq_len, d_model)  # placeholder for real embeddings
pe = sinusoidal_encoding(max_len=512, d_model=d_model)
x = token_embeddings + pe[:seq_len, :]  # shape: (seq_len, d_model)

Properties

Advantages:

  • No learned parameters — fixed, doesn't add to parameter count
  • Extrapolates to unseen lengths — the formula works for any position, even beyond training
  • Unique per position — no two positions share an identical encoding
  • Relative position information — linear relationship between PE(pos) and PE(pos+k)

Limitations:

  • Not learned from data — might not be optimal for a specific task or domain
  • Additive — position information is added to the embedding; Q/K/V projections can distort it
  • Absolute, not relative — the model still sees absolute positions; relative distance must be learned indirectly
  • Extrapolation degrades in practice: quality typically drops on sequences longer than those seen in training, and even modern LLMs that use RoPE still suffer past the training context
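Two of the properties above can be checked directly on the raw encodings: every position gets a distinct vector, and the dot product between two encodings depends only on their offset k. This is a sketch under simplifying assumptions (d_model = 64, 200 positions, no learned projections); once Q/K projections are applied, the offset-only property is no longer guaranteed, which is the "additive" limitation listed above. The helpers `encode` and `dot` are illustrative.

```python
import math

def encode(pos: int, d_model: int) -> list:
    """Sinusoidal encoding of one position, interleaved as [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))

d_model = 64  # small illustrative size
table = [encode(p, d_model) for p in range(200)]

# Uniqueness: no two of the 200 positions share an identical vector.
n_unique = len({tuple(v) for v in table})
print(f"unique encodings: {n_unique} of {len(table)}")

# Offset-only similarity: same k at two different absolute positions.
k = 7
near_start = dot(table[3], table[3 + k])
further_on = dot(table[100], table[100 + k])
print(f"dot at pos 3 and 3+k:     {near_start:.6f}")
print(f"dot at pos 100 and 100+k: {further_on:.6f}")
```

The two dot products agree because each sine/cosine pair contributes cos(k·ω) to the sum, regardless of pos.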

Comparison with Learned Encodings

| Property | Sinusoidal | Learned |
|----------|-----------|---------|
| Parameters | 0 | max_len × d_model |
| Extrapolation | In theory yes | No (unseen positions) |
| Performance | Slightly worse | Slightly better (in-distribution) |
| Used in | Original Transformer | BERT, GPT-2 |

Empirically, the difference is small within the training context window. Learned encodings tend to win in-distribution; sinusoidal encodings tend to win on length generalisation, since a learned table has no entry for positions it never saw.
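The parameter and extrapolation rows can be illustrated with a few lines of arithmetic. The sketch below assumes the commonly cited configurations for BERT-base (512 positions, width 768) and GPT-2 small (1024 positions, width 768), and models a learned table as a plain list of rows rather than a real embedding layer.

```python
import math

def learned_position_params(max_len: int, d_model: int) -> int:
    """A learned position table stores one d_model-sized row per position."""
    return max_len * d_model

# (max_len, d_model) pairs; the model names are for illustration only.
configs = {
    "Transformer-sized (512 x 512)": (512, 512),
    "BERT-base (512 x 768)": (512, 768),
    "GPT-2 small (1024 x 768)": (1024, 768),
}
for name, (max_len, d_model) in configs.items():
    n = learned_position_params(max_len, d_model)
    print(f"{name}: learned = {n:,} parameters, sinusoidal = 0")

# Extrapolation: a learned table has no row for position max_len or beyond.
max_len, d_model = 4, 8
learned_table = [[0.0] * d_model for _ in range(max_len)]  # stand-in for trained rows
try:
    learned_table[max_len]
    lookup_ok = True
except IndexError:
    lookup_ok = False
print("learned lookup past max_len works:", lookup_ok)

# The sinusoidal formula is defined for any position.
beyond = math.sin(max_len / 10000 ** (0 / d_model))
print(f"sinusoidal PE({max_len}, 0) = {beyond:.4f}")
```

Being defined at an unseen position is not the same as the model handling it well, which is why the table says "in theory".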


Interview Answer

"Sinusoidal positional encoding (Vaswani et al.) assigns each position a d_model-dimensional vector using sine and cosine at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...). The key property is that PE(pos+k) is a linear function of PE(pos), so the model can learn to detect relative distances k. It requires no learned parameters and can extrapolate to unseen lengths in theory. Modern models have moved toward learned absolute (BERT) and rotary encodings (RoPE) for better empirical performance."
