Sinusoidal Positional Encoding Explained — Transformer Architecture Q&A | Learnixo

The Formula

The original Transformer (Vaswani et al., 2017) defines positional encodings as:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
  pos     = token position (0, 1, 2, ...)
  i       = dimension index (0, 1, 2, ..., d_model/2 - 1)
  d_model = embedding dimension (512 in original paper)

Each position gets a unique d_model-dimensional vector. Even dimensions use sine and odd dimensions use cosine. The wavelengths form a geometric progression from 2π (i = 0) up to just under 10000·2π = 20000π (i = d_model/2 - 1).
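As a concrete check of the formula, the sketch below evaluates it in plain Python and reports the wavelength of each sine/cosine pair. The toy size d_model = 8 is chosen only to keep the output small, and the helper names `positional_encoding` and `wavelength` are illustrative, not taken from the paper.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Return the d_model-dimensional encoding of a single position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def wavelength(i: int, d_model: int) -> float:
    """Number of positions pair i needs to complete one full cycle."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)

d_model = 8  # toy size, assumed for illustration
# At position 0 every sine term is 0 and every cosine term is 1.
print(positional_encoding(0, d_model))
for i in range(d_model // 2):
    print(i, round(wavelength(i, d_model), 2))
```

With d_model = 512, the last pair (i = 255) has wavelength 2π · 10000^(510/512), which is slightly below 20000π.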

Intuition: Binary Clocks

Think of it like a binary counter where each bit position flips at a different rate:

Binary: position 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, ...
         bit 0 (LSB): flips every step
         bit 1:       flips every 2 steps
         bit 2:       flips every 4 steps
         bit 3:       flips every 8 steps

Sinusoidal: similar but smooth and continuous
  dim 0,1:  high frequency (fast oscillation — local position)
  dim d-2,d-1: low frequency (slow oscillation — global position)

Low-index dimensions oscillate quickly and encode fine-grained (local) position; high-index dimensions oscillate slowly and encode coarse (global) position.
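The clock analogy can be checked numerically. The sketch below counts how often the sine component of each pair changes sign over the first 1,000 positions, a rough analogue of how often a bit flips. The values d_model = 8 and the 1,000-position window are arbitrary choices for illustration, and `sign_flips` is a hypothetical helper.

```python
import math

def sign_flips(i: int, d_model: int, num_positions: int) -> int:
    """Count sign changes of sin(pos * omega_i) across consecutive positions."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    flips = 0
    # Start at pos = 1 because sin(0) is exactly 0 and has no sign.
    prev = math.sin(1 * omega)
    for pos in range(2, num_positions):
        cur = math.sin(pos * omega)
        if (cur > 0) != (prev > 0):
            flips += 1
        prev = cur
    return flips

d_model = 8  # toy size, assumed for illustration
for i in range(d_model // 2):
    count = sign_flips(i, d_model, 1000)
    print(f"pair {i} (dims {2 * i},{2 * i + 1}): {count} sign flips")
```

The counts fall as the pair index grows: the first pair behaves like the fast-flipping least significant bit, and the last pair like a slow high-order bit.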

Why Sine and Cosine?

The sine/cosine choice has a useful property for relative positions:

For a fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos):

PE(pos + k, 2i)   = PE(pos, 2i)   · cos(k·ω) + PE(pos, 2i+1) · sin(k·ω)
PE(pos + k, 2i+1) = PE(pos, 2i+1) · cos(k·ω) - PE(pos, 2i)   · sin(k·ω)

where ω = 1 / 10000^(2i/d_model)

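This means the encoding of position pos+k is a linear transformation of the encoding of position pos. For each frequency, the pair (PE(pos, 2i), PE(pos, 2i+1)) is rotated by the angle k·ω, and the rotation matrix depends only on the offset k, not on pos. The attention mechanism can therefore, in principle, learn to detect relative offsets (k positions ahead or behind) by learning appropriate linear combinations.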
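The identity above can be verified numerically. The sketch below applies the 2×2 transformation for offset k to each (sine, cosine) pair of PE(pos) and compares the result with PE(pos + k) computed directly. The helper names `pe_pair` and `shift_pair` and the values of d_model, pos, and k are assumptions for illustration.

```python
import math

def pe_pair(pos: int, i: int, d_model: int) -> tuple[float, float]:
    """Return (PE(pos, 2i), PE(pos, 2i+1)), i.e. (sin, cos) at one frequency."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(pos * omega), math.cos(pos * omega)

def shift_pair(pair: tuple[float, float], k: int, i: int, d_model: int) -> tuple[float, float]:
    """Apply the linear map for offset k; it depends on k and i, not on pos."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    ck, sk = math.cos(k * omega), math.sin(k * omega)
    return (s * ck + c * sk, c * ck - s * sk)

d_model, pos, k = 512, 37, 5  # arbitrary values for illustration
max_error = max(
    abs(a - b)
    for i in range(d_model // 2)
    for a, b in zip(
        shift_pair(pe_pair(pos, i, d_model), k, i, d_model),
        pe_pair(pos + k, i, d_model),
    )
)
print(f"max abs error: {max_error:.2e}")  # floating-point noise only
```

Because `shift_pair` never sees `pos`, the same matrix maps any position to the one k steps later, which is what makes the offset learnable by a linear layer.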

Code

Python

import torch
import math

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()

    # Inverse frequencies 1 / 10000^(2i/d_model), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims

    return pe  # shape: (max_len, d_model)

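# Usage: add to token embeddings before the first encoder block.
# token_embeddings is a random placeholder here; in a real model it
# would come from an embedding layer.
seq_len = 128
token_embeddings = torch.randn(seq_len, 512)  # shape: (seq_len, d_model)
pe = sinusoidal_encoding(max_len=512, d_model=512)
x = token_embeddings + pe[:seq_len, :]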

Properties

Advantages:

No learned parameters — fixed, doesn't add to parameter count
Extrapolates to unseen lengths — the formula works for any position, even beyond training
Unique per position — no two positions share an identical encoding
Relative position information — linear relationship between PE(pos) and PE(pos+k)

Limitations:

Not learned from data — might not be optimal for a specific task or domain
Additive — position information is added to the embedding; Q/K/V projections can distort it
Absolute, not relative — the model still sees absolute positions; relative distance must be learned indirectly
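Extrapolation degrades in practice: the formula is defined at unseen positions, but the model was never trained on them, so quality typically drops on longer sequences. Even modern LLMs that use RoPE degrade past their training context.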
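Two of the properties above can be checked directly in plain Python. The sketch below shows that the dot product between two raw encodings depends only on their offset (it equals the sum of cos(k·ω) over all frequencies, a consequence of the sine/cosine pairing), and that encodings are distinct over a finite range of positions. The size d_model = 64, the positions, and the helper names are assumptions for illustration. The check applies to the raw encodings only, before any Q/K/V projection.

```python
import math

def encode(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding of one position as a flat list."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend((math.sin(angle), math.cos(angle)))
    return pe

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

d_model, k = 64, 7  # arbitrary values for illustration

# Same offset k at two different absolute positions: the dot products agree.
d1 = dot(encode(3, d_model), encode(3 + k, d_model))
d2 = dot(encode(200, d_model), encode(200 + k, d_model))
print(d1, d2)

# Uniqueness over positions 0..999, compared after rounding to 6 decimals.
rows = {tuple(round(x, 6) for x in encode(p, d_model)) for p in range(1000)}
print(len(rows) == 1000)
```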

Comparison with Learned Encodings

| Property | Sinusoidal | Learned |
|----------|-----------|---------|
| Parameters | 0 | max_len × d_model |
| Extrapolation | In theory yes | No (unseen positions) |
| Performance | Slightly worse | Slightly better (in-distribution) |
| Used in | Original Transformer | BERT, GPT-2 |

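Empirically, the difference is small within the training context window; Vaswani et al. reported nearly identical results for the two variants. Learned encodings can have a slight edge in-distribution, while sinusoidal encodings are at least defined beyond the trained length, where a learned table has no entry.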
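To put the parameter cost of a learned table in perspective, the arithmetic max_len × d_model can be worked through for two common configurations. The sizes below (512 × 768 for BERT-base, 1024 × 768 for the smallest GPT-2) are the published configurations; the helper name is hypothetical.

```python
def learned_position_params(max_len: int, d_model: int) -> int:
    """Parameters in a learned absolute position table: one vector per position."""
    return max_len * d_model

print(learned_position_params(512, 768))   # BERT-base sizes  -> 393216
print(learned_position_params(1024, 768))  # GPT-2 small sizes -> 786432
```

For BERT-base, with roughly 110M parameters in total, the position table is well under 1% of the model, so the parameter saving from sinusoidal encodings is real but small.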

Interview Answer

"Sinusoidal positional encoding (Vaswani et al.) assigns each position a d_model-dimensional vector using sine and cosine at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...). The key property is that PE(pos+k) is a linear function of PE(pos), so the model can learn to detect relative distances k. It requires no learned parameters and can extrapolate to unseen lengths in theory. Modern models have moved toward learned absolute (BERT) and rotary encodings (RoPE) for better empirical performance."