Transformer Architecture Q&A · Lesson 12 of 23
Sinusoidal Positional Encoding Explained
The Formula
The original Transformer (Vaswani et al., 2017) defines positional encodings as:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where:
pos = token position (0, 1, 2, ...)
i = dimension index (0, 1, 2, ..., d_model/2 - 1)
d_model = embedding dimension (512 in the original paper)
Each position gets a unique d_model-dimensional vector. Even dimensions use sine, odd dimensions use cosine, and the wavelengths form a geometric progression from 2π (i = 0) up to just under 10000 · 2π = 20000π (i = d_model/2 - 1).
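To make the formula concrete, here is a minimal sketch that evaluates it directly in plain Python. The helper names `positional_encoding` and `wavelength` are illustrative and not from the original paper, and the sketch assumes `d_model` is even.

```python
import math
from typing import List


def positional_encoding(pos: int, d_model: int) -> List[float]:
    """Evaluate PE(pos, .) straight from the formula (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe


def wavelength(i: int, d_model: int) -> float:
    """Number of positions dimension pair i needs to complete one full cycle."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)


print(positional_encoding(0, 8))  # position 0: every sine is 0, every cosine is 1
print(wavelength(0, 512))         # 2π for the fastest pair
print(wavelength(255, 512))       # a little under 20000π for the slowest pair
```

Looping over dimension pairs keeps the code close to the notation above; a vectorised version is preferable for real workloads.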
Intuition: Binary Clocks
Think of it like a binary counter where each bit position flips at a different rate:
Binary: position 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, ...
bit 0 (LSB): flips every step
bit 1: flips every 2 steps
bit 2: flips every 4 steps
bit 3: flips every 8 steps
Sinusoidal: similar but smooth and continuous
dim 0,1: high frequency (fast oscillation: local position)
dim d-2,d-1: low frequency (slow oscillation: global position)
Low-index dimensions encode fine-grained (local) position; high-index dimensions encode coarse (global) position.
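The analogy can be checked with a small sketch that measures how often each bit of a counter flips and how far each sinusoidal pair advances per position. The function names and the tiny `d_model = 8` are illustrative assumptions chosen to keep the numbers readable.

```python
import math


def steps_between_flips(bit: int, n: int = 64) -> int:
    """Count positions between consecutive changes of one bit in a binary counter."""
    flips = [p for p in range(1, n)
             if ((p >> bit) & 1) != (((p - 1) >> bit) & 1)]
    return flips[1] - flips[0]


def angular_step(i: int, d_model: int) -> float:
    """Radians that dimension pair i advances per position (its frequency ω)."""
    return 1.0 / (10000 ** (2 * i / d_model))


d_model = 8  # small illustrative size
for bit in range(4):
    print(f"bit {bit}: flips every {steps_between_flips(bit)} steps")
for i in range(d_model // 2):
    step = angular_step(i, d_model)
    print(f"dims {2 * i},{2 * i + 1}: advance {step:.4f} rad per step")
```

With `d_model = 8` the per-step advance shrinks by a factor of 10 from one pair to the next (1, 0.1, 0.01, 0.001 radians), mirroring the doubling of the flip period from bit to bit.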
Why Sine and Cosine?
The sine/cosine choice has a useful property for relative positions:
PE(pos + k) can be expressed as a LINEAR FUNCTION of PE(pos):
PE(pos + k, 2i) = PE(pos, 2i) · cos(k·ω) + PE(pos, 2i+1) · sin(k·ω)
PE(pos + k, 2i+1) = PE(pos, 2i+1) · cos(k·ω) - PE(pos, 2i) · sin(k·ω)
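where ω = 1 / 10000^(2i/d_model)
This means that, for a fixed offset k, the encoding of position pos+k is a linear transformation of the encoding of position pos: each sine/cosine pair is rotated by the angle k·ω, and the transformation does not depend on pos. The attention mechanism can therefore, in principle, learn to detect relative offsets (k positions ahead or behind) by learning appropriate linear combinations.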
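The two identities can be verified numerically. The sketch below applies the offset-k map to one sine/cosine pair and compares the result with encoding pos + k directly; `pe_pair`, `shift_pair`, and the specific values of `i`, `pos`, and `k` are illustrative choices.

```python
import math


def pe_pair(pos: float, omega: float):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) = (sin(pos*ω), cos(pos*ω))."""
    return math.sin(pos * omega), math.cos(pos * omega)


def shift_pair(pair, k: float, omega: float):
    """Apply the offset-k linear map from the identities above to one pair."""
    s, c = pair
    return (
        s * math.cos(k * omega) + c * math.sin(k * omega),  # new sine component
        c * math.cos(k * omega) - s * math.sin(k * omega),  # new cosine component
    )


d_model, i = 512, 3  # illustrative choices
omega = 1.0 / (10000 ** (2 * i / d_model))
pos, k = 17, 5

direct = pe_pair(pos + k, omega)
shifted = shift_pair(pe_pair(pos, omega), k, omega)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_error)  # expected to be at floating-point rounding level
```

Note that `shift_pair` uses only k and ω, never pos, which is what makes the map the same for every starting position.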
Code
import math

import torch


def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) tensor of encodings; assumes d_model is even."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()  # shape: (max_len, 1)
    # Inverse frequencies 1 / 10000^(2i/d_model), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # shape: (max_len, d_model)
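# Usage: add to token embeddings before the first encoder block
pe = sinusoidal_encoding(max_len=512, d_model=512)
seq_len = 10  # illustrative sequence length
token_embeddings = torch.randn(seq_len, 512)  # stand-in for real token embeddings
x = token_embeddings + pe[:seq_len, :]

Properties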
Advantages:
- No learned parameters — fixed, doesn't add to parameter count
- Extrapolates to unseen lengths — the formula works for any position, even beyond training
- Unique per position — no two positions share an identical encoding
- Relative position information — linear relationship between PE(pos) and PE(pos+k)
Limitations:
- Not learned from data — might not be optimal for a specific task or domain
- Additive — position information is added to the embedding; Q/K/V projections can distort it
- Absolute, not relative — the model still sees absolute positions; relative distance must be learned indirectly
- Extrapolation degrades in practice: models tend to perform worse at positions beyond those seen in training, and even models that use newer schemes such as RoPE still degrade past the training context
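Two of the advantages listed above, uniqueness and relative position information, can be checked numerically. The sketch below is a plain-Python illustration with small, arbitrary sizes; `encode` and `dot` are hypothetical helper names. It relies on the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), which makes the dot product of two encodings depend only on their offset.

```python
import math


def encode(pos: int, d_model: int) -> list:
    """Sinusoidal encoding of one position (d_model assumed even)."""
    out = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        out.extend((math.sin(angle), math.cos(angle)))
    return out


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


d_model, max_len = 16, 200  # small illustrative sizes
table = [encode(p, d_model) for p in range(max_len)]

# Uniqueness: no two positions in this range share an identical vector.
distinct = len(set(map(tuple, table)))

# Relative information: PE(pos) . PE(pos + k) depends on k, not on pos.
k = 7
dots = [dot(table[p], table[p + k]) for p in range(max_len - k)]
spread = max(dots) - min(dots)
print(distinct == max_len, spread)
```

The spread across starting positions should sit at rounding-error level, since each dot product equals the sum of cos(k·ω) over the dimension pairs.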
Comparison with Learned Encodings
| Property | Sinusoidal | Learned |
|----------|-----------|---------|
| Parameters | 0 | max_len × d_model |
| Extrapolation | In theory yes | No (unseen positions) |
| Performance | Slightly worse | Slightly better (in-distribution) |
| Used in | Original Transformer | BERT, GPT-2 |
Empirically, the difference is small within the training context window. Learned encodings win in-distribution; sinusoidal wins for length generalisation.
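As a quick worked example of the Parameters row, the sketch below counts the position parameters for the sizes used in the code earlier (max_len = 512, d_model = 512). The function names are illustrative.

```python
def learned_position_params(max_len: int, d_model: int) -> int:
    """A learned table stores one trainable d_model-vector per position."""
    return max_len * d_model


def sinusoidal_position_params(max_len: int, d_model: int) -> int:
    """Sinusoidal encodings come from a fixed formula, so nothing is trained."""
    return 0


print(learned_position_params(512, 512))     # 262144
print(sinusoidal_position_params(512, 512))  # 0
```

The learned table also fixes max_len at training time, which is why the table lists no extrapolation to unseen positions for it.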
Interview Answer
"Sinusoidal positional encoding (Vaswani et al.) assigns each position a d_model-dimensional vector using sine and cosine at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...). The key property is that PE(pos+k) is a linear function of PE(pos), so the model can learn to detect relative distances k. It requires no learned parameters and can extrapolate to unseen lengths in theory. Modern models have moved toward learned absolute (BERT) and rotary encodings (RoPE) for better empirical performance."