Why Positional Encoding?
Why transformers are position-agnostic by default, what breaks without positional information, and the design space for injecting position into attention-based models.
The Permutation Problem
Attention is a set operation — it computes weighted sums over all tokens, but the weights depend only on content, not order:
Sentence A: "The dog bit the man"
Sentence B: "The man bit the dog"
Without positional encoding, a transformer sees the same bag of tokens
for both sentences. Attention scores between "dog" and "bit" are identical
in both cases, so the model cannot distinguish them.
This is a fundamental issue: natural language meaning depends critically on word order. "dog bit man" and "man bit dog" describe opposite events.
Why Attention Has No Inherent Order
The attention computation:
score(i, j) = q_i · k_j / √d_k
q_i = x_i · W_Q
k_j = x_j · W_K
Here, x_i is the embedding at position i, and W_Q and W_K are the learned query and key projection matrices. The score between positions i and j depends only on the embedding values, not on the indices i and j themselves. Permuting the sequence changes which embedding sits at which position, but the attention operation never receives i or j as an input.
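The same point can be stated for a whole sequence at once. As a short derivation, assume X is the matrix whose rows are the embeddings x_i, P is any permutation matrix, and W_Q, W_K, W_V are the query, key, and value projections (W_V is introduced here for the output step). Row-wise softmax commutes with permuting rows and columns together, so:

```latex
\begin{aligned}
\mathrm{Attn}(X) &= \mathrm{softmax}\!\left(\frac{X W_Q (X W_K)^\top}{\sqrt{d_k}}\right) X W_V \\
\mathrm{Attn}(PX) &= \mathrm{softmax}\!\left(\frac{P X W_Q W_K^\top X^\top P^\top}{\sqrt{d_k}}\right) P X W_V \\
&= P\,\mathrm{softmax}\!\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d_k}}\right) P^\top P\, X W_V \\
&= P\,\mathrm{Attn}(X)
\end{aligned}
```

Reordering the input reorders the output rows in exactly the same way and changes nothing else, which is what "permutation-equivariant" means.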
What Breaks Without Position
import torch
import torch.nn.functional as F
# Toy 4-d embeddings; the values are illustrative, not from a trained model
tok_dog = torch.tensor([1.0, 0.0, 0.0, 0.0])
tok_bit = torch.tensor([0.0, 1.0, 0.0, 0.0])
tok_man = torch.tensor([0.0, 0.0, 1.0, 0.0])
# Two sentences with identical tokens, different order
s1 = torch.stack([tok_dog, tok_bit, tok_man])  # dog bit man
s2 = torch.stack([tok_man, tok_bit, tok_dog])  # man bit dog
W_q = W_k = W_v = torch.eye(4)  # identity for illustration, so q = k = v = x
# Attention for s1 and s2: same multiset, different meaning
# (the divisor 2 is sqrt(d_k) for d_k = 4)
attn_s1 = F.softmax(s1 @ s1.T / 2, dim=-1) @ s1
attn_s2 = F.softmax(s2 @ s2.T / 2, dim=-1) @ s2
# The outputs are the same SET of vectors, just in a different order
print(torch.allclose(attn_s1, attn_s2.flip(0)))  # True
# A classifier reading position 0 would get different things,
# but only because the TOKEN is different, not because the model
# understands "position 0 is the subject"
Without a positional signal, the model cannot learn that "the token at position 0 is the subject" while "the token at position 2 is the object."
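A minimal NumPy sketch of what this means for a downstream classifier, assuming made-up one-hot embeddings, identity projections, and mean pooling (none of these come from a real model). The pooled representation of both sentences is identical until a position vector is added to each token. The position vectors in `pos` are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Identity projections: Q = K = V = x (illustration only)
    d_k = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d_k)) @ x

# Hypothetical one-hot embeddings for three tokens
dog, bit, man = np.eye(4)[:3]
s1 = np.stack([dog, bit, man])  # dog bit man
s2 = np.stack([man, bit, dog])  # man bit dog

# Without position: the mean-pooled outputs match
pooled_s1 = self_attention(s1).mean(axis=0)
pooled_s2 = self_attention(s2).mean(axis=0)
same_without_position = np.allclose(pooled_s1, pooled_s2)

# With position: add an arbitrary, distinct vector p_i to each position
pos = np.array([[0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0, 2.0],
                [0.0, 0.0, 0.0, 3.0]])
pooled_s1_pos = self_attention(s1 + pos).mean(axis=0)
pooled_s2_pos = self_attention(s2 + pos).mean(axis=0)
same_with_position = np.allclose(pooled_s1_pos, pooled_s2_pos)

print(same_without_position, same_with_position)  # True False
```

Any order-insensitive readout (mean, sum, max) behaves the same way: without position it sees only which tokens are present.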
The Design Space
Several approaches inject position information:
Absolute positional encoding:
Each position i gets a fixed or learned vector p_i
Added to the token embedding: x_i + p_i
Examples: sinusoidal (Vaswani et al.), learned (BERT, GPT-2)
Relative positional encoding:
Encode the DISTANCE between positions i and j
Added to attention scores: score(i, j) += bias(i - j)
Examples: T5 relative bias, ALiBi
Rotary positional encoding (RoPE):
Rotate Q and K vectors by an angle proportional to position
Relative distance naturally falls out of the dot product
Examples: LLaMA, Mistral, GPT-NeoX
Learned per-position vectors (the learned variant of absolute encoding):
Each position 0..max_len-1 gets a learned embedding
Used by BERT and GPT-2
Key Properties of a Good Positional Encoding
- Each position gets a unique encoding: the model can distinguish position 5 from position 6
- Generalisable to unseen lengths: sinusoidal encodings and RoPE are defined for any position, so they can in principle extrapolate, although quality often degrades well beyond the training length; learned absolute encodings have no vector at all for unseen positions
- Relative distances are learnable: the model should be able to learn "this token is 3 positions before that one"
- Bounded magnitude: positional vectors shouldn't dominate the token embeddings
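Two of these properties, uniqueness and bounded magnitude, can be checked directly for the sinusoidal encoding of Vaswani et al. The sketch below assumes an even `d_model` and the usual base of 10000; the function name `sinusoidal_encoding` and the sizes are illustrative choices.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) matrix of sinusoidal position vectors."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    angles = positions * freqs                             # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=16)

# Uniqueness: count distinct position vectors (rounded to absorb float noise)
n_unique = len(np.unique(pe.round(6), axis=0))

# Bounded magnitude: sin^2 + cos^2 = 1 for each frequency pair, so every
# position vector has norm sqrt(d_model / 2), whatever the position
norms = np.linalg.norm(pe, axis=1)

print(n_unique, norms.min(), norms.max())
```

For these sizes all 128 vectors are distinct and every norm equals sqrt(8), so the positional signal cannot grow with position and swamp the token embedding.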
No single method dominates all criteria. The field has moved from sinusoidal → learned absolute → relative → RoPE/ALiBi over successive model generations.
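The design-space claim that, under RoPE, relative distance falls out of the dot product can be checked numerically. This is a minimal sketch, not any model's actual implementation: the helper `rope`, the pairing of adjacent dimensions, and the base of 10000 are assumptions made for illustration.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "this sketch assumes an even dimension"
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # 2-D rotation, first coordinate
    out[1::2] = x_even * sin + x_odd * cos  # 2-D rotation, second coordinate
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same offset (3) at two different absolute positions
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)

# A different offset (1) gives a different score in general
score_other = rope(q, 5) @ rope(k, 4)

print(np.isclose(score_near, score_far), np.isclose(score_near, score_other))
```

Because each pair is rotated rather than shifted, the vector norms are unchanged and the score depends only on the difference between the two positions, not on where the pair sits in the sequence.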
Residual Token Identity Without Position
An interesting property: even without positional encoding, attention still produces useful output, because it can aggregate information about which tokens are present. The limitation is purely relational: without position, the model cannot tell that the subject comes before the verb.
This is why a BERT-style encoder stripped of its position embeddings can still handle bag-of-words-like tasks, yet cannot solve tasks that hinge on word order.
Interview Answer
"Transformers are permutation-equivariant by default: attention scores depend only on token content, not position. Without injecting positional information, the model sees 'dog bit man' and 'man bit dog' as the same bag of tokens. Positional encoding solves this by injecting position information, either into the token embeddings or directly into the attention computation. The design space ranges from fixed sinusoidal encodings (Vaswani et al.) to learned absolute positions (BERT/GPT-2) to relative and rotary encodings (RoPE, ALiBi) that encode distances rather than absolute positions."