Why Positional Encoding?

The Permutation Problem

Attention is a set operation — it computes weighted sums over all tokens, but the weights depend only on content, not order:

Sentence A: "The dog bit the man"
Sentence B: "The man bit the dog"

Without positional encoding, a transformer sees the same bag of tokens
for both sentences. Attention scores between "dog" and "bit" are identical
in both cases — the model cannot distinguish them.

This is a fundamental issue: natural language meaning depends critically on word order. "dog bit man" and "man bit dog" describe opposite events.

Why Attention Has No Inherent Order

The attention computation:

score(i, j) = qᵢ · kⱼ / √dₖ

qᵢ = x_i · Wᴬ
kⱼ = x_j · Wᴷ

Here, x_i is the embedding at position i. The score between positions i and j depends only on the embedding values, not on i and j themselves. Permuting the sequence changes which embedding is at which position, but the attention operation itself doesn't know what i or j are.

What Breaks Without Position

Python

import torch
import torch.nn.functional as F

# Two sentences with identical tokens, different order
s1 = torch.stack([tok_dog, tok_bit, tok_man])  # dog bit man
s2 = torch.stack([tok_man, tok_bit, tok_dog])  # man bit dog

W_q = W_k = W_v = torch.eye(4)  # identity for illustration

# Attention for s1 and s2 — same multiset, different meaning
attn_s1 = F.softmax(s1 @ s1.T / 2, dim=-1) @ s1
attn_s2 = F.softmax(s2 @ s2.T / 2, dim=-1) @ s2

# The outputs are the same SET of vectors, just in different order
# A classifier reading position 0 would get different things,
# but only because the TOKEN is different — not because the model
# understands "position 0 is the subject"

Without positional signal, the model cannot learn that "the token at position 0 is the subject" vs "the token at position 2 is the object."

The Design Space

Several approaches inject position information:

Absolute positional encoding:
  Each position i gets a fixed or learned vector p_i
  Added to the token embedding: x_i + p_i
  Examples: sinusoidal (Vaswani et al.), learned (BERT, GPT-2)

Relative positional encoding:
  Encode the DISTANCE between positions i and j
  Added to attention scores: score(i, j) += bias(i - j)
  Examples: T5 relative bias, ALiBi

Rotary positional encoding (RoPE):
  Rotate Q and K vectors by an angle proportional to position
  Relative distance naturally falls out of the dot product
  Examples: LLaMA, Mistral, GPT-NeoX

Learned per-position vectors (absolute):
  Each position 0..max_len gets a learned embedding
  Most common in BERT, GPT-2

Key Properties of a Good Positional Encoding

Each position gets a unique encoding — the model can distinguish position 5 from position 6
Generalisable to unseen lengths — sinusoidal and RoPE can extrapolate; learned encodings cannot easily
Relative distances are learnable — the model should be able to learn "this token is 3 positions before that one"
Bounded magnitude — positional vectors shouldn't dominate the token embeddings

No single method dominates all criteria. The field has moved from sinusoidal → learned absolute → relative → RoPE/ALiBi over successive model generations.

Residual Token Identity Without Position

An interesting property: even without positional encoding, attention still produces useful output — it can aggregate information about what tokens are present. The limitation is purely relational: the model can't tell that "the subject comes before the verb" without position.

This is why BERT still functions on bag-of-words tasks without position — but fails on any task where order matters.

Interview Answer

"Transformers are permutation-equivariant by default — attention scores depend only on token content, not position. Without injecting positional information, the model sees 'dog bit man' and 'man bit dog' as the same bag of tokens. Positional encoding solves this by adding position-specific information to each token embedding. The design space ranges from fixed sinusoidal encodings (Vaswani et al.) to learned absolute positions (BERT/GPT-2) to relative and rotary encodings (RoPE, ALiBi) that encode distances rather than absolute positions."