Interview Q&A: Positional Encoding

Q: Why do transformers need positional encoding?

Self-attention is a set operation — it computes weighted sums over all tokens with weights that depend only on token content, not order. Without positional information, the model cannot distinguish "dog bit man" from "man bit dog" because the same tokens are present, just reordered.

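Positional encoding injects order information in one of two places: by adding position-specific signals to the token representations before attention is computed (sinusoidal, learned absolute), or by modifying the queries, keys, or attention scores inside the attention computation itself (RoPE, ALiBi).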
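The order-blindness described above can be checked numerically. The following is a minimal NumPy sketch, assuming a single attention head with random weights and no positional signal; the function name `self_attention` and the three random vectors standing in for "dog", "bit", "man" are illustrative, not taken from any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))                # stand-ins for "dog", "bit", "man"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])                 # "man bit dog"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the input rows only reorders the output rows.
equivariant = np.allclose(out[perm], out_perm)
```

Each token gets exactly the same output vector in both orderings, so any order-insensitive readout (for example mean pooling) cannot tell the two sentences apart.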

Q: Describe the four main approaches to positional encoding.

1. Sinusoidal (Vaswani et al., 2017):
   Fixed formula: PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
                  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
   No learned parameters. Extrapolates in theory. For a fixed offset k, PE(pos+k) is a linear function of PE(pos).
   Used in: original Transformer

2. Learned absolute (BERT, GPT-2):
   Trainable embedding table E_pos ∈ ℝ^(max_len × d_model)
   Performs on par with sinusoidal in-distribution (Vaswani et al. reported nearly identical results). Cannot extrapolate past max_len.
   Used in: BERT (max 512), GPT-2 (max 1024)

3. Rotary (RoPE, LLaMA, Mistral):
   Rotate Q and K by angles proportional to position.
   The Q·K dot product then depends on position only through the relative distance (i-j).
   No positional parameters. Extrapolates better than learned absolute. Dominant in modern open-source LLMs.

4. ALiBi (MPT, BLOOM):
   Add a linear penalty -m·(i-j) to the attention score between query position i and key position j (j ≤ i), where m is a fixed per-head slope.
   No embedding at input level. Strong extrapolation at inference.
   Different slope per head → automatic local/global specialisation.

Q: What is the key advantage of RoPE over learned absolute positions?

Learned absolute positions cannot extrapolate: positions at or beyond max_len have no trained embedding, so the model has no usable signal for them. RoPE encodes position by rotating the query and key vectors, so the dot product between the query at position i and the key at position j depends on position only through (i-j). Because the model only ever sees relative offsets, it carries over to longer sequences more gracefully than a fixed lookup table.

Additionally, RoPE requires zero positional parameters (unlike a learned embedding table of max_len × d_model entries).
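The relative-distance property can be verified with a small NumPy sketch. It assumes the interleaved pairing (dimensions 2i and 2i+1 rotated together) and base 10000; some implementations pair the first and second halves of the vector instead. The helper name `rope` is hypothetical.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

score_near = rope(q, 5) @ rope(k, 2)          # i - j = 3
score_far = rope(q, 1005) @ rope(k, 1002)     # i - j = 3, shifted by 1000
score_other = rope(q, 5) @ rope(k, 4)         # i - j = 1

same_offset_same_score = np.allclose(score_near, score_far)
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it. Note that the function has no trainable state: the only inputs are the vector, the position, and a fixed base.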

Q: What is the key advantage of ALiBi over RoPE?

ALiBi leaves the embeddings, queries, and keys untouched: it only adds a precomputed bias matrix to the attention scores. It extrapolates strongly because the bias is purely a function of distance, so a larger distance simply receives a larger linear penalty. Models trained with ALiBi often handle 2-4× their training context length at inference with graceful degradation, without any fine-tuning.

RoPE extrapolates better than learned absolute but still degrades past the training context. YaRN and LongRoPE extend RoPE's range by rescaling the rotation frequencies, typically followed by a short fine-tune on longer sequences.
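The ALiBi bias can be sketched in a few lines of NumPy. This assumes a causal model and a power-of-two head count, for which the slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ...; the function names `alibi_slopes` and `alibi_bias` are illustrative.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len, n_heads):
    """Causal bias of shape (n_heads, seq_len, seq_len): -m * (i - j) for j <= i."""
    i = np.arange(seq_len)[:, None]    # query positions
    j = np.arange(seq_len)[None, :]    # key positions
    distance = i - j
    bias = -alibi_slopes(n_heads)[:, None, None] * distance
    return np.where(distance >= 0, bias, -np.inf)   # mask future keys

bias = alibi_bias(seq_len=4, n_heads=8)   # added to the scores before softmax
```

Because the bias depends only on i-j, the same function produces a valid bias for any sequence length: the top-left corner of a longer bias matrix is identical to the shorter one. Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back.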

Q: What happens to a model with learned positional embeddings when you feed it a sequence longer than max_len?

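Position indices at or beyond max_len were never seen during training. A table with max_len rows has no entry for them at all, so in most frameworks the lookup simply fails with an index error. If the table was allocated larger than the longest training sequence, the extra rows still hold their initial values, because they never received a gradient. Either way, the model has no meaningful positional signal for those tokens, and output quality breaks down on the part of the sequence beyond the trained range.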

Common mitigations:

Truncate input to max_len (lose information)
Interpolate existing position embeddings (degrade quality)
Fine-tune on longer sequences (expensive, requires data)
Switch architecture to use RoPE or ALiBi
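The interpolation mitigation can be sketched as follows. This is a minimal NumPy example assuming plain linear interpolation along the position axis, with a random matrix standing in for a trained table; the helper name `interpolate_positions` is hypothetical.

```python
import numpy as np

def interpolate_positions(table, new_len):
    """Linearly resample a (max_len, d_model) table to (new_len, d_model)."""
    old_len, d_model = table.shape
    old_grid = np.arange(old_len)
    # Squeeze the new positions into the range the model was trained on.
    new_grid = np.linspace(0, old_len - 1, new_len)
    columns = [np.interp(new_grid, old_grid, table[:, c]) for c in range(d_model)]
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
table = rng.normal(size=(512, 64))        # stand-in for a trained embedding table
longer = interpolate_positions(table, 1024)
```

The first and last rows are preserved and everything in between is a blend of neighbouring trained rows. Adjacent tokens now sit closer together in embedding space than they did during training, which is why this usually costs some quality unless followed by a fine-tune.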

Q: How does the sinusoidal encoding relate to relative position?

The sinusoidal encoding has the mathematical property that PE(pos+k) = M_k · PE(pos), where M_k is a fixed rotation matrix depending only on k (the offset). This means the positional encoding at position pos+k is a linear function of the encoding at position pos, with a transform that depends only on the relative distance k, not on the absolute positions.

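In principle, the attention mechanism can therefore attend by relative offset: a learned query or key projection can absorb M_k, so detecting "k positions back" does not require knowing the absolute position. This was the original paper's hypothesis; in practice any such behaviour is learned implicitly through training.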
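The identity PE(pos+k) = M_k · PE(pos) follows from the angle-addition formulas and can be checked directly. The sketch below assumes the interleaved layout (sin in even dimensions, cos in odd dimensions) and base 10000; `sinusoidal_pe` and `offset_matrix` are illustrative names.

```python
import numpy as np

def sinusoidal_pe(pos, d_model, base=10000.0):
    """PE(pos, 2i) = sin(pos * w_i), PE(pos, 2i+1) = cos(pos * w_i)."""
    w = base ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * w)
    pe[1::2] = np.cos(pos * w)
    return pe

def offset_matrix(k, d_model, base=10000.0):
    """Block-diagonal M_k: one 2x2 rotation per frequency, built from k alone."""
    w = base ** (-np.arange(0, d_model, 2) / d_model)
    m = np.zeros((d_model, d_model))
    for n, angle in enumerate(k * w):
        c, s = np.cos(angle), np.sin(angle)
        m[2 * n:2 * n + 2, 2 * n:2 * n + 2] = [[c, s], [-s, c]]
    return m

d_model, k = 16, 7
m_k = offset_matrix(k, d_model)

# The same matrix maps PE(pos) to PE(pos + k) at any starting position.
holds_at_3 = np.allclose(m_k @ sinusoidal_pe(3, d_model), sinusoidal_pe(3 + k, d_model))
holds_at_300 = np.allclose(m_k @ sinusoidal_pe(300, d_model), sinusoidal_pe(300 + k, d_model))
```

`offset_matrix` never receives pos, which is the point: the transform is determined by the offset k and the fixed frequencies only.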

Q: In clinical NLP, does positional encoding matter more or less than in general text?

More, in specific ways:

Clinical notes often have structured sections (Chief Complaint, History of Present Illness, Assessment & Plan) where position within the document encodes semantic role
Long documents (discharge summaries, progress notes) can exceed standard context windows — long-context positional encoding becomes important
Temporal ordering within a note matters: a symptom mentioned at position 0 vs position 500 may carry different weight
De-identification (masking PHI) requires accurate token-level position tracking

ClinicalBERT uses BERT's learned absolute positions (max 512), limiting it to shorter note excerpts. RoPE-based models that were trained or fine-tuned with longer context windows can process full discharge summaries in a single pass.

Interview Answer Template

"Transformers have no inherent notion of order: self-attention is permutation-equivariant. Positional encoding supplies position information, either by adding it to the token representations before attention or by modifying the attention computation itself. Sinusoidal encoding uses fixed sine/cosine functions with a linear-transform property for relative positions. Learned absolute embeddings perform comparably in-distribution but can't extrapolate past max_len. RoPE rotates Q/K by position-dependent angles so relative distance is encoded in the dot product, with no positional parameters and better length generalisation. ALiBi adds a linear distance penalty to attention scores directly, extrapolating gracefully. Modern LLMs (LLaMA, Mistral) use RoPE; some (MPT, BLOOM) use ALiBi."