Positional Encoding in LLMs
How positional encoding in production LLMs differs from the original Transformer — RoPE details, context length extension, and practical limits of each approach.
Evolution in Production LLMs
GPT-2 (2019): Learned absolute, max 1024 tokens
GPT-3 (2020): Learned absolute, max 2048 tokens
BERT (2018): Learned absolute, max 512 tokens
T5 (2019): Relative bias (T5 relative attention), max 512 default
LLaMA 1 (2023): RoPE, max 2048 tokens
LLaMA 2 (2023): RoPE, max 4096 tokens
LLaMA 3 (2024): RoPE, max 8192 (base), 128K (extended)
Mistral 7B: RoPE + sliding window, max 32K effective
GPT-4: Undisclosed (128K context for GPT-4 Turbo)
Claude 3: Undisclosed (200K context)
Context length has grown roughly 125× to 200× from GPT-2 (1,024 tokens) to frontier models (128K to 200K tokens) in about five years, driven by architectural improvements and long-context fine-tuning.
RoPE in LLMs: Practical Details
RoPE applies rotation to Q and K at each attention layer, using precomputed frequency tables:
import torch

def precompute_freqs_cis(head_dim: int, max_seq_len: int, base: float = 10000.0):
    # One rotation frequency per (even, odd) pair of head dimensions
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(max_seq_len).float()
    freqs = torch.outer(t, theta)  # (max_seq_len, head_dim // 2) rotation angles
    return torch.polar(torch.ones_like(freqs), freqs)  # unit complex numbers e^{i * angle}

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq, xk: (batch, seq_len, heads, head_dim)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Trim to the actual seq len, then add batch and head axes so the table broadcasts
    freqs_cis = freqs_cis[: xq_.shape[1]][None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

The LLaMA implementation uses complex-number multiplication for efficiency; this is equivalent to the 2D rotation matrix formulation.
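To make the equivalence concrete, here is a minimal pure-Python sketch (no PyTorch) for a single vector. The helper names `rope_rotate`, `rope_rotate_matrix`, and `dot` are illustrative and not part of any library. It assumes the same frequency schedule as above, and checks two things: that complex multiplication matches the explicit 2D rotation, and that the attention score depends only on the offset between query and key positions.

```python
import cmath
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by pos * theta_i using complex numbers."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # frequency for this pair
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out

def rope_rotate_matrix(vec, pos, base=10000.0):
    """The same rotation written as an explicit 2x2 rotation matrix per pair."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Complex multiplication and the rotation matrix produce the same vector.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(rope_rotate(q, 7), rope_rotate_matrix(q, 7))
)

# Two (query, key) position pairs that share the same offset of 3.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Here `same` is true, and `score_near` and `score_far` agree up to floating-point error. That agreement is the relative-position property: shifting both positions by the same amount leaves the score unchanged.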
Context Length Extension
When a model trained on 4096 tokens is asked to process 16K tokens, the RoPE rotation angles for positions past 4095 fall outside the range seen during training. Several techniques extend context:
1. Position interpolation (Chen et al., 2023):
Scale positions by factor s = training_len / inference_len
Positions 0..16383 → 0..4095 (compressed)
Fine-tune on long documents for ~1000 steps
Works: LLaMA 2 extended to 32K this way
2. YaRN (Peng et al., 2023):
Scale different frequency dimensions differently
High-frequency dims: no scaling (good short-range)
Low-frequency dims: interpolate (good long-range)
Better perplexity than uniform interpolation
Used: Yarn-Llama-2 (128K), Yarn-Mistral-7B (128K)
3. LongRoPE (Microsoft, 2024):
Evolutionary search for optimal frequency scaling
LLaMA up to 2M tokens with minimal quality loss
The Perplexity Cliff
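The gap between the rotation angles seen in training and those at longer inputs can be sketched numerically, along with how position interpolation and frequency-aware scaling close it. This is a simplified illustration under stated assumptions, not the reference implementation of either paper. `frequency_aware_scaling` is a hypothetical helper in the spirit of YaRN, its `alpha` and `beta` ramp thresholds are assumed values, and YaRN's attention-temperature adjustment is left out.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Per-pair rotation frequencies, fastest first."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]

def interpolate_position(pos, train_len, target_len):
    """Position interpolation: compress positions into the trained range."""
    scale = train_len / target_len  # s = training_len / inference_len
    return pos * scale

def frequency_aware_scaling(freqs, train_len, target_len, alpha=1.0, beta=32.0):
    """Blend per dimension between unchanged and fully interpolated frequencies.

    alpha and beta are assumed thresholds on the number of full rotations a
    dimension completes within the training context (hypothetical defaults).
    """
    factor = target_len / train_len
    scaled = []
    for theta in freqs:
        rotations = train_len * theta / (2 * math.pi)
        # keep = 1 for fast dimensions (left unscaled), 0 for slow ones (interpolated)
        keep = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        scaled.append(keep * theta + (1.0 - keep) * theta / factor)
    return scaled

train_len, target_len = 4096, 16384
freqs = rope_frequencies(head_dim=128)
slowest = freqs[-1]

# Angle of the slowest-rotating pair at the last position.
max_angle_trained = (train_len - 1) * slowest
max_angle_raw = (target_len - 1) * slowest
max_angle_interp = interpolate_position(target_len - 1, train_len, target_len) * slowest

yarn_like = frequency_aware_scaling(freqs, train_len, target_len)
```

For the slowest-rotating pair, the unscaled angle at position 16383 is about four times larger than anything reached in a 4096-token context, while the interpolated angle stays roughly where it was at the end of the training context. The frequency-aware variant leaves the fastest pair untouched and divides the slowest by the full factor of 4.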
Without position interpolation, models fail beyond training context length:
LLaMA 2 (4096 token training context):
Tokens 0-4096: perplexity ≈ 3.2 (expected)
Tokens 4097-8192: perplexity ≈ 7.1 (significant degradation)
Tokens 8193+: perplexity ≈ 15+ (severe degradation)
The model doesn't "know" what to do with position IDs it never
trained on. RoPE rotations at these positions produce unexpected
dot products in attention, so the model effectively loses coherence.
Long Context Use Cases
Medical records: full patient history in context
A discharge summary can be 5K-20K words
Prior visits, medications, lab results: 50K+ tokens for a complex patient
LLMs with 128K+ context can reason over the full record
Legal documents: contract analysis
A complex contract: 10K-100K tokens
Full document in context → no chunking/retrieval needed
Code understanding: entire repo in context
Small repo: 50K-200K tokens
Future LLMs may process entire codebases in one pass
Practical Limits Beyond Context Window
Even with 128K context, attention quality degrades for information buried in the middle of the context window — the "lost in the middle" problem (Liu et al., 2023). Information at the very beginning and very end of the context is retrieved more reliably than information in the middle.
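One practical consequence when assembling a long prompt is to keep the most important material away from the middle. The sketch below is an assumption about how one might do that, not a method prescribed by the cited paper. `order_for_long_context` is a hypothetical helper, and it assumes the chunks arrive already sorted from most to least relevant.

```python
def order_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the prompt.

    `chunks_by_relevance` is assumed to be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: rank 0 goes to the front, rank 1 to the back, and so on.
        (front if rank % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second most relevant chunk ends up last.
    return front + back[::-1]

ordered = order_for_long_context(["A", "B", "C", "D", "E"])
```

With this ordering, the two most relevant chunks sit at the very beginning and very end, and the least relevant ones land in the middle, where retrieval is weakest.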
Interview Answer
"Open-weight production LLMs overwhelmingly use RoPE positional encoding, which rotates Q and K vectors by position-dependent angles so that the attention dot product encodes relative distance. LLaMA 2 uses RoPE with base=10000 and a 4096-token training context. Context extension is achieved via position interpolation (compressing positions to fit the training range, followed by brief fine-tuning) or YaRN (frequency-aware interpolation with minimal fine-tuning). Frontier models such as GPT-4 and Claude 3 support 128K-200K contexts, though their positional schemes are undisclosed. The 'lost in the middle' problem means quality degrades for information in the middle of long contexts even when the model can technically process that length."