LLMs Deep Dive · Lesson 8 of 24

Positional Encodings: RoPE, ALiBi, Sinusoidal

Evolution in Production LLMs

BERT (2018):        Learned absolute, max 512 tokens
GPT-2 (2019):       Learned absolute, max 1024 tokens
T5 (2019):          Relative bias (T5 relative attention), max 512 default
GPT-3 (2020):       Learned absolute, max 2048 tokens
LLaMA 1 (2023):     RoPE, max 2048 tokens
LLaMA 2 (2023):     RoPE, max 4096 tokens
Mistral 7B (2023):  RoPE + sliding-window attention, max 32K tokens
LLaMA 3 (2024):     RoPE, max 8192 (base), 128K (LLaMA 3.1)
GPT-4:              Positional scheme undisclosed (128K context in GPT-4 Turbo)
Claude 3:           Positional scheme undisclosed (200K context)

Context length has grown roughly 125-200× from GPT-2 (1024 tokens) to frontier models (128K-200K tokens) in about five years, driven by architectural improvements and long-context fine-tuning.


RoPE in LLMs: Practical Details

RoPE applies rotation to Q and K at each attention layer, using precomputed frequency tables:

Python
import torch

def precompute_freqs_cis(head_dim: int, max_seq_len: int, base: float = 10000.0):
    # One frequency per pair of dimensions: theta_i = base^(-2i / head_dim)
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(t, theta)  # (max_seq_len, head_dim // 2) rotation angles
    return torch.polar(torch.ones_like(freqs), freqs)  # unit complex numbers e^(i * angle)

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq, xk: (batch, seq_len, heads, head_dim)
    # View consecutive pairs of dimensions as complex numbers
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Trim to the actual seq len, then reshape to (1, seq_len, 1, head_dim // 2)
    # so the table broadcasts over the batch and head dimensions
    freqs_cis = freqs_cis[: xq_.shape[1]][None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

The LLaMA implementation expresses the rotation as complex multiplication, which is equivalent to applying a 2D rotation matrix to each pair of dimensions.
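To make the equivalence concrete, here is a minimal standard-library sketch for a single pair of dimensions. The helper names (`rotate_complex`, `rotate_matrix`, `rope_score`) and the sample values are illustrative assumptions, not part of the LLaMA code. It also checks the property that makes RoPE useful: the score between a rotated query and key depends only on the offset between their positions.

```python
import cmath
import math

def rotate_complex(x1, x2, angle):
    """Rotate the pair (x1, x2) by treating it as the complex number x1 + i*x2."""
    z = complex(x1, x2) * cmath.exp(1j * angle)
    return z.real, z.imag

def rotate_matrix(x1, x2, angle):
    """Rotate the same pair with the explicit 2D rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x1 - s * x2, s * x1 + c * x2

def rope_score(q, k, m, n, theta):
    """Dot product of a query pair at position m and a key pair at position n."""
    qr = rotate_matrix(q[0], q[1], m * theta)
    kr = rotate_matrix(k[0], k[1], n * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

# Hypothetical sample values for one dimension pair
q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.01

# Both formulations give the same rotated pair
print(rotate_complex(q[0], q[1], 7 * theta))
print(rotate_matrix(q[0], q[1], 7 * theta))

# Same offset (3 positions), very different absolute positions: same score
print(rope_score(q, k, 5, 2, theta))
print(rope_score(q, k, 105, 102, theta))
```

The two printed scores agree up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged.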


Context Length Extension

When a model trained on 4096 tokens is asked to process 16K tokens, positions past 4095 produce RoPE rotation angles the model never saw in training. Several techniques extend context:

1. Position interpolation (Chen et al., 2023):
   Scale positions by factor s = training_len / inference_len
   Positions 0..16383 → 0..4095 (compressed)
   Fine-tune on long documents for ~1000 steps
   Works: the paper extended LLaMA (7B-65B) to 32K this way

2. YaRN (Peng et al., 2023):
   Scale different frequency dimensions differently
   High-frequency dims: no scaling (good short-range)
   Low-frequency dims: interpolate (good long-range)
   Better perplexity than uniform interpolation
   Used: Yarn-Llama-2 (128K), Yarn-Mistral-7B (128K)

3. LongRoPE (Microsoft, 2024):
   Evolutionary search for optimal frequency scaling
   LLaMA up to 2M tokens with minimal quality loss
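The first two techniques can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference implementation of either paper: the function names are hypothetical, `head_dim=128` is assumed, and the YaRN-style function covers only the per-dimension frequency blend (the paper's attention scaling is omitted). The ramp thresholds `alpha=1` and `beta=32` are assumed values, so check them against the paper before relying on them.

```python
import math

def rope_angles(position, head_dim, base=10000.0):
    """Rotation angle of each dimension pair at a (possibly fractional) position."""
    return [position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

def interpolated_position(position, train_len, target_len):
    """Position interpolation: compress positions by s = train_len / target_len."""
    return position * (train_len / target_len)

def yarn_style_freqs(head_dim, train_len, target_len, base=10000.0, alpha=1.0, beta=32.0):
    """Blend original and interpolated frequencies per dimension pair (simplified)."""
    stretch = target_len / train_len  # e.g. 4.0 for 4096 -> 16384
    freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)
        # Number of full turns this pair completes over the training window
        rotations = train_len * theta / (2 * math.pi)
        # ramp = 1 keeps the frequency (fast pairs); ramp = 0 fully interpolates (slow pairs)
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - ramp) * theta / stretch + ramp * theta)
    return freqs

train_len, target_len, head_dim = 4096, 16384, 128
last = target_len - 1

plain = rope_angles(last, head_dim)
squeezed = rope_angles(interpolated_position(last, train_len, target_len), head_dim)
print("largest angle, no scaling:    ", max(plain))
print("largest angle, interpolation: ", max(squeezed))

freqs = yarn_style_freqs(head_dim, train_len, target_len)
print("fastest pair frequency (unchanged):", freqs[0])
print("slowest pair frequency (divided by the stretch factor):", freqs[-1])
```

With uniform interpolation every angle shrinks by the same factor, so even the fast, short-range pairs lose resolution. The blended version leaves those pairs alone and compresses only the slow ones.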

The Perplexity Cliff

Without position interpolation or a similar technique, models fail beyond their training context length:

LLaMA 2 (4096-token training context):
  Tokens 1-4096:    perplexity ≈ 3.2 (expected)
  Tokens 4097-8192: perplexity ≈ 7.1 (significant degradation)
  Tokens 8193+:     perplexity 15+ (severe degradation)

The model doesn't "know" what to do with position IDs it never
trained on. RoPE rotations at these positions produce unexpected
dot products in attention — the model effectively loses coherence.
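One common intuition for why these positions are unfamiliar is that the slowest-rotating dimension pairs never complete a full turn inside the training window, so longer inputs push them to angles that never occurred in training. The sketch below counts those pairs. It assumes `head_dim=128` and `base=10000`, and the function name is hypothetical; it illustrates the geometry only and does not measure perplexity.

```python
import math

def pairs_without_full_turn(head_dim, train_len, base=10000.0):
    """Indices of dimension pairs whose angle stays below 2*pi across the training window."""
    slow = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)
        # Largest angle this pair reaches at the last training position
        if (train_len - 1) * theta < 2 * math.pi:
            slow.append(i)
    return slow

slow = pairs_without_full_turn(head_dim=128, train_len=4096)
print(len(slow), "of 64 pairs see only part of the circle during training")
print("first such pair index:", slow[0])
```

For every pair in that list, position 8000 maps to an angle larger than any the model was trained on, which is exactly the range that position interpolation compresses back into familiar territory.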

Long Context Use Cases

Medical records: full patient history in context
  A discharge summary can be 5K-20K words
  Prior visits, medications, lab results: 50K+ tokens for a complex patient
  LLMs with 128K+ context can reason over the full record

Legal documents: contract analysis
  A complex contract: 10K-100K tokens
  Full document in context → no chunking/retrieval needed

Code understanding: entire repo in context
  Small repo: 50K-200K tokens
  Future LLMs may process entire codebases in one pass

Practical Limits Beyond Context Window

Even with a 128K context window, models use information buried in the middle of the context less reliably than information near the very beginning or very end: the "lost in the middle" problem (Liu et al., 2023).
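A simple way to probe this effect on your own model is to place one known fact at different depths of a long filler context and ask about it each time. The sketch below only builds the prompts. `build_probe`, the filler sentences, and the needle are all hypothetical, and this is not the evaluation setup used by Liu et al. Sending the prompts to a model and scoring the answers is left out.

```python
def build_probe(filler, needle, depth, question):
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    sentences = filler[:index] + [needle] + filler[index:]
    return " ".join(sentences) + "\n\nQuestion: " + question

# Hypothetical filler and fact; real probes would use natural text near the context limit
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code is 7341."
question = "What is the access code?"

prompts = {d: build_probe(filler, needle, d, question) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
for depth, prompt in prompts.items():
    print(depth, prompt.index(needle))  # character offset of the fact in each prompt
```

Plotting answer accuracy against depth typically makes a middle-of-context dip easy to spot, if the model has one.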


Interview Answer

"Most open-weight production LLMs use RoPE positional encoding, which rotates Q and K vectors by position-dependent angles so the attention dot product depends on relative distance. LLaMA 2, for example, uses RoPE with base=10000 and a 4096-token context. Context extension is achieved via position interpolation (compress positions to fit the training range, then fine-tune briefly) or YaRN (frequency-aware interpolation with minimal fine-tuning). Frontier models such as GPT-4 and Claude 3 support 128K-200K contexts, though their positional schemes are undisclosed. The 'lost in the middle' problem means quality degrades for information in the middle of long contexts even when the model can technically process that length."