Learned Positional Embeddings
How BERT and GPT-2 learn position embeddings from data, the trade-offs vs sinusoidal encodings, and why learned embeddings dominate in practice despite their length limitation.
Learned vs Fixed Encodings
Instead of computing positional encodings with a formula, learned positional embeddings treat position like a token. Each absolute position 0..max_len-1 gets its own embedding row, trained jointly with the rest of the model:
Token embedding: E_token ∈ ℝ^(vocab_size × d_model) — one row per token
Position embedding: E_pos ∈ ℝ^(max_len × d_model) — one row per position
Input to encoder: E_token[token_id] + E_pos[position]

Both embedding tables are initialised randomly and updated via backpropagation on the pretraining objective.
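To make the lookup-and-add concrete, here is a minimal dependency-free sketch. The toy sizes, the 0.02 init scale, and the helper names `make_table` and `embed` are illustrative assumptions, not taken from any particular model.

```python
import random

def make_table(rows, d_model, rng):
    """Random init, standing in for a trainable embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rows)]

def embed(token_ids, e_token, e_pos):
    """Return one vector per input position: E_token[token_id] + E_pos[position]."""
    return [
        [t + p for t, p in zip(e_token[tok], e_pos[pos])]
        for pos, tok in enumerate(token_ids)
    ]

rng = random.Random(0)
vocab_size, max_len, d_model = 10, 8, 4   # toy sizes, chosen for illustration
e_token = make_table(vocab_size, d_model, rng)
e_pos = make_table(max_len, d_model, rng)

# Token 3 appears at positions 0 and 2. Its two input vectors share the same
# token row and differ only in the position row that was added.
vectors = embed([3, 7, 3], e_token, e_pos)
print(len(vectors), len(vectors[0]))
print(vectors[0] != vectors[2])
```

In a real model both tables would be trainable parameters; here they are frozen random lists, which is enough to show that the same token produces a different input vector at each position.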
How BERT Uses Learned Positions
BERT uses three additive embeddings:
Input representation = Token embedding
+ Segment embedding (sentence A vs sentence B)
+ Position embedding (learned, positions 0..511)
max_len = 512 for BERT-base
Position embedding table: 512 × 768 = 393,216 parameters

During fine-tuning, an input of N < 512 tokens uses only the first N position embeddings. A longer input has no embedding rows for positions 512 and above, so the model cannot represent them (it cannot extrapolate).
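The parameter count above is simple arithmetic, and it helps to see it next to the other two tables. In this sketch `d_model` and `max_len` come from the text; the vocabulary size of 30,522 is an assumption (the WordPiece vocabulary of the uncased BERT-base checkpoint).

```python
# BERT-base embedding table sizes.
d_model = 768
max_len = 512
vocab_size = 30522      # assumption: bert-base-uncased WordPiece vocabulary
num_segments = 2        # sentence A vs sentence B

position_params = max_len * d_model
token_params = vocab_size * d_model
segment_params = num_segments * d_model

print(f"position table: {position_params:,} parameters")
print(f"token table:    {token_params:,} parameters")
print(f"segment table:  {segment_params:,} parameters")
print(f"position / token ratio: {position_params / token_params:.3f}")
```

Under these assumptions the position table is a small fraction of the token table, so the cost of learned positions is the fixed length, not the parameter count.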
How GPT-2 Uses Learned Positions
GPT-2 similarly uses learned absolute position embeddings:
import torch
import torch.nn as nn

class GPT2Embedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # one row per token
        self.pos_emb = nn.Embedding(max_len, d_model)       # one row per position

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # (batch, seq_len, d_model) + (seq_len, d_model) broadcasts over the batch
        return self.token_emb(token_ids) + self.pos_emb(positions)

GPT-2 uses max_len = 1024 and GPT-3 uses 2048. Positions beyond these have no trained row in the table: an out-of-range index fails the embedding lookup outright, and rows added without further training produce unreliable output.
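Because the position table has a hard upper bound, it can be useful to check sequence length before the lookup and fail with a clear message. The helper below is a hypothetical, dependency-free sketch (the name `position_ids` is not part of GPT-2 or PyTorch).

```python
def position_ids(seq_len, max_len):
    """Return positions 0..seq_len-1, refusing sequences the table cannot cover."""
    if seq_len > max_len:
        raise ValueError(
            f"sequence length {seq_len} exceeds max_len {max_len}: "
            "no position embedding exists for the extra positions"
        )
    return list(range(seq_len))

print(position_ids(4, max_len=1024))

try:
    position_ids(1025, max_len=1024)
except ValueError as err:
    print("rejected:", err)
```

A guard like this turns a confusing index error deep inside the model into an explicit message at the input boundary.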
Why Learned Embeddings Win In-Distribution
Learned embeddings outperform sinusoidal in practice because:
- Task adaptation — the model can learn position representations that are optimal for the pretraining objective on the actual data distribution
- Non-uniform importance — positions close to the beginning or end may carry systematically different information; the model can represent this
- Interaction with token content — position and token embeddings are trained jointly, so the model can learn co-adapted representations
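The gain is usually modest and depends on the task and training setup. The original Transformer paper, for instance, reported nearly identical results for learned and sinusoidal encodings.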
The Extrapolation Problem
Training: positions 0..511 (BERT) or 0..1023 (GPT-2)
Inference: position 512+ → unknown embedding → unpredictable output
Approaches to handle it:
1. Truncate input to max_len (most common, but loses information)
2. Interpolate existing embeddings (hacky, quality degrades)
3. Fine-tune on longer sequences (expensive, changes representation)
4. Switch to RoPE or ALiBi (architectural fix: no fixed-size position table, better extrapolation)

This extrapolation failure is a primary motivation for rotary and relative positional encodings.
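Options 1 and 2 from the list above can be sketched in a few lines. This is a minimal illustration on plain Python lists; the function names are hypothetical, and linear resampling is only one possible way to interpolate a position table.

```python
def truncate(token_ids, max_len):
    """Option 1: keep only the first max_len tokens."""
    return token_ids[:max_len]

def interpolate_table(table, new_len):
    """Option 2: linearly resample a position table (list of rows) to new_len rows."""
    old_len = len(table)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two rows on both sides")
    out = []
    for i in range(new_len):
        # Map new position i onto the old 0..old_len-1 axis.
        x = i * (old_len - 1) / (new_len - 1)
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        w = x - lo
        out.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return out

toy_table = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]   # 3 positions, d_model = 2
stretched = interpolate_table(toy_table, 5)
print(stretched)
print(truncate(list(range(10)), 4))
```

Interpolation keeps the first and last rows fixed and squeezes the new positions in between, so every position after the first now maps to a slightly different vector than the one the model was trained on. That shift is one reason quality degrades without further fine-tuning.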
Comparing Position Encoding Approaches
| Property | Sinusoidal | Learned Absolute | RoPE | ALiBi |
|----------|-----------|-----------------|------|-------|
| Parameters | 0 | max_len × d_model | 0 | 0 |
| Extrapolation | Theory yes, practice limited | No | Better | Yes |
| In-distribution perf | Slightly worse | Best among absolute | Best overall | Good |
| Complexity | O(1) | O(1) | O(seq) | O(seq) |
| Used in | Original Transformer | BERT, GPT-2 | LLaMA, Mistral | MPT, BLOOM |
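The "Parameters: 0" entry for sinusoidal encodings follows from the fact that each vector is computed from a formula rather than looked up. A minimal sketch of the formula from the original Transformer (sine on even dimensions, cosine on odd ones), written in plain Python for illustration:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed encoding for one position: no table, no trained parameters."""
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index 2k, so the exponent is 2k / d_model.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]

print(sinusoidal_encoding(0, 4))
# Any position can be computed, including ones far beyond a training length.
print(len(sinusoidal_encoding(5000, 8)))
```

Being able to compute a vector for position 5000 does not mean the model handles it well, which is the "theory yes, practice limited" entry in the table.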
Fine-Tuning on Longer Contexts
A common technique: pretrain with max_len=2048, then fine-tune on longer sequences up to 8192 or 32768:
Step 1: Pretrain GPT on max_len=2048 (learned pos embeddings 0..2047)
Step 2: Extend position table to 8192: initialise new positions 2048..8191
by copying/interpolating existing ones
Step 3: Continue training on long-context data
Result: Model gains some long-context ability, though imperfectly

This kind of context extension is one way a model can end up with a context window far longer than its initial pretraining length.
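Step 2 leaves the initialisation of the new rows open. One simple choice, sketched here on plain lists, is to keep the trained rows untouched and fill the new positions by cycling through the existing ones. The function name and the cyclic-copy scheme are assumptions for illustration, not a description of how any specific model was extended.

```python
def extend_by_copying(table, new_len):
    """Keep trained rows as-is; fill new rows by cycling through the old ones."""
    old_len = len(table)
    if new_len < old_len:
        raise ValueError("new_len must be at least the current table length")
    # Copy each row so later updates to new rows do not alias the old ones.
    return [list(table[i % old_len]) for i in range(new_len)]

trained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # toy table, 4 positions
extended = extend_by_copying(trained, 10)

print(len(extended))
print(extended[:4] == trained)      # original positions are unchanged
print(extended[4] == trained[0])    # position 4 starts as a copy of position 0
```

Copied rows are only a starting point: until Step 3 trains on long sequences, position 4 is indistinguishable from position 0, so the model cannot yet tell those positions apart.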
Interview Answer
"Learned positional embeddings treat position like a vocabulary token: each position from 0 to max_len - 1 gets a row in a learned table, trained jointly with the rest of the model. BERT uses max_len=512, GPT-2 uses 1024. They can outperform sinusoidal encodings in-distribution because training can shape the position representations. The key limitation is extrapolation: positions beyond max_len are unseen during training, so the model fails on longer sequences. This drove the move toward relative positional encodings like RoPE, which have no fixed-size position table."