Learned Positional Embeddings

Learned vs Fixed Encodings

Instead of computing positional encodings with a formula, learned positional embeddings treat position like a token: each absolute position 0..max_len-1 gets its own embedding vector, trained jointly with the rest of the model:

Token embedding:    E_token  ∈ ℝ^(vocab_size × d_model)  — one row per token
Position embedding: E_pos    ∈ ℝ^(max_len × d_model)     — one row per position

Input to encoder: E_token[token_id] + E_pos[position]

Both embedding tables are initialised randomly and updated via backpropagation on the pretraining objective.

How BERT Uses Learned Positions

BERT uses three additive embeddings:

Input representation = Token embedding
                     + Segment embedding   (sentence A vs sentence B)
                     + Position embedding  (learned, positions 0..511)

max_len = 512 for BERT-base
Position embedding table: 512 × 768 = 393,216 parameters

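For an input of N ≤ 512 tokens, only the first N position embeddings are used. For a longer input, the table has no rows for positions 512 and above, so the model cannot represent those positions and cannot extrapolate. This holds during fine-tuning and inference alike.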
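The three-way sum above can be sketched in PyTorch as follows. This is a minimal illustration, not BERT's actual implementation: the class name `BertStyleEmbedding` and the small `vocab_size` are assumptions made for the example, and the LayerNorm and dropout that BERT applies after the sum are omitted.

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Hypothetical sketch: token + segment + learned position embeddings."""

    def __init__(self, vocab_size, max_len=512, d_model=768, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.segment_emb = nn.Embedding(n_segments, d_model)  # sentence A / B
        self.pos_emb = nn.Embedding(max_len, d_model)          # positions 0..max_len-1

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len), with seq_len <= max_len
        seq_len = token_ids.size(1)
        # Only the first seq_len rows of the position table are looked up.
        positions = torch.arange(seq_len, device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.pos_emb(positions))  # (seq_len, d_model) broadcasts over batch

# Usage with a toy vocabulary (assumed size, chosen to keep the example small)
emb = BertStyleEmbedding(vocab_size=100)
token_ids = torch.randint(0, 100, (2, 10))
segment_ids = torch.zeros(2, 10, dtype=torch.long)
out = emb(token_ids, segment_ids)
print(out.shape)                    # torch.Size([2, 10, 768])
print(emb.pos_emb.weight.numel())   # 393216
```

With a 10-token input, only rows 0..9 of the position table contribute to the output; the remaining rows exist but are not used for that batch.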

How GPT-2 Uses Learned Positions

GPT-2 similarly uses learned absolute position embeddings:

Python

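import torch
import torch.nn as nn

class GPT2Embedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # one row per token
        self.pos_emb   = nn.Embedding(max_len, d_model)     # one row per position

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), with seq_len <= max_len
        # positions: (seq_len,) holding 0, 1, ..., seq_len - 1
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # pos_emb(positions) is (seq_len, d_model) and broadcasts over the batch
        return self.token_emb(token_ids) + self.pos_emb(positions)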

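GPT-2 uses max_len = 1024 and GPT-3 uses max_len = 2048. Positions beyond these limits have no trained embedding: a lookup past the end of the table fails outright, and if the table is simply enlarged, the new rows are untrained and produce unreliable output.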
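As a small illustration of the length limit, the snippet below looks up one position too many in a learned table. The sizes are tiny placeholder values, not GPT-2's, and the behaviour shown is for tensors on the CPU, where an out-of-range index in `nn.Embedding` raises an `IndexError`.

```python
import torch
import torch.nn as nn

max_len, d_model = 8, 4                 # tiny illustrative sizes, not GPT-2's
pos_emb = nn.Embedding(max_len, d_model)

in_range = torch.arange(max_len)        # positions 0..7: every one has a row
print(pos_emb(in_range).shape)          # torch.Size([8, 4])

too_long = torch.arange(max_len + 1)    # position 8 has no row in the table
try:
    pos_emb(too_long)
    failed = False
except IndexError as err:
    failed = True
    print("lookup failed:", err)
```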

Why Learned Embeddings Win In-Distribution

Within the trained length, learned embeddings often match or slightly outperform sinusoidal encodings, for reasons such as:

Task adaptation — the model can learn position representations that are optimal for the pretraining objective on the actual data distribution
Non-uniform importance — positions close to the beginning or end may carry systematically different information; the model can represent this
Interaction with token content — position and token embeddings are trained jointly, so the model can learn co-adapted representations

The gain is usually modest and is not guaranteed: the original Transformer paper reported nearly identical results for its learned and sinusoidal variants.

The Extrapolation Problem

Training:    positions 0..511 (BERT) or 0..1023 (GPT-2)
Inference:   position 512+ → no trained row in the table → lookup fails, or output is unpredictable if untrained rows are added

Approaches to handle it:
  1. Truncate input to max_len (most common, but loses information)
  2. Interpolate existing embeddings (hacky, quality degrades)
  3. Fine-tune on longer sequences (expensive, changes representation)
  4. Switch to RoPE or ALiBi (architectural fix: no fixed position table, though RoPE typically still needs scaling or fine-tuning to extrapolate far)

This extrapolation failure is a primary motivation for rotary and relative positional encodings.
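Approaches 1 and 2 from the list above can be sketched as below. This is an illustration under assumptions: the helper names `truncate` and `interpolate_positions` are hypothetical, a random tensor stands in for a trained position table, and linear interpolation is only one possible way to stretch the table.

```python
import torch
import torch.nn.functional as F

def truncate(token_ids: torch.Tensor, max_len: int) -> torch.Tensor:
    """Approach 1: keep only the first max_len tokens of each sequence."""
    return token_ids[:, :max_len]

def interpolate_positions(pos_table: torch.Tensor, new_len: int) -> torch.Tensor:
    """Approach 2: stretch a (old_len, d_model) table to (new_len, d_model)."""
    # F.interpolate expects (batch, channels, length), so d_model acts as channels.
    x = pos_table.T.unsqueeze(0)                       # (1, d_model, old_len)
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=True)
    return x.squeeze(0).T.contiguous()                 # (new_len, d_model)

# Usage with stand-in data
token_ids = torch.randint(0, 100, (2, 700))
print(truncate(token_ids, 512).shape)                  # torch.Size([2, 512])

old_table = torch.randn(512, 16)                       # stand-in for trained weights
new_table = interpolate_positions(old_table, 1024)
print(new_table.shape)                                 # torch.Size([1024, 16])
```

With `align_corners=True` the first and last rows of the stretched table equal those of the original, and the rows in between are blends of neighbouring trained rows, which is why quality tends to degrade without further fine-tuning.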

Comparing Position Encoding Approaches

| Property | Sinusoidal | Learned Absolute | RoPE | ALiBi |
|----------|-----------|-----------------|------|-------|
| Parameters | 0 | max_len × d_model | 0 | 0 |
| Extrapolation | Theory yes, practice limited | No | Better | Yes |
| In-distribution perf | Slightly worse | Best among absolute | Best overall | Good |
| Complexity | O(1) | O(1) | O(seq) | O(seq) |
| Used in | Original Transformer | BERT, GPT-2 | LLaMA, Mistral | MPT, BLOOM |

Fine-Tuning on Longer Contexts

A common technique: pretrain with max_len=2048, then fine-tune on longer sequences up to 8192 or 32768:

Step 1: Pretrain GPT on max_len=2048 (learned pos embeddings 0..2047)
Step 2: Extend position table to 8192: initialise new positions 2048..8191
          by copying/interpolating existing ones
Step 3: Continue training on long-context data
Result: Model gains some long-context ability, though imperfectly

Context-extension fine-tuning of this kind is one way a model can end up with a context window longer than its initial pretraining length.
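Step 2 of the recipe above might look like the following sketch. The function name `extend_position_table` is hypothetical, the initialisation shown is the copying variant (each new position starts from an existing trained row), and the result still needs the continued training of Step 3 before it is useful.

```python
import torch
import torch.nn as nn

def extend_position_table(old_emb: nn.Embedding, new_len: int) -> nn.Embedding:
    """Build a larger position table whose rows are copied from a trained one."""
    old_len, d_model = old_emb.weight.shape
    assert new_len >= old_len, "new table must be at least as long as the old one"
    new_emb = nn.Embedding(new_len, d_model)
    with torch.no_grad():
        # New position p starts from trained position p % old_len,
        # so positions 0..old_len-1 are preserved exactly.
        idx = torch.arange(new_len) % old_len
        new_emb.weight.copy_(old_emb.weight[idx])
    return new_emb

# Usage: a small d_model keeps the example light; 2048 -> 8192 as in the steps
old_emb = nn.Embedding(2048, 8)         # stand-in for a pretrained table
new_emb = extend_position_table(old_emb, 8192)
print(new_emb.weight.shape)             # torch.Size([8192, 8])
```

The extended table remains a trainable parameter, so continued training on long-context data can move the copied rows away from their starting values.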

Interview Answer

"Learned positional embeddings treat position like a vocabulary token: each position from 0 to max_len - 1 gets a row in a learned table, trained jointly with the rest of the model. BERT uses max_len=512, GPT-2 uses 1024. In-distribution they match or slightly outperform sinusoidal encodings because training can shape the position representations. The key limitation is extrapolation: positions beyond the table were never trained, so the model fails on longer sequences. This drove the move toward relative positional encodings like RoPE, which have no fixed position table."