Transformer Architecture Q&A · Lesson 15 of 23
ALiBi: Attention with Linear Biases
Core Idea
ALiBi (Press et al., 2021) takes a strikingly simple approach: it adds a static linear penalty to attention scores based on the distance between the query and key positions. No positional embeddings are added to the token inputs at all.
Standard attention score:
score(i, j) = q_i · k_j / √dₖ
ALiBi attention score:
score(i, j) = q_i · k_j / √dₖ - m · (i - j)
where:
(i - j) = distance from key position j to query position i (always ≥ 0 under causal attention, where j ≤ i)
m = head-specific slope (fixed, not learned)
Closer tokens get a smaller penalty; distant tokens get a larger penalty. The model is biased toward attending to nearby context.
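To make the effect of the penalty concrete, here is a minimal pure-Python sketch. It assumes every content score q_i · k_j / √dₖ is 0, so that only the bias shapes the softmax. The helper name `alibi_weights` and the slope value 0.5 are illustrative choices, not taken from any ALiBi implementation.

```python
import math

def alibi_weights(i: int, m: float) -> list[float]:
    """Causal attention weights for query position i when every content
    score is 0 (an assumption made purely for illustration)."""
    # score(i, j) = 0 - m * (i - j) for the visible keys j = 0..i
    scores = [-m * (i - j) for j in range(i + 1)]
    # Numerically stable softmax over the visible keys
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = alibi_weights(i=4, m=0.5)
for j, w in enumerate(weights):
    print(f"key {j} (distance {4 - j}): weight {w:.3f}")
```

Under this assumption, each extra step of distance multiplies the attention weight by exp(-m), so the weights fall off geometrically as keys get farther from the query.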
The Slopes
Each attention head uses a different slope m, creating a geometric sequence:
For h heads, slopes m₁, ..., mₕ:
m_k = 2^(-8k/h) for k = 1, 2, ..., h
Example with 8 heads:
m₁ = 2^(-1) = 0.5
m₂ = 2^(-2) = 0.25
...
m₈ = 2^(-8) ≈ 0.0039
Head 1 (steepest slope): strongly penalises distance, so it attends locally
Head 8 (flattest slope): weakly penalises distance, so it can attend far
Different heads specialise in different distance ranges automatically.
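The slope formula can be evaluated directly. The sketch below uses a hypothetical helper, `closed_form_slopes`, that applies m_k = 2^(-8k/h) as written above. It also prints the distance at which each head's penalty reaches 1.0, which is an arbitrary reference point chosen only to compare heads.

```python
def closed_form_slopes(num_heads: int) -> list[float]:
    """Slopes m_k = 2^(-8k/h) for k = 1..h."""
    return [2.0 ** (-8.0 * k / num_heads) for k in range(1, num_heads + 1)]

slopes = closed_form_slopes(8)
for k, m in enumerate(slopes, start=1):
    # Distance at which the penalty m * d reaches 1.0 (arbitrary reference point)
    print(f"head {k}: slope {m:.6f}, penalty reaches 1.0 at distance {1 / m:.0f}")
```

With 8 heads, each slope is half the previous one, so the reference distance doubles from head to head.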
Code
import math

import torch


def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    """Return one fixed slope per head, shape (num_heads,)."""

    def get_slopes_power_of_2(n):
        # Geometric sequence whose first term and ratio are both 2^(-8/n)
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * (ratio ** i) for i in range(n)]

    if math.log2(num_heads).is_integer():
        slopes = get_slopes_power_of_2(num_heads)
    else:
        # Take the slopes for the nearest lower power of 2, then fill the
        # remaining heads with every other slope from the next power of 2
        n = 2 ** math.floor(math.log2(num_heads))
        slopes = get_slopes_power_of_2(n) + get_slopes_power_of_2(2 * n)[0::2][:num_heads - n]
    return torch.tensor(slopes)


def build_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    slopes = get_alibi_slopes(num_heads)  # (num_heads,)
    positions = torch.arange(seq_len)
    # distance matrix: dist[i, j] = i - j (query index i on rows, key index j on columns)
    dist = positions.unsqueeze(1) - positions.unsqueeze(0)  # (seq, seq)
    dist = torch.clamp(dist, min=0)  # future keys (j > i) get 0; the causal mask hides them
    # bias: (num_heads, seq, seq), negative so it penalises distance
    bias = -slopes.unsqueeze(-1).unsqueeze(-1) * dist.unsqueeze(0)
    return bias
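

# Usage in attention (alibi_bias is precomputed once with build_alibi_bias):
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
scores = scores + alibi_bias[:, :seq_len, :seq_len]
# A causal mask is still required; the bias alone does not hide future keys
causal_mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
Why ALiBi Extrapolates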
The crucial property: ALiBi adds no positional information at the input level. The bias is applied to attention scores at each layer, and the formula works for any distance.
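The claim that the formula works for any distance can be checked directly. The sketch below, in plain Python with a hypothetical helper `alibi_bias_row`, builds the bias one head adds for a query at the last position of an assumed 1024-token training window and for a query at a position only reached at inference. Nothing in the computation depends on the training length.

```python
def alibi_bias_row(i: int, m: float) -> list[float]:
    """Bias added to the scores of query position i for keys j = 0..i
    (causal attention), for one head with slope m."""
    return [-m * (i - j) for j in range(i + 1)]

m = 2.0 ** -4          # slope of one head, chosen for illustration
train_len = 1024       # assumed training context length

inside = alibi_bias_row(train_len - 1, m)       # last position seen in training
beyond = alibi_bias_row(2 * train_len - 1, m)   # position only reached at inference

# The most recent `train_len` keys get exactly the same biases in both cases:
# the bias depends on the distance i - j, not on the absolute positions.
assert beyond[-train_len:] == inside

# Keys farther back than anything seen in training continue the same line.
print(beyond[-train_len - 1], inside[0] - m)
```

The two printed values match: the bias at the first unseen distance is the training-range line extended by one more step of size m.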
Training on 1024 tokens:
Positions 0..1023 are seen
Bias values are distances 0..1023 multiplied by slopes
Inference on 2048 tokens:
New distances 1024..2047 appear — never seen during training
But the bias is just a linear function of distance
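→ The bias at an unseen distance comes from the same formula: a larger distance just means a larger penalty
→ Nearby keys receive exactly the biases they had during training, so extrapolation tends to degrade gracefully
Compare to learned absolute:
Position 1024 → no trained embedding → unreliable representation
ALiBi vs RoPE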
| Property | ALiBi | RoPE |
|----------|-------|------|
| How applied | Added to attention scores | Applied to Q/K before dot product |
| Positional info in embeddings | No (no PE at input) | No (no PE at input) |
| Extrapolation | Strong: the linear bias extends to any distance | Limited beyond training length without scaling methods such as YaRN or LongRoPE |
| Head specialisation | Built-in (different slopes) | Not built-in |
| Implementation complexity | Very simple | Moderate |
| Modern usage | MPT, BLOOM | LLaMA, Mistral, most 2023+ models |
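The "How applied" row can be illustrated with a toy comparison. The sketch below is deliberately simplified and rests on several assumptions: a single 2-dimensional query/key pair, one RoPE frequency `theta`, one ALiBi slope `m`, no √dₖ scaling, and hypothetical helper names. It shows that both schemes make the score depend only on the offset i - j. ALiBi does this by adding a term after the dot product, while RoPE rotates q and k before it.

```python
import math

def rope_rotate(vec, pos, theta):
    """Rotate a 2-D vector by the angle pos * theta (a single RoPE frequency)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, i, j, theta=0.1):
    # Position enters through q and k, before the dot product
    qr, kr = rope_rotate(q, i, theta), rope_rotate(k, j, theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

def alibi_score(q, k, i, j, m=0.25):
    # Position enters as a bias added after the dot product
    return q[0] * k[0] + q[1] * k[1] - m * (i - j)

q, k = (1.0, 0.5), (0.3, -0.2)
# Same offset (i - j = 3) at two very different absolute positions
for score in (rope_score, alibi_score):
    print(score.__name__, score(q, k, 5, 2), score(q, k, 105, 102))
```

Each function prints the same score for both position pairs, up to floating-point rounding in the RoPE case, because only the offset matters in either scheme.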
Practical Context Length Extension
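ALiBi-based models can handle contexts longer than their training length at inference, though long-context variants are usually still fine-tuned on longer sequences:
MPT-7B: pretrained on 2048 tokens with ALiBi
MPT-7B-StoryWriter-65k+: fine-tuned at a 65K context, with reported generations of up to 84K tokens
Performance degrades gracefully rather than catastrophically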
RoPE models (without YaRN/LongRoPE fine-tuning):
At inference with 2× training length: significant quality drop
Interview Answer
"ALiBi adds a static linear penalty to attention scores: score(i,j) = q·k/√dₖ - m·(i-j), where m is a fixed head-specific slope and (i-j) is the distance from key to query position. Closer tokens are penalised less; distant tokens more. Each head uses a different slope from a geometric sequence, so heads naturally specialise from local to global range. ALiBi requires no positional embeddings in the input and extrapolates gracefully to longer sequences at inference, because the linear bias formula works for any distance — including distances not seen during training."