ALiBi: Attention with Linear Biases

Core Idea

ALiBi (Press et al., 2021) takes a strikingly simple approach: it adds a static linear penalty to each attention score, proportional to the distance between the query and key positions. No positional embeddings are added to the token inputs at all.

Standard attention score:
  score(i, j) = q_i · k_j / √dₖ

ALiBi attention score:
  score(i, j) = q_i · k_j / √dₖ  -  m · (i - j)

where:
  (i - j) = distance from key position j to query position i (≥ 0 under causal masking, since j ≤ i)
  m       = head-specific slope (fixed, not learned)

Closer tokens get a smaller penalty; distant tokens get a larger penalty. The model is biased toward attending to nearby context.
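
To make the effect of the penalty concrete, here is a minimal sketch in plain Python. It assumes a single query at the last position whose content scores q·k are identical for every key, so any difference in attention weight comes from the ALiBi term alone. The helper name `alibi_weights` and the slope value 0.5 are illustrative choices, not taken from the paper.

```python
import math

def alibi_weights(raw_scores, slope):
    """Softmax over causal keys 0..i for the query at the last position i."""
    i = len(raw_scores) - 1
    # ALiBi: subtract slope * distance from each content score
    biased = [s - slope * (i - j) for j, s in enumerate(raw_scores)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Query at position 4, with identical content scores for keys 0..4 (assumed).
weights = alibi_weights([0.0] * 5, slope=0.5)
print([round(w, 3) for w in weights])
```

With equal content scores, each step closer to the query multiplies the attention weight by e^m (about 1.65 for m = 0.5), so the weights fall off exponentially with distance.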

The Slopes

Each attention head uses a different slope m, creating a geometric sequence:

For h heads, slopes m₁, ..., mₕ:

m_k = 2^(-8k/h)  for k = 1, 2, ..., h  (exact when h is a power of 2)

Example with 8 heads:
  m₁ = 2^(-1) = 0.5
  m₂ = 2^(-2) = 0.25
  ...
  m₈ = 2^(-8) ≈ 0.0039

Head 1 (steepest slope): strongly penalises distance — attends locally
Head 8 (flattest slope):  weakly penalises distance — can attend far

Different heads specialise in different distance ranges automatically.
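
As a rough illustration of how far each head can "see", the sketch below computes the slopes from the closed form and the distance at which each head's penalty reaches 5. The threshold of 5 is an arbitrary assumption chosen for illustration (at that point a key's weight is scaled by e^-5, under 1%, relative to an equally scored key at distance 0), and `alibi_slopes` is a hypothetical helper name.

```python
def alibi_slopes(num_heads):
    """Closed form m_k = 2^(-8k/h); assumes num_heads is a power of two."""
    return [2 ** (-8 * k / num_heads) for k in range(1, num_heads + 1)]

slopes = alibi_slopes(8)
for head, m in enumerate(slopes, start=1):
    # Distance d at which the penalty m * d reaches the (assumed) threshold of 5
    reach = 5 / m
    print(f"head {head}: slope={m:.5f}  penalty reaches 5 at distance {reach:.0f}")
```

Under this assumed threshold, head 1 is effectively limited to about 10 tokens of context while head 8 reaches about 1280, which is the local-to-global spread described above.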

Code

Python

import torch
import math

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * (ratio ** i) for i in range(n)]

    if math.log2(num_heads).is_integer():
        slopes = get_slopes_power_of_2(num_heads)
    else:
        # Nearest power of 2 plus interpolation for non-power-of-2 heads
        n = 2 ** math.floor(math.log2(num_heads))
        slopes = get_slopes_power_of_2(n) + get_slopes_power_of_2(2 * n)[0::2][:num_heads - n]

    return torch.tensor(slopes)

def build_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    slopes = get_alibi_slopes(num_heads)            # (num_heads,)
    positions = torch.arange(seq_len)
    # distance matrix: dist[i, j] = i - j (query row i, key column j)
    dist = positions.unsqueeze(1) - positions.unsqueeze(0)  # (seq, seq)
    dist = torch.clamp(dist, min=0)                          # only past distances; future keys (j > i) get 0

    # bias: (num_heads, seq, seq) — negative so it penalises distance
    bias = -slopes.unsqueeze(-1).unsqueeze(-1) * dist.unsqueeze(0)
    return bias

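# Usage in attention (q, k: (batch, num_heads, seq_len, d_k)):
alibi_bias = build_alibi_bias(seq_len, num_heads)   # (num_heads, seq_len, seq_len)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
scores = scores + alibi_bias                        # broadcasts over the batch dimension
# The bias is 0 for future keys, so a causal mask is still required
future = torch.ones(seq_len, seq_len).triu(1).bool()
scores = scores.masked_fill(future, float("-inf"))
attn = torch.softmax(scores, dim=-1)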
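
The non-power-of-two branch of `get_alibi_slopes` is easy to misread, so here is a dependency-free sketch of the same rule for 12 heads. The helper names `power_of_two_slopes` and `slopes_for` are hypothetical and exist only for this illustration; the logic is intended to mirror the function above rather than replace it.

```python
def power_of_two_slopes(n):
    """Slopes 2^(-8k/n) for k = 1..n, valid when n is a power of two."""
    return [2 ** (-8 * k / n) for k in range(1, n + 1)]

def slopes_for(num_heads):
    """Base slopes for the largest power of two <= num_heads, plus every
    other slope from the next power of two to fill the remaining heads."""
    base = 1
    while base * 2 <= num_heads:
        base *= 2                      # largest power of two <= num_heads
    extra = power_of_two_slopes(2 * base)[0::2][:num_heads - base]
    return power_of_two_slopes(base) + extra

slopes = slopes_for(12)
print(len(slopes))   # 12
print(slopes[7])     # 0.00390625, the flattest of the 8 base slopes (2^-8)
print(slopes[8:])    # four extra slopes: 2^-0.5, 2^-1.5, 2^-2.5, 2^-3.5
```

Note that the extra slopes are appended after the base slopes, so with a non-power-of-two head count the slopes are not sorted by head index.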

Why ALiBi Extrapolates

The crucial property: ALiBi adds no positional information at the input level. The bias is applied to attention scores at each layer, and the formula works for any distance.

Training on 1024 tokens:
  Positions 0..1023 are seen
  Bias values are distances 0..1023 multiplied by slopes

Inference on 2048 tokens:
  New distances 1024..2047 appear — never seen during training
  But the bias is just a linear function of distance
  → The same slopes apply, so distant keys are simply penalised more; no new parameters are needed
  → Each query still attends mostly to nearby context, much as it did during training

Compare to learned absolute:
  Position 1024 → no trained embedding for that index → unreliable representation
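
The point that the bias depends only on distance, not on sequence length, can be checked with a small sketch. It assumes a single head with slope 0.5 and toy lengths of 4 (standing in for the training length) and 8; `alibi_bias` is a hypothetical helper written in plain Python for illustration.

```python
NEG_INF = float("-inf")

def alibi_bias(seq_len, slope):
    """Causal ALiBi bias for one head: -slope * (i - j) for j <= i, -inf for future keys."""
    return [[slope * (j - i) if j <= i else NEG_INF for j in range(seq_len)]
            for i in range(seq_len)]

train_bias = alibi_bias(4, slope=0.5)   # "training" length (toy value)
long_bias = alibi_bias(8, slope=0.5)    # longer "inference" length (toy value)

# The training-length bias is exactly the top-left block of the longer one.
assert all(long_bias[i][:4] == train_bias[i] for i in range(4))

# Distances beyond the training length simply continue the same line.
print(long_bias[7])  # [-3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0]
```

Nothing in the bias is tied to a maximum length, which is why no table of position embeddings has to be extended or retrained.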

ALiBi vs RoPE

| Property | ALiBi | RoPE |
|----------|-------|------|
| How applied | Added to attention scores | Applied to Q/K before dot product |
| Positional info in embeddings | No (no PE at input) | No (no PE at input) |
| Extrapolation | Strong: the linear bias applies to any distance | Limited without scaling methods (e.g. YaRN, LongRoPE) |
| Head specialisation | Built-in (different slopes) | Not built-in |
| Implementation complexity | Very simple | Moderate |
| Modern usage | MPT, BLOOM | LLaMA, Mistral, most 2023+ models |
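
For contrast with ALiBi's additive bias, the following sketch shows the RoPE idea of rotating Q and K before the dot product. It is deliberately reduced to a single 2-D feature pair with one hypothetical frequency `theta=0.1`; real RoPE applies this to many pairs with a range of frequencies. The vectors and positions are made-up values for illustration.

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D (x, y) pair by angle pos * theta, as RoPE does per feature pair."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset (3) at different absolute positions gives the same score ...
s1 = dot(rotate(q, 5), rotate(k, 2))
s2 = dot(rotate(q, 105), rotate(k, 102))
# ... while a different offset (4) changes it.
s3 = dot(rotate(q, 5), rotate(k, 1))
print(abs(s1 - s2) < 1e-9, abs(s1 - s3) > 1e-3)
```

Both methods make the score depend on relative position only; ALiBi does it by subtracting a distance penalty after the dot product, RoPE by rotating the vectors that go into it.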

Practical Context Length Extension

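ALiBi-based models can run at inference on sequences longer than those seen in training, and fine-tuning on longer sequences extends the usable context further:

MPT-7B: pretrained on 2048 tokens with ALiBi
MPT-7B-StoryWriter-65k+: fine-tuned at a 65K-token context; generations up to 84K tokens were demonstrated
Beyond the training length, performance degrades gracefully rather than catastrophically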

RoPE models (without YaRN/LongRoPE fine-tuning):
At inference with 2× training length: significant quality drop

Interview Answer

"ALiBi adds a static linear penalty to attention scores: score(i,j) = q·k/√dₖ - m·(i-j), where m is a fixed head-specific slope and (i-j) is the distance from key to query position. Closer tokens are penalised less; distant tokens more. Each head uses a different slope from a geometric sequence, so heads naturally specialise from local to global range. ALiBi requires no positional embeddings in the input and extrapolates gracefully to longer sequences at inference, because the linear bias formula works for any distance — including distances not seen during training."