Query, Key, and Value Matrices

Where Q, K, V Come From

Every attention head has three learned weight matrices: W^Q, W^K, W^V. Given an input sequence X (shape: seq_len × d_model), Q, K, and V are computed as linear projections:

Q = X · W^Q    (shape: seq_len × dₖ)
K = X · W^K    (shape: seq_len × dₖ)
V = X · W^V    (shape: seq_len × dᵥ)

where:
  dₖ = dᵥ = d_model / num_heads  (in standard transformers)
  W^Q, W^K ∈ ℝ^(d_model × dₖ)
  W^V ∈ ℝ^(d_model × dᵥ)
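To make the shapes concrete, here is a small arithmetic sketch. The sizes d_model = 512 and num_heads = 8 are assumed purely for illustration, and the helper name head_dims is hypothetical.

```python
def head_dims(d_model: int, num_heads: int) -> dict:
    """Per-head projection sizes, assuming d_k = d_v = d_model / num_heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    d_v = d_k
    # Query and key matrices are (d_model x d_k); the value matrix is
    # (d_model x d_v). Biases are assumed absent.
    params_per_head = 2 * d_model * d_k + d_model * d_v
    return {
        "d_k": d_k,
        "d_v": d_v,
        "params_per_head": params_per_head,
        "params_all_heads": num_heads * params_per_head,
    }


dims = head_dims(d_model=512, num_heads=8)
print(dims)
```

Under these assumed sizes, the projections across all heads total 3 × d_model² parameters, the same as three full d_model × d_model matrices.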

In self-attention, Q, K, and V are all derived from the same input X. In cross-attention, Q is projected from the decoder's hidden states, while K and V are projected from the encoder's output.

Conceptual Roles

Query (Q): What this position is looking for. Each query vector represents the "search question" for one output position.

Key (K): What each position offers or represents. Each key vector is a "label" that the query compares against.

Value (V): The content to be retrieved. Once the attention weights are computed (a softmax over the scaled Q·Kᵀ scores), the values are averaged according to those weights.

Database analogy (approximate):
  Query → search query ("find documents about rivers")
  Key   → document index / tags ("this document is about rivers and floods")
  Value → document content (the actual text to retrieve)

Unlike a database, everything is soft/continuous — no exact match required.
A query vector that's similar (in dot-product sense) to a key vector
gets a high weight for that key's corresponding value.
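The soft lookup described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy keys, values, and query are made up, the function names are hypothetical, and the scores are divided by √dₖ as in standard scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_lookup(query, keys, values):
    """Blend values by the softmax of scaled query-key dot products."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy data (made up): the query points mostly along the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
query = [4.0, 1.0]

weights, output = soft_lookup(query, keys, values)
print(weights, output)
```

Every key receives a nonzero weight, so the output is a weighted average of all values rather than a single retrieved record. The best-matching key simply dominates the average.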

Why Three Separate Matrices?

Using three separate projections (not one) gives the model flexibility:

Q and K can encode different aspects of the same token. A word can "ask about" different properties than what it "offers" to others.
V is independent of the matching. The value representation can be richer or different from the key used for matching.
Separate parameters let the model learn a distinct role for each projection of the same token embedding.

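A simpler design (Q = K = V = X) would force the model to use the same representation for searching, indexing, and content retrieval, which is too constrained. For example, the score matrix X·Xᵀ would be symmetric, so token i would score token j exactly as j scores i.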
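One concrete consequence of sharing a single representation is that queries and keys coincide, so the raw score matrix is forced to be symmetric. The sketch below checks this on toy data. The embeddings and the two projection matrices are made-up values, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a_row[i] * b[i][j] for i in range(len(b))) for j in range(len(b[0]))]
        for a_row in a
    ]


def scores(q, k):
    """Return q · kᵀ: one row of scores per query, one column per key."""
    return [[sum(x * y for x, y in zip(q_row, k_row)) for k_row in k] for q_row in q]


def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


# Toy token embeddings (made up): 3 tokens, d_model = 2.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

# Shared representation: queries and keys are both X.
shared_scores = scores(X, X)

# Separate projections (hypothetical weights): Q = X·W_q, K = X·W_k.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [2.0, 0.0]]
separate_scores = scores(matmul(X, W_q), matmul(X, W_k))

print(is_symmetric(shared_scores), is_symmetric(separate_scores))
```

With separate projections, token i can attend strongly to token j without j having to attend equally strongly to i.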

Dimensions: Why Scale by √dₖ?

Without scaling, when dₖ is large, the dot products Q·Kᵀ grow large in magnitude. This pushes the softmax into saturated regions, where nearly all the weight falls on one key and the gradients are very small, which slows training.

Variance argument:
  If the components of q and k are independent random variables with zero mean and unit variance,
  the dot product q·k = Σᵢ qᵢkᵢ has mean 0 and variance dₖ (a sum of dₖ independent terms, each with variance 1).
  Dividing by √dₖ restores the variance to 1, keeping softmax inputs in a well-behaved range.
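The variance argument can be checked empirically. The sketch below assumes standard-normal components for q and k (one distribution that satisfies the zero-mean, unit-variance assumption). The function name, sample count, and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics


def dot_product_variances(d_k: int, num_samples: int = 5000, seed: int = 0):
    """Estimate Var(q·k) and Var(q·k / sqrt(d_k)) for standard-normal q and k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)


raw_var, scaled_var = dot_product_variances(d_k=64)
print(f"unscaled variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

The unscaled estimate should land near dₖ and the scaled estimate near 1, up to sampling noise.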

Code: Manual Q, K, V Projection

Python

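from typing import Optional

import torch
import torch.nn as nn


class QKVProjection(nn.Module):
    """Project the input into queries, keys, and values for one attention head."""

    def __init__(self, d_model: int, d_k: int, d_v: Optional[int] = None):
        super().__init__()
        d_v = d_k if d_v is None else d_v  # standard transformers use d_v == d_k
        # bias=False: plain matrix products, matching the projection formulas
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)   # (batch, seq_len, d_k)
        K = self.W_k(x)   # (batch, seq_len, d_k)
        V = self.W_v(x)   # (batch, seq_len, d_v)
        return Q, K, V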
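To show how the projected tensors are consumed, here is a minimal sketch of single-head scaled dot-product attention. It is self-contained, so it builds its own bias-free projections rather than reusing the class above. The tensor sizes are illustrative assumptions, and masking and dropout are omitted.

```python
import math

import torch


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q·Kᵀ / sqrt(d_k))·V for tensors shaped (batch, seq_len, d)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights                        # (batch, seq_q, d_v)


torch.manual_seed(0)
batch, seq_len, d_model, d_k = 2, 5, 16, 4  # illustrative sizes
x = torch.randn(batch, seq_len, d_model)

# Bias-free projections for a single head.
W_q = torch.nn.Linear(d_model, d_k, bias=False)
W_k = torch.nn.Linear(d_model, d_k, bias=False)
W_v = torch.nn.Linear(d_model, d_k, bias=False)

out, weights = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))
print(out.shape, weights.shape)
```

Each output position is a weighted average of the value vectors, with one row of weights per query position.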

Interview Answer

"Q, K, V are linear projections of the input. Query represents what each position is looking for; Key represents what each position can offer for comparison; Value contains the actual content to aggregate. The attention score is computed by comparing Q against K (dot product, scaled by √dₖ), then softmaxed to get weights, then multiplied by V to produce the output. The three separate matrices give the model flexibility — the representation used for searching (Q/K) can differ from the representation used for aggregation (V)."