Transformer Architecture Q&A · Lesson 4 of 23
Softmax and Temperature in Attention
Softmax in Attention
Softmax converts raw attention scores (any real numbers) into a probability distribution that sums to 1:
softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
Input: scores = [2.0, 1.0, 0.1, -1.0]
Output: weights = [0.65, 0.24, 0.09, 0.02] (sum = 1.0)The highest score gets the most weight; negative scores still get non-zero weight (exp makes everything positive). No attention weight is ever exactly 0 — every position contributes something, though very negative scores contribute near nothing.
Temperature Scaling
Temperature T modifies the scores before softmax:
softmax(z / T)
T = 1.0 (default, no scaling)
T < 1.0 → sharper distribution (winner-takes-more)
T > 1.0 → flatter distribution (more uniform)import torch
import torch.nn.functional as F
scores = torch.tensor([2.0, 1.0, 0.1, -1.0])
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
weights = F.softmax(scores / T, dim=0)
print(f"T={T}: {weights.numpy().round(3)}")
# T=0.1: [0.999, 0.001, 0.000, 0.000] ← near one-hot
# T=0.5: [0.924, 0.071, 0.005, 0.000]
# T=1.0: [0.651, 0.239, 0.088, 0.022] ← standard
# T=2.0: [0.469, 0.285, 0.177, 0.069]
# T=10.0: [0.279, 0.263, 0.244, 0.214] ← near uniformSharp vs Flat Attention
Sharp attention (low T / high scores):
- The model focuses intensely on one or few positions
- Useful for tasks requiring precise token lookup
- Risk: ignores potentially relevant context
Flat attention (high T / low scores):
- The model blends information from many positions roughly equally
- More robust to small score differences
- Risk: dilutes the important signal
In transformers trained with gradient descent, the model learns when to be sharp and when to be flat through the Q, K weight matrices. Temperature is usually fixed at T=1 during training (equivalent to the √dₖ scaling).
The Scaling Factor √dₖ Is Temperature
The division by √dₖ in attention is exactly temperature scaling with T=√dₖ:
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
↑
T = √dₖ (temperature)Without this, larger dₖ causes larger dot products → sharper attention → gradients vanish. Setting T=√dₖ normalises the variance of the dot products, keeping the softmax in a gradient-friendly regime.
Softmax in Generation vs Attention
Note: temperature in generation (sampling next token) is a different use of the same concept:
Generation: softmax(logits / T) — controls diversity of sampled tokens
Attention: softmax(Q·Kᵀ / √dₖ) — scaled by a fixed √dₖ, not a tunable TIn generation, T is a hyperparameter you tune. In attention, √dₖ is a fixed architectural choice. Both exploit the same property of softmax.
Attention Entropy
The entropy of the attention distribution measures how spread out it is:
H(A) = -Σⱼ Aᵢⱼ · log(Aᵢⱼ)
Low entropy (near 0): sharp — model attends to one position
High entropy (near log n): flat — model attends uniformly to n positionsResearchers have observed that attention entropy increases during training as models learn to focus. Different layers develop different entropy profiles — early layers often have higher entropy, later layers lower.
Interview Answer
"Softmax converts attention scores to a probability distribution summing to 1, allowing us to compute a weighted average of value vectors. Temperature T scales scores before softmax: lower T sharpens the distribution (near one-hot), higher T flattens it (near uniform). In standard attention, the √dₖ factor plays the role of temperature — it prevents the dot products from growing too large, which would saturate the softmax and kill gradients. The model learns appropriate attention sharpness for each head through training."