Softmax and Temperature in Attention
How softmax converts attention scores to weights, what temperature does to the distribution, and how sharp vs flat attention affects model behaviour.
Softmax in Attention
Softmax converts raw attention scores (any real numbers) into a probability distribution that sums to 1:
softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
Input: scores = [2.0, 1.0, 0.1, -1.0]
Output: weights = [0.65, 0.24, 0.09, 0.02] (sum = 1.0)The highest score gets the most weight; negative scores still get non-zero weight (exp makes everything positive). No attention weight is ever exactly 0 — every position contributes something, though very negative scores contribute near nothing.
Temperature Scaling
Temperature T modifies the scores before softmax:
softmax(z / T)
T = 1.0 (default, no scaling)
T < 1.0 → sharper distribution (winner-takes-more)
T > 1.0 → flatter distribution (more uniform)import torch
import torch.nn.functional as F
scores = torch.tensor([2.0, 1.0, 0.1, -1.0])
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
weights = F.softmax(scores / T, dim=0)
print(f"T={T}: {weights.numpy().round(3)}")
# T=0.1: [0.999, 0.001, 0.000, 0.000] ← near one-hot
# T=0.5: [0.924, 0.071, 0.005, 0.000]
# T=1.0: [0.651, 0.239, 0.088, 0.022] ← standard
# T=2.0: [0.469, 0.285, 0.177, 0.069]
# T=10.0: [0.279, 0.263, 0.244, 0.214] ← near uniformSharp vs Flat Attention
Sharp attention (low T / high scores):
- The model focuses intensely on one or few positions
- Useful for tasks requiring precise token lookup
- Risk: ignores potentially relevant context
Flat attention (high T / low scores):
- The model blends information from many positions roughly equally
- More robust to small score differences
- Risk: dilutes the important signal
In transformers trained with gradient descent, the model learns when to be sharp and when to be flat through the Q, K weight matrices. Temperature is usually fixed at T=1 during training (equivalent to the √dₖ scaling).
The Scaling Factor √dₖ Is Temperature
The division by √dₖ in attention is exactly temperature scaling with T=√dₖ:
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
↑
T = √dₖ (temperature)Without this, larger dₖ causes larger dot products → sharper attention → gradients vanish. Setting T=√dₖ normalises the variance of the dot products, keeping the softmax in a gradient-friendly regime.
Softmax in Generation vs Attention
Note: temperature in generation (sampling next token) is a different use of the same concept:
Generation: softmax(logits / T) — controls diversity of sampled tokens
Attention: softmax(Q·Kᵀ / √dₖ) — scaled by a fixed √dₖ, not a tunable TIn generation, T is a hyperparameter you tune. In attention, √dₖ is a fixed architectural choice. Both exploit the same property of softmax.
Attention Entropy
The entropy of the attention distribution measures how spread out it is:
H(A) = -Σⱼ Aᵢⱼ · log(Aᵢⱼ)
Low entropy (near 0): sharp — model attends to one position
High entropy (near log n): flat — model attends uniformly to n positionsResearchers have observed that attention entropy increases during training as models learn to focus. Different layers develop different entropy profiles — early layers often have higher entropy, later layers lower.
Interview Answer
"Softmax converts attention scores to a probability distribution summing to 1, allowing us to compute a weighted average of value vectors. Temperature T scales scores before softmax: lower T sharpens the distribution (near one-hot), higher T flattens it (near uniform). In standard attention, the √dₖ factor plays the role of temperature — it prevents the dot products from growing too large, which would saturate the softmax and kill gradients. The model learns appropriate attention sharpness for each head through training."
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.