Learnixo

Transformer Architecture Q&A · Lesson 4 of 23

Softmax and Temperature in Attention

Softmax in Attention

Softmax converts raw attention scores (any real numbers) into a probability distribution that sums to 1:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

Input:  scores = [2.0, 1.0, 0.1, -1.0]
Output: weights = [0.65, 0.24, 0.09, 0.02]  (sum = 1.0)

The highest score gets the most weight; negative scores still get non-zero weight (exp makes everything positive). No attention weight is ever exactly 0 — every position contributes something, though very negative scores contribute near nothing.


Temperature Scaling

Temperature T modifies the scores before softmax:

softmax(z / T)

T = 1.0  (default, no scaling)
T < 1.0  → sharper distribution (winner-takes-more)
T > 1.0  → flatter distribution (more uniform)
Python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1, -1.0])

for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    weights = F.softmax(scores / T, dim=0)
    print(f"T={T}: {weights.numpy().round(3)}")

# T=0.1:  [0.999, 0.001, 0.000, 0.000]   near one-hot
# T=0.5:  [0.924, 0.071, 0.005, 0.000]
# T=1.0:  [0.651, 0.239, 0.088, 0.022]   standard
# T=2.0:  [0.469, 0.285, 0.177, 0.069]
# T=10.0: [0.279, 0.263, 0.244, 0.214]   near uniform

Sharp vs Flat Attention

Sharp attention (low T / high scores):

  • The model focuses intensely on one or few positions
  • Useful for tasks requiring precise token lookup
  • Risk: ignores potentially relevant context

Flat attention (high T / low scores):

  • The model blends information from many positions roughly equally
  • More robust to small score differences
  • Risk: dilutes the important signal

In transformers trained with gradient descent, the model learns when to be sharp and when to be flat through the Q, K weight matrices. Temperature is usually fixed at T=1 during training (equivalent to the √dₖ scaling).


The Scaling Factor √dₖ Is Temperature

The division by √dₖ in attention is exactly temperature scaling with T=√dₖ:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
                              ↑
                    T = √dₖ (temperature)

Without this, larger dₖ causes larger dot products → sharper attention → gradients vanish. Setting T=√dₖ normalises the variance of the dot products, keeping the softmax in a gradient-friendly regime.


Softmax in Generation vs Attention

Note: temperature in generation (sampling next token) is a different use of the same concept:

Generation:   softmax(logits / T)  — controls diversity of sampled tokens
Attention:    softmax(Q·Kᵀ / √dₖ) — scaled by a fixed √dₖ, not a tunable T

In generation, T is a hyperparameter you tune. In attention, √dₖ is a fixed architectural choice. Both exploit the same property of softmax.


Attention Entropy

The entropy of the attention distribution measures how spread out it is:

H(A) = -Σⱼ Aᵢⱼ · log(Aᵢⱼ)

Low entropy (near 0):   sharp — model attends to one position
High entropy (near log n): flat — model attends uniformly to n positions

Researchers have observed that attention entropy increases during training as models learn to focus. Different layers develop different entropy profiles — early layers often have higher entropy, later layers lower.


Interview Answer

"Softmax converts attention scores to a probability distribution summing to 1, allowing us to compute a weighted average of value vectors. Temperature T scales scores before softmax: lower T sharpens the distribution (near one-hot), higher T flattens it (near uniform). In standard attention, the √dₖ factor plays the role of temperature — it prevents the dot products from growing too large, which would saturate the softmax and kill gradients. The model learns appropriate attention sharpness for each head through training."