
Softmax and Temperature in Attention

How softmax converts attention scores to weights, what temperature does to the distribution, and how sharp vs flat attention affects model behaviour.

Asma Hafeez Khan | May 16, 2026 | 3 min read

Softmax in Attention

Softmax converts raw attention scores (any real numbers) into a probability distribution that sums to 1:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

Input:  scores = [2.0, 1.0, 0.1, -1.0]
Output: weights = [0.64, 0.23, 0.10, 0.03]  (sum = 1.0)

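The highest score gets the most weight, and negative scores still get non-zero weight because exp maps every real number to a positive value. Mathematically, no attention weight is ever exactly 0: every position contributes something, though very negative scores contribute almost nothing. The exception is masked positions, whose scores are set to −∞ and whose weights are therefore exactly 0.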
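As a quick check of the formula, here is a minimal pure-Python sketch of softmax applied to the example scores. The function name and the max-subtraction step (a common trick to avoid overflow in exp, which does not change the result) are illustrative choices, not something the article prescribes.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into weights that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1, -1.0]
weights = softmax(scores)
print([round(w, 3) for w in weights])  # [0.638, 0.235, 0.095, 0.032]
print(sum(weights))                    # 1.0 up to floating-point rounding
```

Even the score of -1.0 keeps about 3% of the weight, which is the "never exactly 0" behaviour described above.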


Temperature Scaling

Temperature T modifies the scores before softmax:

softmax(z / T)

T = 1.0  (default, no scaling)
T < 1.0  → sharper distribution (winner-takes-more)
T > 1.0  → flatter distribution (more uniform)
Python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1, -1.0])

for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    weights = F.softmax(scores / T, dim=0)
    print(f"T={T}: {weights.numpy().round(3)}")

# T=0.1:  [1.000, 0.000, 0.000, 0.000]   near one-hot
# T=0.5:  [0.862, 0.117, 0.019, 0.002]
# T=1.0:  [0.638, 0.235, 0.095, 0.032]   standard
# T=2.0:  [0.451, 0.274, 0.174, 0.101]
# T=10.0: [0.288, 0.261, 0.238, 0.213]   near uniform

Sharp vs Flat Attention

Sharp attention (low T / large gaps between scores):

  • The model focuses intensely on one or few positions
  • Useful for tasks requiring precise token lookup
  • Risk: ignores potentially relevant context

Flat attention (high T / small gaps between scores):

  • The model blends information from many positions roughly equally
  • More robust to small score differences
  • Risk: dilutes the important signal

In transformers trained with gradient descent, the model learns when to be sharp and when to be flat through the Q and K weight matrices, which control how far apart the scores are. No separate temperature is tuned during training: the only scaling is the fixed √dₖ factor, so the scaled scores are effectively used at T=1.
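To see why learned score magnitudes and temperature are two views of the same knob, the sketch below checks that multiplying every score by a factor c gives the same weights as using T = 1/c. The factor c is a hypothetical stand-in for whatever scale the Q and K projections end up producing.

```python
import math

def softmax(scores, T=1.0):
    """softmax(scores / T), computed with the max subtracted for stability."""
    scaled = [s / T for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1, -1.0]
c = 4.0  # hypothetical factor by which learned projections enlarge the scores

bigger_scores = softmax([c * s for s in scores])  # larger scores, T = 1
lower_temp = softmax(scores, T=1.0 / c)           # same scores, T = 0.25

# The two distributions agree up to floating-point rounding.
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(bigger_scores, lower_temp)))  # True
```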


The Scaling Factor √dₖ Is Temperature

The division by √dₖ in attention is exactly temperature scaling with T=√dₖ:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
                              ↑
                    T = √dₖ (temperature)
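The formula can be sketched in a few lines of plain Python. This is a toy version on lists of row vectors, with no batching, masking, or multiple heads; the function names and the example inputs are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors (no masking, no batching)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs, chosen only for illustration.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Each output row is a weighted average of the rows of V, so it always lies between the smallest and largest value in each column.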

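Without this, larger dₖ produces dot products of larger magnitude, which sharpens attention, saturates the softmax, and makes gradients vanish. If the components of q and k are independent with mean 0 and variance 1, the dot product q·k has variance dₖ; dividing by √dₖ brings that variance back to 1 and keeps the softmax in a gradient-friendly regime.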
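The variance argument can be checked with a small simulation. The sketch below assumes query and key entries are independent standard-normal values, which is an idealisation rather than a property of any trained model; the function name, sample count, and seed are arbitrary.

```python
import math
import random

def dot_variance(d_k, n_samples=1000, scale=False, seed=0):
    """Empirical variance of q·k for q, k with i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples

for d_k in (16, 64):
    raw = dot_variance(d_k)
    scaled = dot_variance(d_k, scale=True)
    print(f"d_k={d_k}: var(q·k) ≈ {raw:.1f}, var(q·k/√d_k) ≈ {scaled:.2f}")
```

The unscaled variance should land close to d_k, while the scaled one should stay close to 1 regardless of d_k.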


Softmax in Generation vs Attention

Note: temperature in generation (sampling next token) is a different use of the same concept:

Generation:   softmax(logits / T)  — controls diversity of sampled tokens
Attention:    softmax(Q·Kᵀ / √dₖ) — scaled by a fixed √dₖ, not a tunable T

In generation, T is a hyperparameter you tune. In attention, √dₖ is a fixed architectural choice. Both exploit the same property of softmax.
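For contrast with attention, here is a minimal sketch of temperature in generation, where T is chosen by the user at sampling time. The logits are made up and the helper name is hypothetical; real decoders usually combine this with other steps such as top-k or top-p filtering.

```python
import math
import random

def sample_token(logits, T, rng):
    """Sample an index from softmax(logits / T)."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalised; choices() normalises
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical next-token logits
rng = random.Random(0)
for T in (0.1, 1.0, 10.0):
    counts = [0] * len(logits)
    for _ in range(1000):
        counts[sample_token(logits, T, rng)] += 1
    print(f"T={T}: {counts}")
```

At low T nearly every sample is the top-scoring token; at high T the counts spread out across all four.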


Attention Entropy

The entropy of the attention distribution measures how spread out it is:

H(Aᵢ) = -Σⱼ Aᵢⱼ · log(Aᵢⱼ)     (Aᵢ = attention weights of query position i)

Low entropy (near 0):   sharp — model attends to one position
High entropy (near log n): flat — model attends uniformly to n positions
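The two extremes can be reproduced with a short sketch that computes the entropy of one attention row at several temperatures. The helper names are illustrative, and entropy is measured in nats (natural log).

```python
import math

def softmax(scores, T=1.0):
    scaled = [s / T for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy in nats; terms with zero weight contribute 0."""
    return sum(-w * math.log(w) for w in weights if w > 0.0)

scores = [2.0, 1.0, 0.1, -1.0]
print(f"upper bound log(n) = {math.log(len(scores)):.3f}")
for T in (0.1, 1.0, 10.0):
    print(f"T={T}: H = {entropy(softmax(scores, T)):.3f}")
```

The entropy rises with T and approaches log(n) as the weights become uniform.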

Researchers have observed that attention entropy tends to decrease during training as models learn to focus. Different layers develop different entropy profiles: early layers often have higher entropy, later layers lower.


Interview Answer

"Softmax converts attention scores to a probability distribution summing to 1, allowing us to compute a weighted average of value vectors. Temperature T scales scores before softmax: lower T sharpens the distribution (near one-hot), higher T flattens it (near uniform). In standard attention, the √dₖ factor plays the role of temperature — it prevents the dot products from growing too large, which would saturate the softmax and kill gradients. The model learns appropriate attention sharpness for each head through training."
