Temperature and Sampling Parameters
Control LLM output diversity with temperature, top-k, top-p, and repetition penalties. Learn when to use deterministic vs stochastic sampling for different task types.
The Sampling Decision
After the model computes a probability distribution over the vocabulary, it must choose the next token. Sampling parameters control how this choice is made:
- Deterministic (greedy): Always pick the highest-probability token
- Stochastic: Sample randomly from the distribution, shaped by parameters
The model's raw output is a vector of logits, one per vocabulary token. Sampling parameters reshape the probability distribution derived from those logits, and the final token is selected from the reshaped distribution.
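
As a rough illustration of the two modes, the sketch below contrasts greedy and stochastic selection on a toy logit vector. The helper names (`softmax`, `pick_greedy`, `pick_stochastic`) and the three-token vocabulary are assumptions made for this example, not part of any library.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def pick_greedy(logits: np.ndarray) -> int:
    """Deterministic: always return the index of the largest logit."""
    return int(np.argmax(logits))


def pick_stochastic(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Stochastic: draw one index according to the softmax probabilities."""
    return int(rng.choice(len(logits), p=softmax(logits)))


toy_logits = np.array([2.0, 1.0, 0.5])  # hypothetical 3-token vocabulary
rng = np.random.default_rng(0)          # seeded only so the demo is repeatable

greedy_picks = [pick_greedy(toy_logits) for _ in range(5)]
sampled_picks = [pick_stochastic(toy_logits, rng) for _ in range(1000)]
counts = np.bincount(sampled_picks, minlength=len(toy_logits))

print(greedy_picks)  # greedy returns index 0 every time
print(counts)        # sampling spreads picks across all three tokens
```

Greedy selection never leaves the top token, while the sampled counts should roughly track the softmax probabilities of the toy logits.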
Temperature
Temperature divides logits before softmax, controlling distribution sharpness:

    import numpy as np


    def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
        """Scale logits by temperature, then apply softmax."""
        if temperature == 0:
            # Greedy: one-hot at argmax
            result = np.zeros_like(logits, dtype=float)
            result[np.argmax(logits)] = 1.0
            return result
        scaled = logits / temperature
        # Numerically stable softmax
        scaled -= scaled.max()
        exp_logits = np.exp(scaled)
        return exp_logits / exp_logits.sum()


    # Example: 5-token vocabulary
    logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])
    for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
        probs = apply_temperature(logits, temp)
        print(f"T={temp:.1f}: {probs.round(2)}")
    # Approximate probabilities, rounded to two decimals:
    # T=0.1: [1.00, 0.00, 0.00, 0.00, 0.00]  very peaked
    # T=0.5: [0.84, 0.11, 0.04, 0.01, 0.00]  peaked
    # T=1.0: [0.58, 0.21, 0.13, 0.05, 0.03]  the model's unmodified distribution
    # T=1.5: [0.45, 0.23, 0.17, 0.09, 0.07]  more spread
    # T=2.0: [0.38, 0.23, 0.18, 0.11, 0.09]  flatter still

Temperature guidelines:
- T = 0: Greedy decoding. In principle the same input always produces the same output, though hosted APIs may still vary slightly.
- T ≈ 0.1-0.3: Near-deterministic with occasional variation
- T ≈ 0.7-1.0: Balanced; good for general chat and creative tasks
- T > 1.0: Increases randomness and may introduce incoherence
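
One way to put a number on "sharpness" is the Shannon entropy of the resulting distribution. The sketch below reuses the toy logits from the example above; the helper names `softmax_with_temperature` and `entropy_bits` are assumptions introduced here for illustration.

```python
import numpy as np


def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits / temperature (temperature must be > 0 here)."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())


logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])  # same toy logits as the example above
max_entropy = float(np.log2(len(logits)))      # entropy of a uniform distribution

entropies = {
    t: entropy_bits(softmax_with_temperature(logits, t))
    for t in [0.1, 0.5, 1.0, 1.5, 2.0]
}
for t, h in entropies.items():
    print(f"T={t:.1f}: entropy = {h:.2f} bits (uniform would be {max_entropy:.2f})")
```

Entropy rises with temperature and approaches, but does not reach, the uniform maximum within this range, which is a compact way to see why higher temperatures produce more varied output.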
Top-K Sampling
Truncate the distribution to the K most probable tokens before sampling:

    def top_k_sample(logits: np.ndarray, k: int, temperature: float = 1.0) -> int:
        """Sample from top-k tokens only."""
        # Get top-k indices and values
        top_k_indices = np.argsort(logits)[-k:]
        top_k_logits = logits[top_k_indices]
        # Apply temperature
        probs = apply_temperature(top_k_logits, temperature)
        # Sample
        chosen_idx = np.random.choice(len(top_k_logits), p=probs)
        return int(top_k_indices[chosen_idx])


    # k=1 is equivalent to greedy
    # k=50 is a common default (for example, in Hugging Face Transformers); note that
    #   OpenAI's API does not expose top_k. It allows meaningful diversity while preventing garbage
    # k=vocabulary_size means no truncation (pure temperature sampling)

Problem with top-k: A fixed K can keep too few tokens (when the distribution is flat) or too many (when one token dominates). With top-k=50, exactly 50 tokens are kept in every context, yet sometimes 2 tokens cover 99% of the probability and sometimes 1000 tokens each carry meaningful probability.
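
To make the mismatch concrete, the sketch below builds one peaked and one flat distribution and checks what a fixed k=50 does to each. The distributions, the vocabulary size of 1024, and the helper `tokens_needed` are all assumptions chosen for this illustration.

```python
import numpy as np


def tokens_needed(probs: np.ndarray, mass: float) -> int:
    """Count how many of the most likely tokens are needed to reach `mass`."""
    sorted_probs = np.sort(probs)[::-1]
    cumulative = np.cumsum(sorted_probs)
    return int(np.searchsorted(cumulative, mass) + 1)


vocab_size = 1024  # hypothetical vocabulary size for the demo

# Peaked case: two tokens hold 99% of the mass, the rest share the remainder.
peaked = np.full(vocab_size, 0.01 / (vocab_size - 2))
peaked[0], peaked[1] = 0.60, 0.39

# Flat case: every token is equally likely.
flat = np.full(vocab_size, 1.0 / vocab_size)

k = 50
# Mass that top-k=50 hands to the 48 unlikely tokens in the peaked case.
peaked_top_k_tail = float(np.sort(peaked)[::-1][2:k].sum())
# Mass that top-k=50 retains in the flat case.
flat_top_k_mass = float(np.sort(flat)[::-1][:k].sum())

print(tokens_needed(peaked, 0.95))  # 2 tokens already cover 95%
print(tokens_needed(flat, 0.95))    # 973 tokens are needed for 95%
print(f"peaked: k={k} keeps 48 tail tokens worth {peaked_top_k_tail:.5f} of the mass")
print(f"flat:   k={k} keeps only {flat_top_k_mass:.3f} of the mass")
```

In the peaked case k=50 admits dozens of near-zero candidates; in the flat case it discards roughly 95% of the plausible options. Neither outcome is what a fixed cutoff was meant to achieve.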
Top-P (Nucleus Sampling)
Select the minimum set of tokens covering probability mass P, then sample from them:

    def top_p_sample(logits: np.ndarray, p: float = 0.9, temperature: float = 1.0) -> int:
        """Nucleus sampling: sample from tokens that collectively cover probability p."""
        # Apply temperature first
        probs = apply_temperature(logits, temperature)
        # Sort tokens by probability (descending)
        sorted_indices = np.argsort(-probs)
        sorted_probs = probs[sorted_indices]
        # Find the smallest set that covers cumulative probability >= p
        cumulative_probs = np.cumsum(sorted_probs)
        cutoff_idx = np.searchsorted(cumulative_probs, p) + 1  # +1 to include the cutoff token
        # Truncate and renormalize
        nucleus_indices = sorted_indices[:cutoff_idx]
        nucleus_probs = probs[nucleus_indices]
        nucleus_probs = nucleus_probs / nucleus_probs.sum()
        # Sample
        chosen_idx = np.random.choice(len(nucleus_probs), p=nucleus_probs)
        return int(nucleus_indices[chosen_idx])


    # Practical examples:
    # When p=0.9:
    # - If top token has 95% probability -> nucleus = 1 token (near-greedy)
    # - If top 100 tokens each have 0.9% -> nucleus = 100 tokens (more creative)
    # This adaptive behavior is why top-p often outperforms top-k

Combining Temperature, Top-K, and Top-P
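
Before looking at hosted APIs, it can help to see the three controls chained in one local function. Libraries differ in the order they apply these filters; the sketch below assumes temperature first, then top-k, then top-p over the renormalized survivors. The function name `sample_next_token` and the "0 disables top-k, 1.0 disables top-p" conventions are assumptions for this example.

```python
import numpy as np


def sample_next_token(
    logits: np.ndarray,
    rng: np.random.Generator,
    temperature: float = 1.0,
    top_k: int = 0,      # 0 disables top-k in this sketch
    top_p: float = 1.0,  # 1.0 disables top-p in this sketch
) -> int:
    """Hypothetical combined sampler: temperature, then top-k, then top-p."""
    if temperature == 0:
        return int(np.argmax(logits))  # greedy shortcut

    # 1. Temperature: rescale logits and convert to probabilities.
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    probs = probs / probs.sum()

    # 2. Rank tokens from most to least likely.
    order = np.argsort(-probs)
    sorted_probs = probs[order]

    # 3. Top-k: keep at most k candidates, then renormalize.
    keep = len(sorted_probs) if top_k <= 0 else min(top_k, len(sorted_probs))
    kept = sorted_probs[:keep] / sorted_probs[:keep].sum()

    # 4. Top-p: keep the smallest prefix whose mass reaches top_p.
    if top_p < 1.0:
        cutoff = int(np.searchsorted(np.cumsum(kept), top_p) + 1)
        kept = kept[:cutoff] / kept[:cutoff].sum()

    # 5. Sample among the survivors and map back to a vocabulary index.
    choice = rng.choice(len(kept), p=kept)
    return int(order[choice])


demo_logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])
rng = np.random.default_rng(0)
top_k_draws = [sample_next_token(demo_logits, rng, top_k=2) for _ in range(200)]
top_p_draws = [sample_next_token(demo_logits, rng, top_p=0.5) for _ in range(200)]
print(sorted(set(top_k_draws)))  # only the two most likely tokens can appear
print(sorted(set(top_p_draws)))  # the top token alone already covers 50%
```

Because each filter only narrows the candidate set, the effective restriction is whichever control is tightest for the current distribution.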
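In production APIs, these parameters are set together on a single request. The OpenAI Chat Completions API exposes temperature and top_p (but not top_k), alongside the repetition penalties: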

    from openai import OpenAI

    client = OpenAI()


    def generate_with_params(
        prompt: str,
        temperature: float = 1.0,
        top_p: float = 1.0,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        max_tokens: int = 500,
    ) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            top_p=top_p,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content


    # Task-appropriate configurations
    CONFIGS = {
        "factual_qa": {
            "temperature": 0.0,
            "top_p": 1.0,
            "frequency_penalty": 0.0,
            "presence_penalty": 0.0,
        },
        "code_generation": {
            "temperature": 0.2,
            "top_p": 0.95,
            "frequency_penalty": 0.0,
            "presence_penalty": 0.0,
        },
        "general_chat": {
            "temperature": 0.7,
            "top_p": 0.9,
            "frequency_penalty": 0.3,
            "presence_penalty": 0.0,
        },
        "creative_writing": {
            "temperature": 1.0,
            "top_p": 0.95,
            "frequency_penalty": 0.5,
            "presence_penalty": 0.5,
        },
        "brainstorming": {
            "temperature": 1.2,
            "top_p": 1.0,
            "frequency_penalty": 0.8,
            "presence_penalty": 0.6,
        },
    }


    # Use appropriate config for each task
    def generate_for_task(prompt: str, task: str) -> str:
        config = CONFIGS.get(task, CONFIGS["general_chat"])
        return generate_with_params(prompt, **config)

Repetition Penalties
Frequency penalty: Reduces probability of tokens proportional to how often they've appeared. Prevents verbatim repetition.
Presence penalty: Reduces probability of any token that has appeared at all (binary: present or not). Encourages topic diversity.
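
The two penalties can be sketched as direct logit adjustments. The version below follows the commonly documented form (subtract `frequency_penalty` times the count, plus a flat `presence_penalty` for any token already seen); the function name `apply_repetition_penalties` and the toy inputs are assumptions for illustration, not an actual provider implementation.

```python
from collections import Counter

import numpy as np


def apply_repetition_penalties(
    logits: np.ndarray,
    generated_token_ids: list[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> np.ndarray:
    """Return a penalized copy of `logits` based on tokens generated so far."""
    adjusted = logits.astype(float)  # astype returns a copy, so the input is untouched
    for token_id, count in Counter(generated_token_ids).items():
        adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once if seen
    return adjusted


logits = np.array([2.0, 2.0, 2.0, 2.0])
history = [0, 0, 0, 1]  # hypothetical: token 0 appeared three times, token 1 once
penalized = apply_repetition_penalties(
    logits, history, frequency_penalty=0.5, presence_penalty=0.2
)
print(penalized)  # token 0 drops to 0.3, token 1 to 1.3, unseen tokens stay at 2.0
```

Token 0 is hit hardest because the frequency term scales with its three occurrences, while the presence term costs tokens 0 and 1 the same flat amount.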

    # How frequency penalty works (simplified):
    # logit_adjusted[token] = logit[token] - frequency_penalty × count(token)
    # Frequency penalty = 0.0: no adjustment (default)
    # Frequency penalty = 0.5: moderate repetition reduction
    # Frequency penalty = 2.0: strong repetition avoidance (may cause incoherence)
    # Example: prevents "the drug" "the drug" "the drug" repeating
    response = generate_with_params(
        "Tell me about warfarin's properties.",
        frequency_penalty=0.5,  # Moderately penalize repeated words
    )

Temperature for Reproducibility
When reproducibility matters (testing, debugging), use temperature=0 and a fixed seed:

    # OpenAI seed parameter for reproducibility
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is the warfarin-clarithromycin interaction?"}],
        temperature=0,
        seed=42,  # Best-effort determinism: repeated requests should mostly match, but this is not guaranteed
    )
    # system_fingerprint identifies the backend configuration that served the request
    print(f"System fingerprint: {response.system_fingerprint}")
    # If the fingerprint changes, the backend changed and output may differ even with seed=42

Task-Parameter Quick Reference
| Task | Temperature | Top-P | Frequency Penalty | Notes |
|---|---|---|---|---|
| Drug interaction facts | 0.0 | 1.0 | 0.0 | Deterministic: one right answer |
| Clinical calculations | 0.0 | 1.0 | 0.0 | Never sample for math |
| Structured JSON extraction | 0.0-0.2 | 1.0 | 0.0 | Low variance needed |
| Medical Q&A | 0.3-0.5 | 0.9 | 0.1 | Some diversity, mostly accurate |
| Patient counseling text | 0.7 | 0.9 | 0.3 | Natural language, not robotic |
| Creative patient scenarios | 1.0 | 0.95 | 0.5 | Variety needed |
| Brainstorming diagnoses | 1.0-1.2 | 1.0 | 0.8 | Maximum diversity |
Diagnosing Sampling Problems

    def diagnose_output_quality(
        prompt: str,
        n_samples: int = 10,
        temperature: float = 1.0,
    ) -> dict:
        """Diagnose output consistency and quality at a given temperature."""
        outputs = [generate_with_params(prompt, temperature=temperature) for _ in range(n_samples)]
        # Check for repetition within responses
        repetition_scores = []
        for output in outputs:
            words = output.lower().split()
            unique_ratio = len(set(words)) / max(len(words), 1)
            repetition_scores.append(unique_ratio)
        # Check for inter-response consistency
        lengths = [len(o.split()) for o in outputs]
        return {
            "temperature": temperature,
            "avg_length": sum(lengths) / len(lengths),
            # Spread of response lengths (max minus min), a range rather than a true variance
            "length_range": max(lengths) - min(lengths),
            "avg_unique_word_ratio": sum(repetition_scores) / len(repetition_scores),
            "all_outputs_identical": len(set(outputs)) == 1,
            "n_unique_outputs": len(set(outputs)),
        }


    # If all_outputs_identical at temp=0.7 -> output is too constrained
    # If avg_unique_word_ratio < 0.5 -> severe repetition problem -> increase frequency_penalty
    # If n_unique_outputs == n_samples -> high variance -> consider reducing temperature

Anthropic Claude Parameter Equivalents
Claude uses temperature and top_p similarly to OpenAI, with a few differences: Claude does not expose frequency_penalty or presence_penalty, it accepts top_k, and its temperature range is 0.0 to 1.0 rather than OpenAI's 0 to 2. Claude's RLHF training generally handles repetition better than early OpenAI models did.

    import anthropic

    claude_client = anthropic.Anthropic()
    response = claude_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        temperature=0.7,  # 0.0-1.0 (same interpretation as OpenAI, narrower range)
        top_p=0.9,  # Nucleus sampling; Anthropic's docs suggest adjusting temperature or top_p, not both
        # top_k also supported
        messages=[{"role": "user", "content": "Explain warfarin's mechanism."}],
    )
    print(response.content[0].text)

Note: For factual tasks with Claude, temperature=0 is not guaranteed to be deterministic. Anthropic recommends low temperatures for more consistent outputs but states that results will not be fully deterministic even at 0.0. OpenAI's seed + temperature=0 combination is likewise best-effort rather than a guarantee of bit-identical results.
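
The parameter differences described in this section can be captured in a small adapter. The sketch below is a hypothetical helper (`to_claude_params` is not part of either SDK) that turns an OpenAI-style sampling config, such as the task configs used earlier, into keyword arguments for Claude. It assumes clamping temperature to 1.0 is acceptable and that a top_p of 1.0 can simply be omitted because it is a no-op.

```python
def to_claude_params(config: dict) -> tuple[dict, list[str]]:
    """Hypothetical adapter from an OpenAI-style sampling config to Claude kwargs.

    Returns the converted parameters plus the names of any settings that
    have no Claude equivalent and were therefore dropped.
    """
    params: dict = {}
    if "temperature" in config:
        # Assumption: clamp into Claude's 0.0 to 1.0 temperature range.
        params["temperature"] = min(max(float(config["temperature"]), 0.0), 1.0)
    if config.get("top_p", 1.0) < 1.0:
        # Only forward top_p when it actually truncates the distribution.
        params["top_p"] = config["top_p"]
    if "top_k" in config:
        params["top_k"] = config["top_k"]
    # Penalties are not exposed by Claude, so report them instead of sending them.
    dropped = [k for k in ("frequency_penalty", "presence_penalty") if k in config]
    return params, dropped


brainstorming = {
    "temperature": 1.2,
    "top_p": 1.0,
    "frequency_penalty": 0.8,
    "presence_penalty": 0.6,
}
claude_params, dropped = to_claude_params(brainstorming)
print(claude_params)  # {'temperature': 1.0}
print(dropped)        # ['frequency_penalty', 'presence_penalty']
```

Returning the dropped names rather than silently discarding them makes it easier to notice when a task config relies on penalties that Claude will not apply.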