AI Systems | Intermediate

Temperature and Sampling Parameters

Control LLM output diversity with temperature, top-k, top-p, and repetition penalties. Learn when to use deterministic vs stochastic sampling for different task types.

Asma Hafeez Khan | May 16, 2026 | 7 min read
Tags: Prompt Engineering, Temperature, Sampling, LLM

The Sampling Decision

After the model computes a probability distribution over the vocabulary, it must choose the next token. Sampling parameters control how this choice is made:

  • Deterministic (greedy): Always pick the highest-probability token
  • Stochastic: Sample randomly from the distribution, shaped by parameters

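The model's raw output is a vector of logits, one unnormalized score per vocabulary token. Softmax turns those logits into probabilities, the sampling parameters below reshape or truncate that distribution, and the next token is drawn from the result.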
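
To make the two modes concrete, here is a minimal sketch, assuming a toy four-token vocabulary with invented logits. The helper names `softmax`, `greedy_pick`, and `sample_pick` are illustrative and not taken from any library.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities (numerically stable)."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def greedy_pick(logits: np.ndarray) -> int:
    """Deterministic: always the highest-scoring token."""
    return int(np.argmax(logits))

def sample_pick(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Stochastic: draw one token index according to its probability."""
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary and logits, invented for illustration
vocab = ["the", "a", "this", "that"]
logits = np.array([2.0, 1.0, 0.5, 0.1])

rng = np.random.default_rng(0)
greedy_tokens = [vocab[greedy_pick(logits)] for _ in range(5)]
sampled_tokens = [vocab[sample_pick(logits, rng)] for _ in range(5)]

print(greedy_tokens)   # the same token five times
print(sampled_tokens)  # a mix, weighted toward the likelier tokens
```

Greedy picking ignores everything except the top score, while sampling uses the whole distribution, which is why the parameters in the following sections only matter in the stochastic case.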


Temperature

Temperature divides logits before softmax, controlling distribution sharpness:

Python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Scale logits by temperature before softmax."""
    if temperature == 0:
        # Greedy: one-hot at argmax
        result = np.zeros_like(logits)
        result[np.argmax(logits)] = 1.0
        return result

    scaled = logits / temperature
    # Numerically stable softmax
    scaled -= scaled.max()
    exp_logits = np.exp(scaled)
    return exp_logits / exp_logits.sum()

# Example: 5-token vocabulary
logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])

for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
    probs = apply_temperature(logits, temp)
    print(f"T={temp:.1f}: {probs.round(3)}")

# Approximate probabilities, rounded to 3 decimals:
# T=0.1: [1.000, 0.000, 0.000, 0.000, 0.000]  (almost all mass on the top token)
# T=0.5: [0.837, 0.113, 0.042, 0.006, 0.003]  (peaked)
# T=1.0: [0.579, 0.213, 0.129, 0.047, 0.032]  (the model's unscaled distribution)
# T=1.5: [0.452, 0.232, 0.166, 0.085, 0.065]  (more spread)
# T=2.0: [0.385, 0.233, 0.182, 0.110, 0.090]  (flatter still, though far from uniform)

Temperature guidelines:

  • T = 0: Greedy decoding. Deterministic in principle, although hosted APIs can still show small run-to-run differences
  • T ≈ 0.1–0.3: Near-deterministic with occasional variation
  • T ≈ 0.7–1.0: Balanced; good for general chat and creative tasks
  • T > 1.0: Increases randomness and may introduce incoherence
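
One way to put a number on "sharpness" is the entropy of the resulting distribution. The sketch below is an illustration under that assumption, reusing the five example logits from above. The function names are hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits divided by a positive temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits: 0 for a one-hot distribution, log2(n) for uniform."""
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])
entropies = {
    t: entropy_bits(softmax_with_temperature(logits, t))
    for t in [0.1, 0.5, 1.0, 1.5, 2.0]
}
max_entropy = float(np.log2(len(logits)))
for t, h in entropies.items():
    print(f"T={t:.1f}: entropy = {h:.2f} bits (uniform would be {max_entropy:.2f})")
```

Entropy rises steadily with temperature and approaches the uniform limit only as T grows very large, which matches the intuition that higher temperature spreads probability over more tokens.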

Top-K Sampling

Truncate the distribution to the K most probable tokens before sampling:

Python
def top_k_sample(logits: np.ndarray, k: int, temperature: float = 1.0) -> int:
    """Sample from top-k tokens only."""
    # Get top-k indices and values
    top_k_indices = np.argsort(logits)[-k:]
    top_k_logits = logits[top_k_indices]

    # Apply temperature
    probs = apply_temperature(top_k_logits, temperature)

    # Sample
    chosen_idx = np.random.choice(len(top_k_logits), p=probs)
    return top_k_indices[chosen_idx]

# k=1 is equivalent to greedy
# k=50 is a common default (for example in Hugging Face Transformers): it allows
# meaningful diversity while cutting off the long tail. The OpenAI API does not expose top-k
# k=vocabulary_size means no truncation (pure temperature sampling)

Problem with top-k: A fixed K can be too few (when the distribution is flat) or too many (when one token dominates). Top-k=50 means 50 tokens in all contexts — but sometimes 2 tokens cover 99% of probability; sometimes 1000 tokens each have meaningful probability.
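
The mismatch is easy to see on two synthetic distributions. This is a sketch with invented numbers, and `top_k_mass` is a hypothetical helper.

```python
import numpy as np

def top_k_mass(probs: np.ndarray, k: int) -> float:
    """Total probability captured by the k most likely tokens."""
    return float(np.sort(probs)[-k:].sum())

vocab_size = 1000

# Peaked: two tokens hold 99% of the mass, the remaining 998 share 1%
peaked = np.full(vocab_size, 0.01 / (vocab_size - 2))
peaked[:2] = [0.90, 0.09]

# Flat: every token is equally likely
flat = np.full(vocab_size, 1.0 / vocab_size)

k = 50
peaked_top2 = top_k_mass(peaked, 2)
peaked_tail_mass = top_k_mass(peaked, k) - peaked_top2
flat_mass = top_k_mass(flat, k)

print(f"peaked: top-2 cover {peaked_top2:.2f}; the other {k - 2} kept tokens add {peaked_tail_mass:.4f}")
print(f"flat:   top-{k} keeps only {flat_mass:.2f} of the mass")
```

In the peaked case, k=50 keeps 48 tokens that contribute almost nothing. In the flat case, it discards 95% of equally plausible candidates.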


Top-P (Nucleus Sampling)

Select the minimum set of tokens covering probability mass P, then sample from them:

Python
def top_p_sample(logits: np.ndarray, p: float = 0.9, temperature: float = 1.0) -> int:
    """Nucleus sampling: sample from tokens that collectively cover probability p."""
    # Apply temperature first
    probs = apply_temperature(logits, temperature)

    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(-probs)
    sorted_probs = probs[sorted_indices]

    # Find the smallest set that covers cumulative probability >= p
    cumulative_probs = np.cumsum(sorted_probs)
    cutoff_idx = np.searchsorted(cumulative_probs, p) + 1  # +1 to include the cutoff token

    # Truncate and renormalize
    nucleus_indices = sorted_indices[:cutoff_idx]
    nucleus_probs = probs[nucleus_indices]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()

    # Sample
    chosen_idx = np.random.choice(len(nucleus_probs), p=nucleus_probs)
    return nucleus_indices[chosen_idx]

# Practical examples:
# When p=0.9:
# - If top token has 95% probability → nucleus = 1 token (near-greedy)
# - If top 100 tokens each have 0.9% → nucleus = 100 tokens (more creative)
# This adaptive behavior is why top-p often outperforms top-k
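
The adaptive behavior can be checked by counting how many tokens the nucleus contains for different distribution shapes. This is a sketch on synthetic distributions, and `nucleus_size` is a hypothetical helper that mirrors the cutoff logic above.

```python
import numpy as np

def nucleus_size(probs: np.ndarray, p: float) -> int:
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    sorted_probs = np.sort(probs)[::-1]          # most likely first
    cumulative = np.cumsum(sorted_probs)
    # First position where the running total reaches p, converted to a count
    size = int(np.searchsorted(cumulative, p)) + 1
    return min(size, len(probs))

vocab_size = 1000

# Peaked: one token holds 95% of the mass
peaked = np.full(vocab_size, 0.05 / (vocab_size - 1))
peaked[0] = 0.95

# Flat: every token is equally likely
flat = np.full(vocab_size, 1.0 / vocab_size)

print(nucleus_size(peaked, 0.9))  # 1: the top token alone exceeds 0.9
print(nucleus_size(flat, 0.9))    # roughly 900 tokens are needed
```

The same p=0.9 yields a nucleus of one token in the first case and about nine hundred in the second, with no retuning.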

Combining Temperature, Top-K, and Top-P

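Production APIs expose several of these controls together. The OpenAI Chat Completions API accepts temperature and top_p alongside the repetition penalties covered in the next section, but it has no top_k parameter. OpenAI's documentation generally recommends adjusting temperature or top_p rather than both, so treat configurations that change both as starting points to tune: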

Python
from openai import OpenAI

client = OpenAI()

def generate_with_params(
    prompt: str,
    temperature: float = 1.0,
    top_p: float = 1.0,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    max_tokens: int = 500,
) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

# Task-appropriate configurations
CONFIGS = {
    "factual_qa": {
        "temperature": 0.0,
        "top_p": 1.0,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
    },
    "code_generation": {
        "temperature": 0.2,
        "top_p": 0.95,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
    },
    "general_chat": {
        "temperature": 0.7,
        "top_p": 0.9,
        "frequency_penalty": 0.3,
        "presence_penalty": 0.0,
    },
    "creative_writing": {
        "temperature": 1.0,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
        "presence_penalty": 0.5,
    },
    "brainstorming": {
        "temperature": 1.2,
        "top_p": 1.0,
        "frequency_penalty": 0.8,
        "presence_penalty": 0.6,
    },
}

# Use appropriate config for each task
def generate_for_task(prompt: str, task: str) -> str:
    config = CONFIGS.get(task, CONFIGS["general_chat"])
    return generate_with_params(prompt, **config)
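
The API call hides how the filters compose. Below is a minimal local sketch, assuming one common ordering (temperature, then top-k, then top-p). The function name `sample_next_token` and the "0 disables top-k" convention are choices made for this example, and hosted APIs may order or implement these steps differently.

```python
import numpy as np

def sample_next_token(
    logits: np.ndarray,
    rng: np.random.Generator,
    temperature: float = 1.0,
    top_k: int = 0,       # 0 disables top-k (convention assumed for this sketch)
    top_p: float = 1.0,   # 1.0 disables top-p
) -> int:
    """Temperature, then top-k, then top-p, then a single random draw."""
    if temperature <= 0:
        return int(np.argmax(logits))  # treat T=0 as greedy

    scaled = logits / temperature
    scaled = scaled - scaled.max()     # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()

    order = np.argsort(-probs)         # token indices, most likely first
    if top_k > 0:
        order = order[:top_k]
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        # Threshold is relative to the mass that survived top-k
        cutoff = int(np.searchsorted(cumulative, top_p * cumulative[-1])) + 1
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum()  # renormalize the survivors
    return int(order[rng.choice(len(order), p=kept)])

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.0, 1.5, 0.5, 0.1])
draws = [sample_next_token(logits, rng, temperature=0.8, top_k=3, top_p=0.9) for _ in range(200)]
print(sorted(set(draws)))  # only indices of the three most likely tokens can appear
```

Each filter only ever removes candidates, so the final draw comes from the intersection of what all three allow.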

Repetition Penalties

Frequency penalty: Reduces probability of tokens proportional to how often they've appeared. Prevents verbatim repetition.

Presence penalty: Reduces probability of any token that has appeared at all (binary — present or not). Encourages topic diversity.

Python
# How frequency penalty works (simplified):
# logit_adjusted[token] = logit[token] - frequency_penalty * count(token)

# Frequency penalty = 0.0: no adjustment (default)
# Frequency penalty = 0.5: moderate repetition reduction
# Frequency penalty = 2.0: strong repetition avoidance (may cause incoherence)

# Example: prevents "the drug" "the drug" "the drug" repeating
response = generate_with_params(
    "Tell me about warfarin's properties.",
    frequency_penalty=0.5,  # Moderately penalize repeated words
)
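
The two penalties can be sketched directly on a logit vector. This is a simplified illustration of the idea, not the provider's actual implementation. The function name and the toy history are assumptions.

```python
from collections import Counter

import numpy as np

def apply_repetition_penalties(
    logits: np.ndarray,
    generated_token_ids: list[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> np.ndarray:
    """Lower the logits of tokens that already appear in the generated text."""
    adjusted = logits.astype(float)  # astype returns a copy, so the input is untouched
    for token_id, count in Counter(generated_token_ids).items():
        adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once per seen token
    return adjusted

logits = np.array([2.0, 2.0, 2.0, 2.0])
history = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once
adjusted = apply_repetition_penalties(
    logits, history, frequency_penalty=0.5, presence_penalty=0.2
)
print(adjusted)  # token 0 drops the most, token 1 a little, tokens 2 and 3 are unchanged
```

The frequency term scales with the count, so it targets loops of the same phrase. The presence term is a one-time cost that nudges the model toward tokens it has not used yet.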

Temperature for Reproducibility

When reproducibility matters (testing, debugging), use temperature=0 and a fixed seed:

Python
# OpenAI seed parameter for reproducibility
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the warfarin-clarithromycin interaction?"}],
    temperature=0,
    seed=42,  # Best-effort determinism: repeated requests should mostly return the same output
)

# system_fingerprint identifies the backend configuration that served the request
print(f"System fingerprint: {response.system_fingerprint}")
# If the fingerprint changes, the backend was updated and output may differ even with seed=42
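
What a seed does is easiest to see in a local sampler, where it fixes the stream of random numbers used for each draw. The sketch below uses an invented three-token distribution and a hypothetical `sample_sequence` helper. A hosted API has further sources of variation that a seed cannot control, which is why its guarantee is weaker.

```python
import numpy as np

def sample_sequence(probs: np.ndarray, length: int, seed: int) -> list[int]:
    """Draw `length` token indices from a fixed distribution using a seeded generator."""
    rng = np.random.default_rng(seed)
    return [int(rng.choice(len(probs), p=probs)) for _ in range(length)]

probs = np.array([0.5, 0.3, 0.2])

run_a = sample_sequence(probs, length=20, seed=42)
run_b = sample_sequence(probs, length=20, seed=42)
run_c = sample_sequence(probs, length=20, seed=7)

print(run_a == run_b)  # True: same seed, same random stream, same draws
print(run_a == run_c)  # almost certainly False: a different seed gives a different stream
```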

Task-Parameter Quick Reference

| Task | Temperature | Top-P | Frequency Penalty | Notes |
|---|---|---|---|---|
| Drug interaction facts | 0.0 | 1.0 | 0.0 | Deterministic: one right answer |
| Clinical calculations | 0.0 | 1.0 | 0.0 | Never sample for math |
| Structured JSON extraction | 0.0–0.2 | 1.0 | 0.0 | Low variance needed |
| Medical Q&A | 0.3–0.5 | 0.9 | 0.1 | Some diversity, mostly accurate |
| Patient counseling text | 0.7 | 0.9 | 0.3 | Natural language, not robotic |
| Creative patient scenarios | 1.0 | 0.95 | 0.5 | Variety needed |
| Brainstorming diagnoses | 1.0–1.2 | 1.0 | 0.8 | Maximum diversity |


Diagnosing Sampling Problems

Python
def diagnose_output_quality(
    prompt: str,
    n_samples: int = 10,
    temperature: float = 1.0,
) -> dict:
    """Diagnose output consistency and quality at a given temperature."""
    outputs = [generate_with_params(prompt, temperature=temperature) for _ in range(n_samples)]

    # Check for repetition within responses
    repetition_scores = []
    for output in outputs:
        words = output.lower().split()
        unique_ratio = len(set(words)) / max(len(words), 1)
        repetition_scores.append(unique_ratio)

    # Check for inter-response consistency
    lengths = [len(o.split()) for o in outputs]

    return {
        "temperature": temperature,
        "avg_length": sum(lengths) / len(lengths),
        # Range (longest minus shortest), a cheap stand-in for length variance
        "length_range": max(lengths) - min(lengths),
        "avg_unique_word_ratio": sum(repetition_scores) / len(repetition_scores),
        "all_outputs_identical": len(set(outputs)) == 1,
        "n_unique_outputs": len(set(outputs)),
    }

# If all_outputs_identical at temp=0.7 → output too constrained
# If avg_unique_word_ratio < 0.5 on short outputs → likely repetition problem → increase frequency_penalty
# If n_unique_outputs == n_samples → normal for long free-form text; for short answers that should agree, reduce temperature
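
Exact-match counts are a blunt measure, because two outputs that differ by one word count as fully distinct. A softer signal is the average pairwise similarity between samples. The sketch below uses the standard library's `difflib`. The helper name and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average character-level similarity (0.0 to 1.0) over all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Hand-written stand-ins for model outputs
samples = [
    "Warfarin inhibits vitamin K epoxide reductase.",
    "Warfarin inhibits vitamin K epoxide reductase.",
    "Warfarin blocks the recycling of vitamin K.",
]
score = mean_pairwise_similarity(samples)
print(f"mean pairwise similarity: {score:.2f}")
```

A score near 1.0 at a moderate temperature suggests the outputs are effectively identical, while a low score on a task with one right answer suggests the temperature is too high.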

Anthropic Claude Parameter Equivalents

Claude uses temperature and top_p much as OpenAI does, with two differences: temperature ranges from 0.0 to 1.0 rather than 0.0 to 2.0, and Claude does not expose frequency_penalty or presence_penalty. In practice, Claude's RLHF training tends to handle repetition well without them.

Python
import anthropic

claude_client = anthropic.Anthropic()

response = claude_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    temperature=0.7,  # Claude accepts 0.0–1.0 (OpenAI accepts 0.0–2.0)
    # top_p=0.9,      # Nucleus sampling; Anthropic's docs advise setting temperature or top_p, not both
    # top_k is also supported
    messages=[{"role": "user", "content": "Explain warfarin's mechanism."}]
)
print(response.content[0].text)

Note: For factual tasks with Claude, temperature=0 is not guaranteed to be deterministic. Anthropic's documentation states that even at temperature 0.0 the results will not be fully deterministic, so expect more consistent outputs rather than bit-identical ones. OpenAI's seed parameter is likewise best-effort, so neither API should be treated as strictly reproducible.
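
When the same task presets are reused across providers, the OpenAI-style settings need translating. The sketch below uses a hypothetical helper, `to_claude_params`, which drops the penalty keys, clamps temperature into Claude's 0.0 to 1.0 range, and forwards top_p only when no temperature is set. Those policy choices are assumptions for this example, not an official mapping.

```python
def to_claude_params(openai_style_config: dict) -> dict:
    """Translate an OpenAI-style sampling config into kwargs for a Claude request.

    Hypothetical helper: drops the penalty keys Claude does not accept and
    clamps temperature into the 0.0-1.0 range.
    """
    params = {}
    if "temperature" in openai_style_config:
        params["temperature"] = min(max(openai_style_config["temperature"], 0.0), 1.0)
    elif "top_p" in openai_style_config:
        # Only forward top_p when temperature is absent (set one or the other)
        params["top_p"] = openai_style_config["top_p"]
    return params

brainstorming = {
    "temperature": 1.2,
    "top_p": 1.0,
    "frequency_penalty": 0.8,
    "presence_penalty": 0.6,
}
print(to_claude_params(brainstorming))  # {'temperature': 1.0}
```

The result can be unpacked into a `messages.create(...)` call with `**`. Clamping silently changes behavior for presets above 1.0, so logging the adjustment may be worthwhile in real use.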
