Temperature, Top-K, Top-P
Definition
Temperature, Top-K, and Top-P are sampling parameters that control how an LLM selects the next token during text generation. After the model computes a probability distribution over its entire vocabulary (via softmax over logits), these parameters shape which tokens are candidates for selection and how likely each candidate is to be chosen. Together they govern the trade-off between determinism and diversity: low values make the model predictable and focused; high values make it creative and varied.
Temperature rescales the raw logits before the softmax step, effectively flattening or sharpening the probability distribution. A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 make the distribution more peaked — the model almost always picks the highest-probability token. Values above 1.0 flatten the distribution — more tokens become plausible candidates, producing more surprising and varied outputs. At temperature 0, generation becomes deterministic (argmax decoding).
Top-K and Top-P are truncation strategies applied after temperature scaling. Top-K keeps only the K most probable tokens and redistributes probability mass among them, discarding all others. Top-P (also called nucleus sampling) dynamically selects the smallest set of tokens whose cumulative probability mass reaches a threshold P, then samples from that set. Top-P is generally preferred over Top-K because the size of the candidate set adapts to the shape of the distribution: when the model is confident, the nucleus is small; when the model is uncertain, the nucleus expands to include more alternatives.
How it works
The parameters are applied sequentially: temperature scaling first, then Top-K truncation, then Top-P nucleus selection, then sampling from the surviving candidates. In practice, most APIs expose only temperature and top_p (OpenAI supports just these two); Anthropic additionally exposes top_k as an optional parameter. Combining Top-K and Top-P in a single request is possible but rarely necessary, since either truncation alone usually suffices.
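To make that ordering concrete, here is a minimal pure-Python sketch of the full pipeline. The function name, toy logits, and edge-case conventions are invented for this illustration; production inference engines operate on tensors and differ in detail.

```python
# Illustrative sampling pipeline: temperature scaling -> top-k truncation
# -> top-p (nucleus) truncation -> sampling. Not any vendor's actual code.
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Return the index of the sampled token. top_k=0 disables top-k."""
    if temperature == 0:  # by convention, T=0 means greedy (argmax) decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    # 1) Temperature scaling, then a numerically stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2) Rank candidates by probability, descending; apply top-k cutoff
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    # 3) Top-p: keep the smallest prefix whose cumulative mass reaches top_p
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 4) Sample from the surviving candidates (implicit renormalization:
    #    draw r uniformly over the remaining mass)
    mass = sum(probs[i] for i in nucleus)
    r = (rng or random).random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With temperature=0 the function reduces to greedy decoding, and with top_k=1 the nucleus collapses to the single most probable token, so both settings yield the argmax.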
Temperature
Temperature T divides each raw logit z_i before the softmax: p_i = softmax(z / T)_i. When T < 1, the logit differences are amplified — the highest-probability token gets an even larger share of the probability mass. When T > 1, logit differences shrink — probability mass spreads out more evenly. Common presets: T = 0 for deterministic extraction tasks, T = 0.2–0.4 for factual Q&A, T = 0.7–1.0 for creative writing, and T > 1.0 for maximum diversity (though quality degrades at extreme values).
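The sharpening and flattening effect is easy to see numerically. This short sketch (the logits are made up for illustration) prints the same four-token distribution at three temperatures; the top token's share shrinks as T grows.

```python
# How temperature reshapes a toy distribution over four tokens.
import math

def softmax_with_temperature(logits, T):
    """Numerically stable softmax of logits / T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 2.0, 1.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", [round(p, 3) for p in probs])
```

At T = 0.5 the top token dominates; at T = 2.0 the distribution is noticeably flatter, giving lower-ranked tokens a real chance of being sampled.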
Top-K
Top-K sampling restricts the candidate pool to the K tokens with the highest probability after temperature scaling. All tokens outside the top K are assigned probability zero before renormalization. The key limitation is that K is fixed regardless of how the distribution looks: when the model is very confident, even K=50 might include many near-zero-probability tokens that introduce noise; when the model is uncertain, a small K might cut off reasonable alternatives. Anthropic's API exposes top_k as a direct parameter; OpenAI's API does not support it natively.
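The truncate-and-renormalize step can be sketched in a few lines of Python (illustrative only; the helper name and toy probabilities are invented):

```python
# Top-k truncation on a toy distribution: zero out all but the k most
# probable tokens, then renormalize the survivors.
def top_k_filter(probs, k):
    if k <= 0 or k >= len(probs):
        return probs[:]  # k disabled or larger than the vocabulary
    cutoff = sorted(probs, reverse=True)[k - 1]
    # Note: if several tokens tie exactly at the cutoff, all of them survive.
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))  # mass concentrates on the top two tokens
```

With k = 2, the bottom two tokens are zeroed and the remaining 0.8 of probability mass is rescaled to sum to 1, giving the top token 0.625 instead of 0.5.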
Top-P (nucleus sampling)
Top-P sampling builds the candidate set dynamically. Starting from the most probable token and working downward, tokens are added to the nucleus until their cumulative probability reaches the threshold P. Only tokens in the nucleus are considered for sampling. With P = 0.9, the model samples from whichever tokens together account for 90% of the probability mass. Because the nucleus contracts when the model is confident (a few tokens dominate) and expands when it is uncertain (probability mass is spread thin), Top-P naturally adapts to the model's internal state. Top-P is supported by both OpenAI (top_p) and Anthropic (top_p) APIs.
When to use / When NOT to use
| Scenario | Recommended settings | Avoid |
|---|---|---|
| Factual Q&A, data extraction, classification | temperature=0–0.2, top_p=1.0 for near-deterministic output | High temperature; introduces hallucinations and format errors |
| Creative writing, brainstorming, ideation | temperature=0.8–1.0, top_p=0.95 for diverse, novel outputs | Temperature=0; produces repetitive, predictable text |
| Code generation | temperature=0.2–0.4, top_p=0.95; some variation helps avoid local optima | Temperature > 0.8; syntax errors and logic drift increase |
| Self-consistency (multiple reasoning paths) | temperature=0.6–1.0; diversity is intentional | Temperature=0; all paths would be identical, defeating the purpose |
| Structured output extraction (JSON, tables) | temperature=0, top_p=1.0 for strict schema adherence | Top-P < 0.9 combined with high temperature; schema violations spike |
| Dialogue / chatbots | temperature=0.5–0.7, top_p=0.9; balances coherence with naturalness | Extreme temperature in either direction; too robotic or too incoherent |
Code examples
OpenAI — temperature and Top-P
# OpenAI API call with temperature and top_p
# pip install openai
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def generate(prompt: str, temperature: float = 0.7, top_p: float = 0.95) -> str:
    """Generate text with configurable sampling parameters."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=512,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Deterministic factual extraction
    factual = generate(
        "List the three primary colors.",
        temperature=0.0,
        top_p=1.0,
    )
    print("Factual:", factual)

    # Creative brainstorming
    creative = generate(
        "Suggest five unusual names for a café that serves only breakfast.",
        temperature=0.9,
        top_p=0.95,
    )
    print("Creative:", creative)
Anthropic — temperature and Top-K
# Anthropic API call with temperature and top_k
# pip install anthropic
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def generate(prompt: str, temperature: float = 0.7, top_k: int = 50) -> str:
    """Generate text with configurable temperature and top-k sampling."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        temperature=temperature,
        top_k=top_k,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text


if __name__ == "__main__":
    # Near-deterministic output for structured tasks
    deterministic = generate(
        "Translate 'hello world' into French, German, and Japanese.",
        temperature=0.0,
        top_k=1,
    )
    print("Deterministic:", deterministic)

    # Creative output with broader candidate pool
    creative = generate(
        "Write the opening sentence of a science fiction novel set on Europa.",
        temperature=1.0,
        top_k=250,
    )
    print("Creative:", creative)
Practical resources
- OpenAI — API reference: temperature and top_p — Official parameter documentation with valid ranges and defaults
- Anthropic — API reference: temperature, top_k, top_p — Anthropic's parameter reference including top_k (not available in OpenAI)
- "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2020) — Original paper introducing Top-P / nucleus sampling with motivation and empirical results
- Hugging Face — Text generation strategies — Comprehensive guide to sampling strategies including greedy, beam search, temperature, Top-K, and Top-P
- Lilian Weng — Controllable text generation — Deep-dive blog post covering sampling methods in the context of controllable generation