Prompt ensembling
Definition
Prompt ensembling is a prompting technique that generates multiple structurally different formulations of the same question or task, submits all of them to a language model, and then combines the resulting outputs into a single final answer. The core intuition is borrowed from classical machine learning ensembles (bagging, boosting, stacking): no single predictor is perfect, but a diverse committee of imperfect predictors tends to be more reliable than any individual member, because their errors are partially uncorrelated and therefore cancel out in aggregation.
The critical distinction between prompt ensembling and self-consistency is the source of diversity. In self-consistency, you run the same prompt N times at temperature > 0 and rely on stochastic sampling to produce diverse reasoning paths. In prompt ensembling, you deliberately craft different prompts — varying the framing, the role assignment, the instruction phrasing, the few-shot examples, or the output format — and run each one (typically at temperature 0 or low temperature) to produce diverse yet deterministic outputs. Self-consistency exploits variance introduced by sampling; prompt ensembling exploits variance introduced by prompt design. In practice, the two approaches are complementary and can be combined.
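The distinction can be made concrete in a few lines. This is a minimal sketch, not a full implementation: `llm` stands in for any chat-completion call (its signature here is an assumption for illustration), and both functions reduce answers by majority vote.

```python
from collections import Counter

def self_consistency(llm, prompt: str, n: int = 5) -> str:
    """Diversity from sampling: the SAME prompt, run n times at temperature > 0."""
    answers = [llm(prompt, temperature=0.7) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def prompt_ensemble(llm, prompts: list[str]) -> str:
    """Diversity from design: K DIFFERENT prompts, each run once at temperature 0."""
    answers = [llm(p, temperature=0.0) for p in prompts]
    return Counter(answers).most_common(1)[0][0]
```

The two loops are structurally identical; only the source of diversity (the sampled `temperature` versus the `prompts` list) differs, which is why the approaches compose naturally.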
Prompt ensembling is especially valuable in two scenarios. First, when you are uncertain which prompt formulation is optimal for a task and cannot evaluate alternatives at scale — running multiple candidates and voting over their outputs gives you the benefit of the best prompt without having to identify it in advance. Second, when a task is high-stakes and a single prompt's failure mode is unacceptable — an ensemble provides a soft audit trail, because the spread of votes across different answers is a direct signal of the model's uncertainty. The main cost is latency and tokens: K prompt variants require K inference calls, which can be parallelized but not eliminated.
How it works
Prompt variation strategies
The quality of an ensemble depends heavily on the diversity of the prompt variants. If all variants are superficially different but structurally identical, the ensemble degenerates toward repeated sampling. Effective variation strategies include:
Role and persona variation. Assigning different expert personas (e.g., "You are a cautious medical doctor", "You are a data scientist", "You are a pragmatic engineer") shifts the model's prior over plausible answers and activates different knowledge registers. Role variation is especially effective for tasks with multiple valid framings.
Instruction phrasing variation. The same task can be phrased as a question ("What is the risk level of...?"), a command ("Assess the risk level of..."), or a completion ("The risk level of ... is"), and these surface differences measurably change the model's output distribution. Paraphrasing the core instruction is the lowest-effort form of variation.
Few-shot example variation. Using different sets of in-context examples changes which part of the model's knowledge the few-shot context activates. Rotating through example sets drawn from different sub-domains of the training distribution increases ensemble diversity substantially, especially for classification tasks.
Chain-of-thought vs. direct answer variation. Including one or more CoT variants alongside direct-answer variants combines the reasoning-quality benefits of CoT with the speed benefits of direct prompting. The CoT variants typically receive more weight in aggregation because they are more reliable, but direct variants can override in cases where CoT leads the model into over-thinking simple questions.
Output format variation. Asking for the answer as a JSON object, as a numbered list, or as a free-text sentence can elicit different levels of precision. Structured output variants are easier to parse and aggregate programmatically.
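Several of these strategies can be crossed programmatically rather than hand-written one by one. The sketch below combines role variation with instruction-phrasing variation via a Cartesian product; the specific roles and phrasings are illustrative placeholders, and in practice you would prune combinations that add no real diversity.

```python
from itertools import product

ROLES = [
    "",  # no persona
    "You are a sentiment analysis expert. ",
    "You are a meticulous customer-support analyst. ",
]
PHRASINGS = [
    "Is the following review positive, negative, or neutral? Answer with one word.",
    "Classify the sentiment of the review below as positive, negative, or neutral.",
]

def build_variants(roles: list[str], phrasings: list[str]) -> list[str]:
    """Cross personas with instruction phrasings to get a pool of templates."""
    return [
        f"{role}{phrasing}\n\nReview: {{review}}"
        for role, phrasing in product(roles, phrasings)
    ]
```

Three roles crossed with two phrasings yield six templates, each still containing the `{review}` placeholder for later formatting.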
Aggregation methods
Once you have K outputs, you need to reduce them to a single answer. The choice of aggregation method should match the output type:
Majority vote works best for discrete outputs (classification labels, short factual answers, multiple-choice selections). It is robust to adversarial or confused variants, requires no additional model calls, and directly mimics how self-consistency operates. Ties can be broken by log-probability or by deferring to a designated "trusted" variant.
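A minimal sketch of majority voting with the tie-break described above; `trusted_index` (a name introduced here for illustration) designates which variant's answer wins a tie.

```python
from collections import Counter

def majority_vote(labels: list[str], trusted_index: int = 0) -> str:
    """Majority vote over discrete labels; ties defer to the trusted variant."""
    counts = Counter(labels)
    top_count = counts.most_common(1)[0][1]
    tied = [label for label, c in counts.items() if c == top_count]
    if len(tied) > 1:
        # Tie: fall back to the answer given by the designated trusted variant
        return labels[trusted_index]
    return tied[0]
```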
Score averaging is appropriate when each variant returns a numeric score or probability rather than a label. Averaging is sensitive to outliers; median aggregation is more robust when individual variants can produce extreme values.
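The outlier sensitivity is easy to see with the standard library; the scores below are made-up values for illustration.

```python
import statistics

scores = [0.72, 0.68, 0.70, 0.99]  # one variant returns an extreme value

mean_score = statistics.mean(scores)      # pulled toward the outlier: 0.7725
median_score = statistics.median(scores)  # robust to it: 0.71
```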
Meta-prompt (LLM-as-judge) aggregation sends all K outputs to a second LLM call that is instructed to synthesize or select the best answer. This is the most powerful but most expensive method, and it introduces a second point of LLM failure. It is most useful when the task requires open-ended generation (summaries, code, essays) where majority vote is not applicable.
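One possible shape for the judge's meta-prompt, sketched as a plain string builder (the wording and function name are assumptions, not a fixed recipe); the resulting string would be sent in a second LLM call.

```python
def build_judge_prompt(question: str, candidates: list[str]) -> str:
    """Assemble a meta-prompt asking a second LLM to select or synthesize
    the best answer from K candidate outputs."""
    numbered = "\n".join(f"Answer {i + 1}: {c}" for i, c in enumerate(candidates))
    return (
        f"Question: {question}\n\n"
        f"Below are {len(candidates)} candidate answers produced by different prompts.\n"
        f"{numbered}\n\n"
        "Select or synthesize the single best answer. Respond with the final answer only."
    )
```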
Weighted voting assigns different weights to different variants based on their historical accuracy on a held-out validation set. If you have labeled data and can measure which variants perform best, weighting significantly outperforms uniform voting — but it requires calibration effort upfront.
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| You are uncertain which prompt phrasing works best and cannot evaluate them individually at scale | Latency is a hard constraint — K parallel calls still have the latency of the slowest call |
| The task is high-stakes and a single prompt's failure mode is unacceptable | Token budget is severely limited and you cannot afford K completions |
| Outputs from different prompt framings provide complementary perspectives (e.g., medical diagnosis from multiple specialist angles) | The model already achieves ceiling accuracy with a single well-tuned prompt — diminishing returns |
| You want a built-in uncertainty signal (spread of votes = model disagreement) | The output space is continuous or open-ended in a way that makes voting or averaging meaningless |
| You are building a production pipeline where prompt sensitivity must be dampened | You lack the engineering infrastructure to run and aggregate parallel LLM calls |
Comparisons
| Criterion | Prompt ensembling | Self-consistency | Single prompt |
|---|---|---|---|
| Source of diversity | Different prompt designs | Stochastic sampling of one prompt | None |
| Number of LLM calls | K (number of variants, typically 3–10) | N (typically 10–40) | 1 |
| Temperature | Low (0–0.3) per variant | High (0.5–0.8) | Task-dependent |
| Accuracy improvement | High for tasks sensitive to prompt phrasing | High for multi-step reasoning | Baseline |
| Requires prompt engineering effort | Yes — designing diverse variants | No — only one prompt needed | Moderate |
| Handles open-ended output | Yes, via meta-prompt aggregation | No — majority vote requires discrete answers | Yes |
| Best use case | Tasks with prompt sensitivity or multiple valid framings | Math, symbolic reasoning, factual QA | Simple, well-defined tasks with a known good prompt |
Code examples
Prompt ensembling with multiple templates using OpenAI
# Prompt ensembling: run K prompt variants and aggregate by majority vote
# pip install openai
import os
from collections import Counter

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Five structurally different prompt variants for the same classification task
PROMPT_VARIANTS = [
    # 1. Direct instruction
    "Is the following customer review positive, negative, or neutral? "
    "Reply with exactly one word.\n\nReview: {review}",
    # 2. Role-play framing
    "You are a sentiment analysis expert. Classify the sentiment of the "
    "review below as positive, negative, or neutral. Output only the label.\n\nReview: {review}",
    # 3. Few-shot examples
    "Review: 'The product broke in two days.' → negative\n"
    "Review: 'Decent quality for the price.' → neutral\n"
    "Review: 'Absolutely love it, will buy again!' → positive\n"
    "Review: '{review}' →",
    # 4. Chain-of-thought variant
    "Analyze the sentiment of this review step by step, then state the "
    "final label (positive / negative / neutral) on the last line.\n\nReview: {review}",
    # 5. Completion framing
    "The overall sentiment expressed in the review '{review}' is",
]

def call_variant(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Call the LLM with a single prompt variant and return the raw response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=300,  # generous cap so the chain-of-thought variant is not truncated
    )
    return resp.choices[0].message.content.strip()

def extract_label(text: str) -> str | None:
    """Extract a sentiment label from raw model output.

    The last line is checked first so that chain-of-thought variants, which
    mention several labels while reasoning, are read from their final answer.
    """
    for chunk in text.strip().splitlines()[-1:] + [text]:
        chunk_lower = chunk.lower()
        for label in ("positive", "negative", "neutral"):
            if label in chunk_lower:
                return label
    return None

def ensemble_sentiment(review: str) -> dict:
    """Run all prompt variants and aggregate by majority vote."""
    raw_outputs, labels = [], []
    for i, template in enumerate(PROMPT_VARIANTS):
        prompt = template.format(review=review)
        raw = call_variant(prompt)
        label = extract_label(raw)
        raw_outputs.append(raw)
        if label:
            labels.append(label)
        print(f"  Variant {i + 1}: {label!r} (raw: {raw[:60]!r})")
    if not labels:
        return {"answer": None, "votes": {}}
    counts = Counter(labels)
    winner, top_votes = counts.most_common(1)[0]
    return {
        "answer": winner,
        "confidence": top_votes / len(labels),
        "votes": dict(counts),
        "raw_outputs": raw_outputs,
    }

if __name__ == "__main__":
    review = (
        "The delivery was fast but the item looks nothing like the photos. "
        "I'm disappointed and won't order again."
    )
    result = ensemble_sentiment(review)
    print(f"\nFinal answer : {result['answer']}")
    print(f"Confidence   : {result['confidence']:.0%}")
    print(f"Vote counts  : {result['votes']}")
Weighted ensemble with a held-out validation set
# Weighted prompt ensembling: calibrate variant weights from a validation set
# pip install openai scikit-learn
import os
from collections import defaultdict

from openai import OpenAI
from sklearn.metrics import accuracy_score

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def evaluate_variant(template: str, examples: list[dict]) -> float:
    """Return accuracy of a single prompt variant on a labeled dataset."""
    preds = []
    for ex in examples:
        prompt = template.format(review=ex["text"])
        raw = call_variant(prompt)  # reuses call_variant / extract_label from above
        preds.append(extract_label(raw) or "neutral")
    return accuracy_score([ex["label"] for ex in examples], preds)

def weighted_ensemble(review: str, templates: list[str], weights: list[float]) -> str:
    """Aggregate variant outputs with per-variant weights."""
    scores: dict[str, float] = defaultdict(float)
    for template, weight in zip(templates, weights):
        raw = call_variant(template.format(review=review))
        label = extract_label(raw)
        if label:
            scores[label] += weight
    return max(scores, key=scores.__getitem__) if scores else "neutral"

if __name__ == "__main__":
    # Dummy validation set — replace with real labeled examples
    val_set = [
        {"text": "Great product!", "label": "positive"},
        {"text": "Terrible quality.", "label": "negative"},
        {"text": "It's okay I guess.", "label": "neutral"},
    ]
    # Calibrate weights (accuracy on the validation set)
    weights = [evaluate_variant(t, val_set) for t in PROMPT_VARIANTS]
    print("Variant weights:", [f"{w:.2f}" for w in weights])

    review = "Arrived on time but packaging was damaged."
    answer = weighted_ensemble(review, PROMPT_VARIANTS, weights)
    print("Weighted ensemble answer:", answer)
Practical resources
- Diverse Demonstrations Improve In-context Compositional Generalization (Levy et al., 2022) — Shows that diverse few-shot examples, the backbone of prompt variation, improve generalization significantly over randomly sampled demonstrations.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022) — The closest relative of prompt ensembling; essential background for understanding aggregation over multiple LLM outputs.
- Prompt Sensitivity and Prompt Ensembling for LLMs (Mizrahi et al., 2024) — Directly studies how much LLM accuracy varies across paraphrased prompts and demonstrates that ensembling over paraphrases closes most of the gap.
- Universal Self-Consistency for Large Language Model Generation (Chen et al., 2023) — Extends self-consistency to open-ended generation via meta-prompt aggregation, bridging the gap between majority-vote ensembling and free-form outputs.