Automatic Prompt Engineering (APE)

Definition

Automatic Prompt Engineering (APE) is the practice of using a language model to generate and optimize prompt instructions rather than writing them by hand. Introduced by Zhou et al. (2022) in the paper Large Language Models Are Human-Level Prompt Engineers, APE frames prompt design as a program synthesis problem: given a set of input-output demonstration pairs, find the natural-language instruction that, when prepended to a prompt, maximizes task performance on a held-out evaluation set. The searching, scoring, and refinement of candidate instructions are all performed programmatically — the human's role shifts from prompt author to task definer and metric designer.

The motivation for automating prompt design is practical. Manual prompt engineering is time-consuming, brittle, and biased by the human engineer's intuitions about how language models process text. Small changes in wording — "Think step by step" vs "Let's think carefully step by step" — produce measurable accuracy differences that are impossible to predict without empirical testing. APE replaces this guesswork with systematic search: generate a large pool of candidate instructions, evaluate each on a benchmark, and keep the best performers. This is the same design philosophy behind hyperparameter search in classical ML — humans specify the objective, machines do the search.

APE is distinct from soft prompt tuning (which optimizes continuous token embeddings via gradient descent) and from fine-tuning (which updates model weights). APE operates entirely in natural-language space using frozen models. This makes it model-agnostic, interpretable — you can read and understand the winning instruction — and deployable without any training infrastructure. The trade-off is that the discrete search space of natural language is vast and non-differentiable, so APE relies on sampling, scoring heuristics, and iterative refinement rather than gradient-based optimization.

How it works

Candidate generation

The APE loop starts with a set of demonstration examples — input-output pairs that illustrate the target task. These examples are passed to a meta-LLM (the same or a different model) with a meta-prompt that asks it to infer the instruction that would produce the given outputs from the given inputs. Typical meta-prompts look like: "Here are input-output pairs. What is the instruction that produces these outputs? Generate 10 diverse candidate instructions." By sampling at temperature > 0, the meta-LLM produces a diverse pool of candidate instructions that differ in wording, framing, and specificity. The quality and diversity of this initial pool directly determine the ceiling of the optimization.
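Concretely, this first step is just string templating. The sketch below uses a hypothetical helper (the template wording is an assumption, not the paper's exact meta-prompt) to show how demonstration pairs become a meta-prompt:

```python
def build_meta_prompt(demos: list[tuple[str, str]], n: int = 10) -> str:
    """Format demo pairs into a meta-prompt asking for N candidate instructions."""
    demo_text = "\n".join(f'Input: "{inp}"\nOutput: "{out}"' for inp, out in demos)
    return (
        "Here are input-output example pairs:\n\n"
        f"{demo_text}\n\n"
        f"Write {n} diverse instructions that would produce these outputs "
        "from these inputs. Return one instruction per line."
    )


prompt = build_meta_prompt([("Great film!", "positive"), ("Awful.", "negative")], n=5)
print(prompt)
```

Sampling this prompt several times at temperature > 0 then yields the candidate pool.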

Scoring

Each candidate instruction is instantiated as a prefix in the prompt (or as the system message) and evaluated against a held-out benchmark. The scoring function is task-specific: accuracy for classification, execution correctness for code generation, ROUGE or BERTScore for summarization, or a secondary LLM judge for open-ended tasks. The key design decision is whether the score is computed with the meta-LLM itself (using log-probability estimates of correct outputs) or with a separate task-specific evaluator. Log-probability scoring is faster but can overfit to the meta-LLM's calibration. Separate evaluator scoring is more reliable but requires labeled data.

Iterative refinement

After the initial scoring, the top-K candidate instructions are selected for refinement. The meta-LLM is prompted to paraphrase, extend, or combine the best candidates — yielding a new pool of variants that are semantically related but textually distinct. This refinement loop runs for a fixed number of iterations or until a target score threshold is reached. Each iteration narrows the search around promising regions of instruction space, analogous to evolutionary search or hill climbing over a discrete landscape. In practice, one or two rounds of refinement after a large initial pool (N ≥ 50) tend to recover most of the achievable gain.
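The select-and-refine control flow is independent of any LLM API. A minimal sketch, with hypothetical score_fn and mutate_fn stand-ins for the real scorer and meta-LLM, shows the hill-climbing structure:

```python
import random


def refine_loop(candidates, score_fn, mutate_fn, top_k=3, rounds=2):
    """Hill-climb over discrete instruction space: score, keep top-K, mutate."""
    best = max(candidates, key=score_fn)
    for _ in range(rounds):
        ranked = sorted(candidates, key=score_fn, reverse=True)[:top_k]
        # Next pool: survivors plus variants of each survivor
        candidates = ranked + [v for c in ranked for v in mutate_fn(c)]
        best = max([best] + candidates, key=score_fn)
    return best


# Toy stand-ins: "fitness" is instruction length, mutation appends a word
random.seed(0)
words = ["carefully", "step", "precisely"]
result = refine_loop(
    ["Classify.", "Classify the text."],
    score_fn=len,
    mutate_fn=lambda c: [c + " " + random.choice(words) for _ in range(2)],
)
print(result)
```

In real APE, score_fn would call the evaluator on the held-out set and mutate_fn would be a paraphrasing call to the meta-LLM.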

Comparisons

| Criterion | APE | Manual prompt engineering | Fine-tuning |
|---|---|---|---|
| Human effort | Low — define task and metric | High — iterative authoring and testing | High — data collection and training runs |
| Requires labeled data | Yes — for scoring | No — outputs can be judged by inspection | Yes — typically thousands of examples |
| Model weights updated | No | No | Yes |
| Output interpretable | Yes — natural-language instruction | Yes | No — weight changes are opaque |
| Generalizes across models | Yes — re-run search per model | Partially | No — tied to base model |
| Latency at inference | None — no runtime overhead | None | None |
| Cost | Medium — N × M evaluation calls | Low | High — GPU time |
| Best for | Tasks with a clear metric and ≥ 50 examples | Novel tasks without a metric | High-volume tasks where accuracy gains justify training |
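The "N × M evaluation calls" cost is easy to make concrete: with N initial candidates, M eval examples, and R refinement rounds that each score roughly N/2 variants, the call budget grows multiplicatively. A sketch, assuming one meta-generation call per round (the loop structure here is an assumption, matching a simple APE loop):

```python
def ape_call_budget(n_candidates: int, m_eval: int, rounds: int = 1) -> int:
    """Rough API-call count for one APE run.

    One meta-call per generation/refinement round, plus one scoring call
    per (candidate, eval example) pair; refinement rounds score n/2 variants.
    """
    generation = 1 + rounds
    initial = n_candidates * m_eval
    refined = rounds * (n_candidates // 2) * m_eval
    return generation + initial + refined


print(ape_call_budget(n_candidates=50, m_eval=100, rounds=1))  # 7502
```

At N = 50 and M = 100, a single refinement round already costs thousands of calls, which is why log-probability scoring or batched evaluation matters at scale.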

When to use / When NOT to use

| Use when | Avoid when |
|---|---|
| You have a labeled evaluation set and can define a clear scoring metric | The task has no reliable automated metric — APE cannot search without a signal |
| Manual prompt iteration is taking more than a day and accuracy still plateaus | You need a result immediately — APE requires multiple LLM API calls for evaluation |
| You are deploying the same prompt to many users and even 1–2% accuracy gains matter | Your demonstration pool is too small (< 10 examples) — scoring will be noisy |
| You want to audit the best-found instruction for safety before deployment | The task requires creativity or subjective judgment where a single metric is misleading |
| You are using DSPy or a similar framework where prompt optimization is built in | Fine-tuning is already planned — APE optimizes prompts, not weights |
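The small-pool caveat can be quantified: measured accuracy on an n-example eval set carries a binomial standard error of sqrt(p(1-p)/n), so below roughly 10 examples the noise dwarfs the 1–2% gains APE is meant to detect. A quick check:

```python
import math


def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of measured accuracy, given true accuracy p and n examples."""
    return math.sqrt(p * (1 - p) / n)


for n in (5, 10, 50, 500):
    print(f"n={n:3d}  std error at p=0.8: {accuracy_std_error(0.8, n):.3f}")
```

With 5 examples the standard error is around 18 accuracy points, so rankings between candidate instructions are mostly noise; at 500 examples it drops below 2 points.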

Code examples

Basic APE loop with OpenAI

# Minimal APE implementation: generate instructions, score, return best
# pip install openai

import os
import re
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ----- Task definition --------------------------------------------------------
# Demonstrations: pairs of (input, expected_output)
DEMOS = [
    ("The movie was absolutely fantastic, I loved every minute.", "positive"),
    ("Terrible film, waste of time and money.", "negative"),
    ("It was okay, nothing special but not bad either.", "neutral"),
    ("A masterpiece of modern cinema.", "positive"),
    ("I walked out after 20 minutes.", "negative"),
]

# Held-out evaluation set for scoring
EVAL_SET = [
    ("A stunning visual experience with weak writing.", "positive"),  # debatable but positive
    ("Boring, predictable, and too long.", "negative"),
    ("I enjoyed it more than I expected.", "positive"),
    ("Neither good nor bad — forgettable.", "neutral"),
    ("One of the best films of the decade.", "positive"),
]


# ----- Step 1: Generate candidate instructions --------------------------------
def generate_instructions(demos: list[tuple[str, str]], n: int = 10) -> list[str]:
    """Ask a meta-LLM to infer N candidate instructions from demo pairs."""
    demo_text = "\n".join(f'Input: "{inp}"\nOutput: "{out}"' for inp, out in demos)
    meta_prompt = (
        f"Here are input-output example pairs for a text classification task:\n\n"
        f"{demo_text}\n\n"
        f"Generate {n} diverse natural-language instructions that, when prepended to "
        f"an input text, would cause a language model to produce the correct output. "
        f"Return one instruction per line, numbered."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": meta_prompt}],
        temperature=0.9,
        max_tokens=800,
    )
    raw = resp.choices[0].message.content
    lines = [re.sub(r"^\d+[\.\)]\s*", "", l).strip() for l in raw.splitlines()]
    return [l for l in lines if len(l) > 20]  # filter out empty / too-short lines


# ----- Step 2: Score an instruction on the eval set --------------------------
def score_instruction(instruction: str, eval_set: list[tuple[str, str]]) -> float:
    """Return accuracy of the instruction on the eval set."""
    correct = 0
    for text, expected in eval_set:
        prompt = f"{instruction}\n\nText: {text}\nLabel:"
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,
        )
        prediction = resp.choices[0].message.content.strip().lower()
        if expected.lower() in prediction:
            correct += 1
    return correct / len(eval_set)


# ----- Step 3: Iterative refinement of top-K instructions --------------------
def refine_instructions(top_instructions: list[str], n_variants: int = 5) -> list[str]:
    """Ask the meta-LLM to paraphrase the top instructions to get variants."""
    instr_text = "\n".join(f"- {i}" for i in top_instructions)
    refine_prompt = (
        f"Here are high-performing instructions for a sentiment classification task:\n"
        f"{instr_text}\n\n"
        f"Generate {n_variants} new instructions that paraphrase or combine the above "
        f"to potentially improve performance. Return one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": refine_prompt}],
        temperature=0.7,
        max_tokens=500,
    )
    raw = resp.choices[0].message.content
    lines = [l.strip().lstrip("- ") for l in raw.splitlines()]
    return [l for l in lines if len(l) > 20]


# ----- APE main loop ---------------------------------------------------------
def run_ape(
    demos: list[tuple[str, str]],
    eval_set: list[tuple[str, str]],
    n_candidates: int = 10,
    top_k: int = 3,
    n_refinement_rounds: int = 1,
) -> dict:
    print("=== APE: Generating initial candidates ===")
    candidates = generate_instructions(demos, n=n_candidates)
    print(f"Generated {len(candidates)} candidates.\n")

    all_scored: list[tuple[str, float]] = []

    for round_num in range(n_refinement_rounds + 1):
        print(f"--- Round {round_num + 1}: Scoring {len(candidates)} instructions ---")
        round_scores = []
        for instr in candidates:
            score = score_instruction(instr, eval_set)
            round_scores.append((instr, score))
            print(f"  [{score:.0%}] {instr[:80]}{'...' if len(instr) > 80 else ''}")
        all_scored.extend(round_scores)

        if round_num < n_refinement_rounds:
            top = [i for i, _ in sorted(round_scores, key=lambda x: -x[1])[:top_k]]
            candidates = refine_instructions(top, n_variants=n_candidates // 2)
        print()

    best_instr, best_score = max(all_scored, key=lambda x: x[1])
    return {"instruction": best_instr, "score": best_score, "all_scored": all_scored}


if __name__ == "__main__":
    result = run_ape(DEMOS, EVAL_SET, n_candidates=8, top_k=3, n_refinement_rounds=1)
    print(f"\n=== Best instruction (accuracy {result['score']:.0%}) ===")
    print(result["instruction"])

Using DSPy for structured APE

# DSPy provides a higher-level abstraction for automatic prompt optimization.
# pip install dspy-ai

import os

import dspy

# Configure DSPy with your LLM backend
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)


# Define the task as a DSPy signature
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a movie review as positive, negative, or neutral."""

    review: str = dspy.InputField(desc="A movie review text")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral")


# Wrap in a module
class SentimentModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict(SentimentClassifier)

    def forward(self, review: str) -> dspy.Prediction:
        return self.classify(review=review)


# Training examples
trainset = [
    dspy.Example(review=inp, sentiment=out).with_inputs("review")
    for inp, out in [
        ("Absolutely loved it!", "positive"),
        ("Worst movie ever.", "negative"),
        ("It was fine, nothing memorable.", "neutral"),
    ]
]


# Use the MIPROv2 optimizer to automatically engineer the prompt
def optimize_with_dspy():
    module = SentimentModule()
    optimizer = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light")
    optimized = optimizer.compile(module, trainset=trainset)
    print(optimized.classify.extended_signature)  # shows the optimized instruction
    return optimized


if __name__ == "__main__":
    optimized_module = optimize_with_dspy()
    result = optimized_module(review="A surprisingly moving and well-acted drama.")
    print(result.sentiment)

Practical resources

See also