Self-evaluation and calibration

Definition

Self-evaluation refers to prompting a language model to critique, verify, or score its own previously generated output. Rather than treating the model's first response as final, a self-evaluation step asks the model to act as its own reviewer — checking for factual errors, logical inconsistencies, incomplete reasoning, or failure to follow instructions — and then either to flag problems or to generate an improved response. The model uses the same weights and context window for both roles, which is both a strength (no additional model is needed) and a fundamental limitation (the model may have systematic blind spots it cannot self-detect).

Calibration is the narrower, quantitative dimension of self-evaluation. A model is well-calibrated if its expressed confidence matches its empirical accuracy: when it says it is 80% confident, it should be correct roughly 80% of the time. Most LLMs are poorly calibrated out of the box — they express high confidence even on questions they answer incorrectly, a phenomenon known as overconfidence or epistemic overreach. Calibration techniques prompt the model to produce an explicit numeric confidence score alongside each answer, and then the system can use that score to route uncertain answers to human review, to trigger additional verification steps, or to abstain from answering altogether.
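Given a labeled sample, this match between confidence and accuracy can be measured directly: bin answers by stated confidence and compare each bin's average confidence to its empirical accuracy. A minimal sketch of this computation (expected calibration error); the function name and bin count are illustrative:

```python
def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Bin predictions by stated confidence (0.0-1.0) and return the
    accuracy-weighted gap between average confidence and accuracy per bin."""
    assert len(confidences) == len(correct)
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to exactly one bin (last bin includes 1.0).
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated sample (80% confidence, 80% accuracy) scores 0; an overconfident one (90% confidence, 25% accuracy) scores the gap, 0.65.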

Together, self-evaluation and calibration address two distinct but related failure modes. Self-evaluation addresses correctness: the model produced an answer, but is it right? Calibration addresses uncertainty awareness: does the model know when it doesn't know? Both are necessary for deploying LLMs in high-stakes settings. A model that catches its own errors is more reliable; a model that knows what it doesn't know is more trustworthy. The techniques covered here — self-critique, confidence scoring, and chain-of-verification — are increasingly standard components of production LLM pipelines.

How it works

Self-critique

Self-critique is the simplest self-evaluation method. After generating an initial response, you append a second prompt that asks the model to review its own output against explicit criteria. Good self-critique prompts are specific about what to check: factual accuracy, logical consistency, completeness, instruction adherence, tone, or safety. Vague prompts like "Is this response good?" produce shallow, superficial critiques. Specific prompts like "List any factual claims in the response that you are less than 90% confident about, and explain why" produce actionable feedback.

The quality of self-critique improves substantially when you instruct the model to adopt an adversarial stance — to actively look for problems rather than to confirm the response is fine. Phrases like "Challenge each key claim", "Find at least one flaw", and "What would a skeptic object to?" bias the model toward useful criticism rather than validation. Constitutional AI (Anthropic, 2022) systematizes this by defining a set of "principles" the model must check the response against before revising — effectively creating a structured critique rubric that can be audited.

A critical failure mode of self-critique is sycophantic validation: the model praises its own response and finds no problems, especially when the original response was already plausible-sounding but wrong. This is most pronounced in smaller models and least pronounced in models that have been fine-tuned with critique data. Mitigations include: using a separate model instance for critique, injecting deliberate errors into the draft to test if the critique step catches them, and requiring the critique to be a structured list rather than free prose (making "no issues" a harder claim to defend).
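The error-injection mitigation above can be automated as a regression test for the critique prompt. A minimal sketch, where `critique_fn` is a hypothetical stand-in for whatever LLM call produces the critique:

```python
def critique_catches_injected_error(
    critique_fn, question: str, draft: str,
    true_fact: str, injected_error: str,
) -> bool:
    """Corrupt a known-good draft with a deliberate error and check
    whether the critique step mentions it. Returns True if caught."""
    corrupted = draft.replace(true_fact, injected_error)
    assert corrupted != draft, "injection must actually change the draft"
    critique = critique_fn(question, corrupted)
    # Crude detection: a useful critique should name the injected error.
    return injected_error.lower() in critique.lower()
```

Running this periodically over a handful of known facts gives a catch rate; if it drops, the critique prompt has likely regressed into sycophantic validation.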

Calibration and confidence scoring

Confidence scoring prompts ask the model to produce an explicit probability or ordinal rating alongside each answer. A minimal version is a simple request appended to the answer prompt: "After your answer, state your confidence as a percentage from 0 to 100, where 100 means you are certain and 0 means you are guessing." More sophisticated versions ask for a breakdown by claim: "For each factual statement in your response, rate your confidence (high / medium / low) and identify the source of uncertainty."

Numeric confidence scores from LLMs must be treated with skepticism. Raw verbalized probabilities are not well-calibrated in the statistical sense — a model that says "70% confident" is not systematically right 70% of the time on those questions. However, they carry useful ordinal information: questions where the model reports low confidence tend to be harder and more error-prone than questions where it reports high confidence. This means verbalized confidence scores are useful for ranking and routing (send low-confidence answers to review) even if they are not useful for exact probability estimation.

Calibration can be improved post-hoc through temperature scaling or Platt scaling applied to the model's log-probabilities, but both require a labeled dataset. At the prompt level, you can improve relative calibration by asking the model to compare its confidence against reference questions of known difficulty (e.g., "Are you as confident in this answer as you would be naming the capital of France, or as you would be recalling an obscure historical date?").

Chain-of-verification

Chain-of-verification (CoVe, Dhuliawala et al., 2023) structures self-evaluation as a multi-step verification pipeline: generate a baseline response, then explicitly plan a set of verification questions that would confirm or refute the key claims in that response, answer those verification questions independently (without looking at the original response to reduce confirmation bias), and finally produce a revised response informed by the verification results. This decomposition is important because it forces the model to separate claim generation from claim verification, reducing the chance that the same reasoning error propagates through both steps.

The verification questions should be atomic — each should test a single, specific sub-claim. For example, if the baseline response states "Python 3.10 introduced structural pattern matching and the walrus operator", the verification questions should be: "In which Python version was structural pattern matching introduced?" and "In which Python version was the walrus operator introduced?" Answering these independently often surfaces factual errors that the original response confidently asserted.

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| The task is high-stakes and factual correctness is critical (medical, legal, financial) | Latency is a hard constraint — self-evaluation adds at least one full inference round-trip |
| You want a built-in uncertainty signal without a separate evaluator model | The model's domain is one where self-evaluation is systematically unreliable (e.g., very recent events beyond training cutoff) |
| Output quality is highly variable across runs and you need a filtering mechanism | The task is simple and well-constrained — self-evaluation overhead exceeds accuracy benefit |
| You need to route uncertain answers to human review automatically | The model is too small to produce reliable self-critiques (< 7B parameters typically yields poor self-evaluation) |
| Responses contain multiple independent factual claims that can be verified atomically | You need exact probability calibration — verbalized confidence scores are not statistically calibrated |
| Building a pipeline where the model must detect its own hallucinations | The original generation is already at ceiling accuracy — self-critique adds cost with no accuracy gain |

Comparisons

| Criterion | Self-evaluation | Self-consistency | External evaluation |
| --- | --- | --- | --- |
| Additional model calls | 1–3 (critique, score, verify) | N (typically 10–40) | 1 (separate evaluator) |
| Requires separate model | No — same model reviews itself | No | Yes — typically a stronger or specialized model |
| Catches factual errors | Yes, if self-critique is well-prompted | Partially — inconsistent facts may survive majority vote | Yes, more reliably |
| Provides uncertainty score | Yes — explicit confidence rating | Implicit — vote spread is a proxy for confidence | Yes — evaluator can output a score |
| Reduces hallucination | Yes, especially with CoVe | Partially — voting reduces but does not eliminate hallucination | More reliably, but adds cost and latency |
| Implementation effort | Moderate — requires careful critique prompt design | Low — sample N times and vote | High — requires evaluator prompt, separate API call, possibly a separate model |
| Best use case | Single-turn high-stakes Q&A, factual generation | Multi-step math and reasoning | Enterprise pipelines with strong correctness requirements |

Code examples

Self-evaluation with critique step using Anthropic SDK

# Self-evaluation pipeline: generate → critique → score → revise
# pip install anthropic

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-opus-4-5"


def generate_initial(question: str) -> str:
    """Step 1: Generate an initial response."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text.strip()


def critique_response(question: str, response: str) -> str:
    """Step 2: Critique the initial response for errors and gaps."""
    prompt = f"""You are a rigorous fact-checker and critic. Review the response below and identify:
1. Any factual claims you are less than fully confident about
2. Logical inconsistencies or gaps in reasoning
3. Missing context that would be important for the user

Question: {question}

Response to critique:
{response}

Provide a structured critique. If you find no issues, you must still explain why you believe the response is correct. Do not simply validate the response."""

    critique = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return critique.content[0].text.strip()


def score_confidence(question: str, response: str, critique: str) -> dict:
    """Step 3: Produce an explicit confidence score based on the critique."""
    prompt = f"""Given the question, the response, and the critique below, assign a confidence score.

Question: {question}

Response:
{response}

Critique:
{critique}

Output in this exact format:
CONFIDENCE: [integer 0-100]
REASONING: [one sentence explaining the score]
SHOULD_REVISE: [yes/no]"""

    result = client.messages.create(
        model=MODEL,
        max_tokens=128,
        messages=[{"role": "user", "content": prompt}],
    )
    text = result.content[0].text.strip()

    # Parse structured output
    confidence, reasoning, should_revise = None, "", False
    for line in text.splitlines():
        if line.startswith("CONFIDENCE:"):
            try:
                confidence = int(line.split(":", 1)[1].strip())
            except ValueError:
                pass
        elif line.startswith("REASONING:"):
            reasoning = line.split(":", 1)[1].strip()
        elif line.startswith("SHOULD_REVISE:"):
            should_revise = "yes" in line.lower()

    return {"confidence": confidence, "reasoning": reasoning, "should_revise": should_revise}


def revise_response(question: str, initial: str, critique: str) -> str:
    """Step 4: Produce a revised response informed by the critique."""
    prompt = f"""Revise the response below to address the issues identified in the critique.
Preserve correct information. Be explicit about any remaining uncertainty.

Question: {question}

Original response:
{initial}

Critique to address:
{critique}

Revised response:"""

    revised = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return revised.content[0].text.strip()


def self_evaluate(question: str, confidence_threshold: int = 75) -> dict:
    """Full self-evaluation pipeline: generate, critique, score, conditionally revise."""
    print("=== Step 1: Generating initial response ===")
    initial = generate_initial(question)
    print(initial[:200], "...\n" if len(initial) > 200 else "\n")

    print("=== Step 2: Critiquing response ===")
    critique = critique_response(question, initial)
    print(critique[:200], "...\n" if len(critique) > 200 else "\n")

    print("=== Step 3: Scoring confidence ===")
    score = score_confidence(question, initial, critique)
    print(f"Confidence : {score['confidence']}")
    print(f"Reasoning : {score['reasoning']}")
    print(f"Revise? : {score['should_revise']}\n")

    final = initial
    if score["should_revise"] or (score["confidence"] is not None and score["confidence"] < confidence_threshold):
        print("=== Step 4: Revising response ===")
        final = revise_response(question, initial, critique)
        print(final[:200], "...\n" if len(final) > 200 else "\n")
    else:
        print("=== Step 4: Skipped — confidence above threshold ===\n")

    return {
        "question": question,
        "initial_response": initial,
        "critique": critique,
        "confidence_score": score,
        "final_response": final,
        "was_revised": final != initial,
    }


if __name__ == "__main__":
    q = ("What were the main causes of the 2008 financial crisis, "
         "and which regulatory changes were enacted in response?")
    result = self_evaluate(q, confidence_threshold=80)
    print("=== Final answer ===")
    print(result["final_response"])
    print(f"\nRevised: {result['was_revised']}")
    print(f"Confidence: {result['confidence_score']['confidence']}")

Chain-of-verification for factual claims

# Chain-of-Verification (CoVe): decompose claims, verify independently, revise
# pip install anthropic

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-opus-4-5"


def extract_verification_questions(response: str) -> list[str]:
    """Generate atomic verification questions for each factual claim."""
    prompt = f"""Read the response below and generate a list of atomic verification questions
— one per distinct factual claim. Each question should be answerable independently
without referring to the original response.

Response:
{response}

Output as a numbered list of questions only. No preamble."""

    result = client.messages.create(
        model=MODEL,
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    text = result.content[0].text.strip()
    questions = []
    for line in text.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Strip leading number and punctuation
            q = line.lstrip("0123456789.)- ").strip()
            if q:
                questions.append(q)
    return questions


def verify_claim(question: str) -> dict:
    """Answer a single verification question independently."""
    prompt = f"""Answer the following question as accurately as possible.
If you are uncertain, say so explicitly and explain why.

Question: {question}

Answer:"""

    result = client.messages.create(
        model=MODEL,
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = result.content[0].text.strip()
    uncertain = any(w in answer.lower() for w in ("uncertain", "unsure", "not sure", "don't know", "unclear"))
    return {"question": question, "answer": answer, "uncertain": uncertain}


def revise_with_verifications(original_response: str, verifications: list[dict]) -> str:
    """Produce a revised response informed by independent verification results."""
    verification_block = "\n".join(
        f"Q: {v['question']}\nA: {v['answer']}\n" for v in verifications
    )
    prompt = f"""Revise the response below using the independent verification answers provided.
Correct any inaccuracies. Where verifications indicate uncertainty, acknowledge that uncertainty explicitly.

Original response:
{original_response}

Independent verifications:
{verification_block}

Revised response:"""

    result = client.messages.create(
        model=MODEL,
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.content[0].text.strip()


def chain_of_verification(question: str) -> dict:
    """Full CoVe pipeline for a factual question."""
    # Step 1: Baseline response
    baseline = client.messages.create(
        model=MODEL,
        max_tokens=400,
        messages=[{"role": "user", "content": question}],
    ).content[0].text.strip()

    # Step 2: Plan verification questions
    vqs = extract_verification_questions(baseline)
    print(f"Generated {len(vqs)} verification questions.")

    # Step 3: Answer each verification question independently
    verifications = [verify_claim(q) for q in vqs]
    uncertain_count = sum(1 for v in verifications if v["uncertain"])
    print(f"Uncertain claims: {uncertain_count}/{len(verifications)}")

    # Step 4: Revise using verification results
    revised = revise_with_verifications(baseline, verifications)

    return {
        "question": question,
        "baseline": baseline,
        "verification_questions": vqs,
        "verifications": verifications,
        "revised": revised,
        "uncertain_claims": uncertain_count,
    }


if __name__ == "__main__":
    q = "Summarize the key milestones in the development of transformer models from 2017 to 2023."
    result = chain_of_verification(q)
    print("\n=== Baseline ===")
    print(result["baseline"])
    print("\n=== Revised (after CoVe) ===")
    print(result["revised"])
    print(f"\nUncertain claims flagged: {result['uncertain_claims']}/{len(result['verifications'])}")
