# Eval

`@agentskit/eval` runs structured evaluation suites against your agents. Results include accuracy, latency, and token usage — suitable for CI/CD gates and regression tracking.
## When to use

- You have a stable `AgentFn` (string in → string or structured content out) and want regression metrics.
- You gate releases on `minAccuracy` or track token spend across cases.
## Installation

```bash
npm install @agentskit/eval
```
## Running an Eval

```ts
import { runEval } from '@agentskit/eval'

const results = await runEval({
  agent: myAgent,
  suite: mySuite,
})

console.log(results.accuracy)     // 0.92
console.log(results.avgLatencyMs) // 1240
console.log(results.totalTokens)  // 8432
```
## Defining a Suite

An `EvalSuite` groups related test cases under a name:

```ts
import type { EvalSuite } from '@agentskit/eval'

const mySuite: EvalSuite = {
  name: 'Customer support — basic queries',
  cases: [
    {
      input: 'What is your return policy?',
      expected: 'returns', // string: passes if output includes this substring
    },
    {
      input: 'How do I reset my password?',
      expected: (output) => output.toLowerCase().includes('email'),
    },
  ],
}
```
## The AgentFn Type

`runEval` accepts any function that matches `AgentFn`:

```ts
type AgentFnOutput = string | { content: string; tokenUsage?: TokenUsage }
type AgentFn = (input: string) => Promise<AgentFnOutput>
```

Return a plain string for simple cases. Return an object with `tokenUsage` to have token metrics included in the report:

```ts
const agent: AgentFn = async (input) => {
  const result = await myAgent.run(input)
  return {
    content: result.text,
    tokenUsage: {
      inputTokens: result.usage.input_tokens,
      outputTokens: result.usage.output_tokens,
    },
  }
}
```
## Expected Values

| Expected type | Pass condition |
|---|---|
| `string` | Output includes the expected string (case-sensitive) |
| `(output: string) => boolean` | Function returns `true` |
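The two pass conditions can be sketched as a single check. This is an illustrative reimplementation of the documented semantics, not the library's source — the `Expected` type and `passes` function below are local names for the sketch:

```ts
// Local sketch of the documented expected-value semantics (not library source).
type Expected = string | ((output: string) => boolean)

function passes(expected: Expected, output: string): boolean {
  if (typeof expected === 'string') {
    // String form: case-sensitive substring match.
    return output.includes(expected)
  }
  // Predicate form: the function must return true.
  return expected(output) === true
}
```

For example, `passes('returns', 'Our returns window is 30 days')` is `true`, while `passes('Returns', 'our returns policy')` is `false` because the match is case-sensitive.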
## EvalTestCase

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | `string` | Yes | Prompt sent to the agent |
| `expected` | `string \| (output: string) => boolean` | Yes | Acceptance criterion |
| `label` | `string` | No | Human-readable name shown in reports |
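Putting the fields together, a labeled case might look like the following. The inline `Case` type is a local structural sketch of the documented fields — in real code you would use the `EvalTestCase` type from `@agentskit/eval`:

```ts
// Local structural sketch of the documented EvalTestCase fields.
type Expected = string | ((output: string) => boolean)

interface Case {
  input: string
  expected: Expected
  label?: string // optional: shown in reports
}

const passwordCase: Case = {
  label: 'password reset mentions email',
  input: 'How do I reset my password?',
  expected: (output) => output.toLowerCase().includes('email'),
}
```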
## Metrics

`runEval` returns an `EvalReport` with the following fields:

| Field | Type | Description |
|---|---|---|
| `accuracy` | `number` | Fraction of cases that passed (0–1) |
| `passed` | `number` | Count of passing cases |
| `failed` | `number` | Count of failing cases |
| `avgLatencyMs` | `number` | Mean time per agent call |
| `totalTokens` | `number \| null` | Combined input + output tokens (`null` if not reported) |
| `cases` | `CaseResult[]` | Per-case breakdown |
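The aggregate fields relate to the per-case breakdown in a straightforward way. As a sketch — assuming each `CaseResult` exposes at least `passed` and `latencyMs`, as in the failing-case example under Error Handling; `summarize` is an illustrative helper, not part of the package:

```ts
// Assumed subset of the CaseResult shape, for illustration only.
interface CaseResultSketch {
  passed: boolean
  latencyMs: number
}

// How the aggregate metrics derive from the per-case breakdown.
function summarize(cases: CaseResultSketch[]) {
  const passed = cases.filter((c) => c.passed).length
  const failed = cases.length - passed
  const accuracy = cases.length === 0 ? 0 : passed / cases.length
  const avgLatencyMs =
    cases.length === 0
      ? 0
      : cases.reduce((sum, c) => sum + c.latencyMs, 0) / cases.length
  return { passed, failed, accuracy, avgLatencyMs }
}
```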
## Error Handling

By default, errors thrown by the agent are recorded and the case is marked as failed — the suite continues running. No single error aborts the whole run.

```ts
// A failing case looks like:
{
  input: 'crash prompt',
  passed: false,
  error: Error('rate limit exceeded'),
  latencyMs: 312,
}
```

Pass `{ throwOnError: true }` to halt on the first error instead:

```ts
const results = await runEval({ agent, suite, throwOnError: true })
```
## CI/CD Usage

Use the exit code to gate deployments. `runEval` throws if accuracy falls below a threshold:

```ts
const results = await runEval({
  agent,
  suite,
  minAccuracy: 0.9, // fails the process if accuracy < 90%
})
```
Example GitHub Actions step:

```yaml
- name: Run agent evals
  run: npx tsx evals/run.ts
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Keep eval suites small (10–50 cases) for fast CI feedback. Run larger regression suites on a schedule.
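A minimal `evals/run.ts` entrypoint for a step like the one above might look like this — a sketch using the documented `runEval`, `AgentFn`, and `EvalSuite` exports; the echo agent and one-case suite are placeholders for your own definitions:

```ts
import { runEval } from '@agentskit/eval'
import type { AgentFn, EvalSuite } from '@agentskit/eval'

// Placeholder agent and suite; replace with your real ones.
const agent: AgentFn = async (input) => `echo: ${input}`

const suite: EvalSuite = {
  name: 'smoke',
  cases: [{ input: 'ping', expected: 'echo' }],
}

const results = await runEval({
  agent,
  suite,
  minAccuracy: 0.9, // throws below 90%, so the CI step exits non-zero
})

console.log(`accuracy=${results.accuracy} tokens=${results.totalTokens}`)
```

Because `runEval` throws when accuracy drops below `minAccuracy`, no explicit `process.exit` call is needed: the unhandled error fails the step.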
## Troubleshooting

| Issue | Mitigation |
|---|---|
| Flaky substring matches | Prefer predicate `expected` functions; avoid over-specific quotes. |
| Null `totalTokens` | Return `tokenUsage` from your `AgentFn` when the adapter exposes usage. |
| CI timeouts | Reduce suite size, mock network tools, or use a faster model for smoke evals. |
## See also

Start here · Packages · TypeDoc (`@agentskit/eval`) · Observability · Sandbox · Runtime · `@agentskit/core`