# Eval

`@agentskit/eval` runs structured evaluation suites against your agents. Results include accuracy, latency, and token usage — suitable for CI/CD gates and regression tracking.
## When to use

- You have a stable `AgentFn` (string in → string or structured content out) and want regression metrics.
- You gate releases on `minAccuracy` or track token spend across cases.
## Installation

```bash
npm install @agentskit/eval
```
## Running an Eval

```ts
import { runEval } from '@agentskit/eval'

const results = await runEval({
  agent: myAgent,
  suite: mySuite,
})

console.log(results.accuracy)     // 0.92
console.log(results.avgLatencyMs) // 1240
console.log(results.totalTokens)  // 8432
```
## Defining a Suite

An `EvalSuite` groups related test cases under a name:

```ts
import type { EvalSuite } from '@agentskit/eval'

const mySuite: EvalSuite = {
  name: 'Customer support — basic queries',
  cases: [
    {
      input: 'What is your return policy?',
      expected: 'returns', // string: passes if output includes this substring
    },
    {
      input: 'How do I reset my password?',
      expected: (output) => output.toLowerCase().includes('email'),
    },
  ],
}
```
## The AgentFn Type

`runEval` accepts any function that matches `AgentFn`:

```ts
type AgentFnOutput = string | { content: string; tokenUsage?: TokenUsage }
type AgentFn = (input: string) => Promise<AgentFnOutput>
```

Return a plain string for simple cases. Return an object with `tokenUsage` to have token metrics included in the report:

```ts
const agent: AgentFn = async (input) => {
  const result = await myAgent.run(input)
  return {
    content: result.text,
    tokenUsage: {
      inputTokens: result.usage.input_tokens,
      outputTokens: result.usage.output_tokens,
    },
  }
}
```
## Expected Values

| Expected type | Pass condition |
|---|---|
| `string` | Output includes the expected string (case-sensitive) |
| `(output: string) => boolean` | Function returns `true` |
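The two pass conditions can be sketched as a single check. This is an illustrative reimplementation of the documented semantics, not the library's source — the `Expected` type and `passes` function below are local names for the sketch:

```ts
// Local sketch of the documented expected-value semantics (not library source).
type Expected = string | ((output: string) => boolean)

function passes(expected: Expected, output: string): boolean {
  if (typeof expected === 'string') {
    // String form: case-sensitive substring match.
    return output.includes(expected)
  }
  // Predicate form: the function must return true.
  return expected(output) === true
}
```

For example, `passes('returns', 'Our returns window is 30 days')` is `true`, while `passes('Returns', 'our returns policy')` is `false` because the match is case-sensitive.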
## EvalTestCase

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | `string` | Yes | Prompt sent to the agent |
| `expected` | `string \| (output: string) => boolean` | Yes | Acceptance criterion |
| `label` | `string` | No | Human-readable name shown in reports |
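Putting the fields together, a labeled case might look like the following. The inline `Case` type is a local structural sketch of the documented fields — in real code you would use the `EvalTestCase` type from `@agentskit/eval`:

```ts
// Local structural sketch of the documented EvalTestCase fields.
type Expected = string | ((output: string) => boolean)

interface Case {
  input: string
  expected: Expected
  label?: string // optional: shown in reports
}

const passwordCase: Case = {
  label: 'password reset mentions email',
  input: 'How do I reset my password?',
  expected: (output) => output.toLowerCase().includes('email'),
}
```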
## Metrics

`runEval` returns an `EvalReport` with the following fields:

| Field | Type | Description |
|---|---|---|
| `accuracy` | `number` | Fraction of cases that passed (0–1) |
| `passed` | `number` | Count of passing cases |
| `failed` | `number` | Count of failing cases |
| `avgLatencyMs` | `number` | Mean time per agent call |
| `totalTokens` | `number \| null` | Combined input + output tokens (`null` if not reported) |
| `cases` | `CaseResult[]` | Per-case breakdown |
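The aggregate fields relate to the per-case breakdown in a straightforward way. As a sketch — assuming each `CaseResult` exposes at least `passed` and `latencyMs`, as in the failing-case example under Error Handling; `summarize` is an illustrative helper, not part of the package:

```ts
// Assumed subset of the CaseResult shape, for illustration only.
interface CaseResultSketch {
  passed: boolean
  latencyMs: number
}

// How the aggregate metrics derive from the per-case breakdown.
function summarize(cases: CaseResultSketch[]) {
  const passed = cases.filter((c) => c.passed).length
  const failed = cases.length - passed
  const accuracy = cases.length === 0 ? 0 : passed / cases.length
  const avgLatencyMs =
    cases.length === 0
      ? 0
      : cases.reduce((sum, c) => sum + c.latencyMs, 0) / cases.length
  return { passed, failed, accuracy, avgLatencyMs }
}
```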
## Error Handling

By default, errors thrown by the agent are recorded and the case is marked as failed — the suite continues running. No single error aborts the whole run.

```ts
// A failing case looks like:
{
  input: 'crash prompt',
  passed: false,
  error: Error('rate limit exceeded'),
  latencyMs: 312,
}
```

Pass `{ throwOnError: true }` to halt on the first error instead:

```ts
const results = await runEval({ agent, suite, throwOnError: true })
```
## CI/CD Usage

Use the exit code to gate deployments. `runEval` throws if accuracy falls below a threshold:

```ts
const results = await runEval({
  agent,
  suite,
  minAccuracy: 0.9, // fails the process if accuracy < 90%
})
```
Example GitHub Actions step:

```yaml
- name: Run agent evals
  run: npx tsx evals/run.ts
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Keep eval suites small (10–50 cases) for fast CI feedback. Run larger regression suites on a schedule.
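A minimal `evals/run.ts` entrypoint for a step like the one above might look like this — a sketch using the documented `runEval`, `AgentFn`, and `EvalSuite` exports; the echo agent and one-case suite are placeholders for your own definitions:

```ts
import { runEval } from '@agentskit/eval'
import type { AgentFn, EvalSuite } from '@agentskit/eval'

// Placeholder agent and suite; replace with your real ones.
const agent: AgentFn = async (input) => `echo: ${input}`

const suite: EvalSuite = {
  name: 'smoke',
  cases: [{ input: 'ping', expected: 'echo' }],
}

const results = await runEval({
  agent,
  suite,
  minAccuracy: 0.9, // throws below 90%, so the CI step exits non-zero
})

console.log(`accuracy=${results.accuracy} tokens=${results.totalTokens}`)
```

Because `runEval` throws when accuracy drops below `minAccuracy`, no explicit `process.exit` call is needed: the unhandled error fails the step.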
## Troubleshooting

| Issue | Mitigation |
|---|---|
| Flaky substring matches | Prefer predicate `expected` functions; avoid over-specific quotes. |
| Null `totalTokens` | Return `tokenUsage` from your `AgentFn` when the adapter exposes usage. |
| CI timeouts | Reduce suite size, mock network tools, or use a faster model for smoke evals. |
## See also

Start here · Packages · TypeDoc (`@agentskit/eval`) · Observability · Sandbox · Runtime · `@agentskit/core`