Eval Runner
Benchmark your agents against test suites. Use @agentskit/eval to measure accuracy, latency, and cost.
Basic usage
import { createEvalRunner } from '@agentskit/eval'
import { createRuntime } from '@agentskit/runtime'

const runtime = createRuntime({ adapter: yourAdapter })
const runner = createEvalRunner({
  agent: (task) => runtime.run(task),
})

const results = await runner.run({
  name: 'QA accuracy',
  cases: [
    { input: 'What is 2+2?', expected: '4' },
    { input: 'Capital of France?', expected: (result) => result.includes('Paris') },
    { input: 'Translate "hello" to Spanish', expected: 'hola' },
  ],
})

console.log(`Accuracy: ${(results.accuracy * 100).toFixed(1)}%`)
console.log(`Passed: ${results.passed}/${results.totalCases}`)
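As the cases above show, `expected` can be a literal string or a predicate function. A minimal sketch of how such a matcher might work (the helper name `checkExpected` is hypothetical, and the library's real comparison may differ, e.g. by trimming or case-folding):

```typescript
// Hypothetical helper illustrating the two `expected` forms; not part of
// the @agentskit/eval API.
type Expected = string | ((result: string) => boolean)

function checkExpected(result: string, expected: Expected): boolean {
  // A function is treated as a predicate over the agent's output
  if (typeof expected === 'function') return expected(result)
  // A string is compared for exact equality
  return result === expected
}
```

Predicate matchers are useful when the output is free-form (as with the Paris example), while string matchers suit short, deterministic answers.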
Custom metrics
const results = await runner.run({
  name: 'Performance benchmark',
  cases: [
    { input: 'Summarize this article...', expected: (r) => r.length < 500 },
  ],
})

// Per-case results include latency and token usage
results.results.forEach((r) => {
  console.log(`${r.passed ? 'PASS' : 'FAIL'} | ${r.latencyMs}ms | ${r.input.slice(0, 40)}...`)
})
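Since each per-case result carries a `latencyMs` field, you can roll your own summary statistics on top of it. A small sketch (the `latencyStats` helper is illustrative, not a library export):

```typescript
// Aggregate per-case latencies into summary stats; assumes a numeric
// latencyMs on each result, as in the loop above.
function latencyStats(latencies: number[]): { mean: number; p95: number } {
  const sorted = [...latencies].sort((a, b) => a - b)
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length
  // Nearest-rank p95: index ceil(0.95 * n) - 1
  const p95 = sorted[Math.ceil(0.95 * sorted.length) - 1]
  return { mean, p95 }
}
```

For example, `latencyStats(results.results.map((r) => r.latencyMs))` would give a mean/p95 pair you can log next to accuracy.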
CI integration
# Run evals as part of CI. Note that plain `node` cannot execute .ts files on
# most Node versions; a runner such as tsx (or compiling first) is needed.
npx tsx eval.ts && echo "All evals passed"
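For the CI command above to fail the build, eval.ts itself must exit non-zero when the suite underperforms. One hedged way to do that is an explicit accuracy gate at the end of the script (the `exitCodeFor` helper and the 0.9 threshold are illustrative assumptions, not library behavior):

```typescript
// Hypothetical gate for eval.ts: map an accuracy score to a process exit
// code so CI fails when the suite drops below a chosen threshold.
function exitCodeFor(accuracy: number, threshold = 0.9): number {
  return accuracy >= threshold ? 0 : 1
}

// e.g. at the end of eval.ts, after `runner.run(...)`:
// process.exitCode = exitCodeFor(results.accuracy)
```

Setting `process.exitCode` (rather than calling `process.exit`) lets pending output flush before the process terminates.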