Releases: vercel-labs/agent-eval
@vercel/agent-eval@0.7.0
Minor Changes
- #73 be7ca15 Thanks @gaojude! - Add Cursor CLI agent with direct API and stream-json transcript support. Enables testing against Cursor models (default: composer-1.5) through direct API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.
- #71 8f198d4 Thanks @gaojude! - Add Gemini CLI agent with direct API and stream-json transcript support. Enables testing against Gemini models (default: gemini-3-pro-preview) through direct Google API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.
- #74 087415c Thanks @gaojude! - Add transcript parsers for Gemini and Cursor agents to the o11y module
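The JSONL transcripts mentioned in this release hold one JSON object per line. The following is only a minimal sketch of reading such a file, not the package's actual parser: the `TranscriptEvent` shape and the `parseJsonlTranscript` name are hypothetical, and the real Cursor and Gemini stream-json schemas are not described in these notes.

```typescript
// Hypothetical event shape: the only assumption is a string `type` field.
type TranscriptEvent = { type: string; [key: string]: unknown };

interface ParsedTranscript {
  events: TranscriptEvent[];
  skipped: number; // lines that were not valid events
}

// Parse a JSONL transcript: one JSON object per line, blank lines ignored.
function parseJsonlTranscript(raw: string): ParsedTranscript {
  const events: TranscriptEvent[] = [];
  let skipped = 0;

  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline

    try {
      const parsed: unknown = JSON.parse(trimmed);
      const hasType =
        typeof parsed === "object" &&
        parsed !== null &&
        typeof (parsed as { type?: unknown }).type === "string";
      if (hasType) {
        events.push(parsed as TranscriptEvent);
      } else {
        skipped++; // valid JSON, but not shaped like an event
      }
    } catch {
      skipped++; // malformed line: count it rather than abort the whole parse
    }
  }

  return { events, skipped };
}
```

Counting malformed lines instead of throwing is a design choice in this sketch, so that a single truncated line does not discard an otherwise usable transcript.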
@vercel/agent-eval@0.6.2
Patch Changes
- #69 93c1a63 Thanks @paoloricciuti! - fix: add all the files to track newly created files
@vercel/agent-eval@0.6.1
Patch Changes
- #64 f7b663a Thanks @paoloricciuti! - feat: add option to save the updated project inside results
@vercel/agent-eval@0.6.0
Minor Changes
- #65 cf50218 Thanks @gaojude! - Make classifier feature optional and add feature flag
  Features:
  - Added isClassifierEnabled() function to check if classifier is available (requires AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN)
  - Classifier is now optional: if neither env var is set, classification is skipped and all results are preserved
  - Warning message now displays when classifier is disabled, explaining why the keys are needed
  - Updated README to document classifier behavior and environment variable requirements
  Changes:
  - CLI skips entire classification block when classifier is disabled
  - Housekeeping no longer removes non-model failures when classifier is disabled (only removes incomplete/duplicate results)
  - All tests updated to properly enable classifier for tests that require it
  - Added test case for disabled classifier behavior
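The feature flag described in 0.6.0 can be pictured with the sketch below. Only the function name `isClassifierEnabled` and the two environment variable names come from the release notes. The `Env` type, the explicit `env` parameter, the `classifyIfEnabled` helper, and the choice to treat an empty string as unset are assumptions made for illustration, so the package's real signature may differ.

```typescript
// Assumption: the environment is passed in explicitly to keep the function pure.
type Env = Record<string, string | undefined>;

// The classifier is available when at least one credential is set.
// Assumption: an empty string counts as "not set".
function isClassifierEnabled(env: Env): boolean {
  return Boolean(env["AI_GATEWAY_API_KEY"] || env["VERCEL_OIDC_TOKEN"]);
}

// Hypothetical helper: skip classification entirely when the flag is off,
// so every result is preserved unchanged.
function classifyIfEnabled<T>(
  results: T[],
  env: Env,
  classify: (result: T) => T,
): T[] {
  if (!isClassifierEnabled(env)) {
    return results; // classifier disabled: nothing is filtered or relabeled
  }
  return results.map(classify);
}
```

In a Node.js process the caller would typically pass `process.env` as the `env` argument.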
@vercel/agent-eval@0.5.0
@vercel/agent-eval@0.4.1
Patch Changes
- #58 6cd92aa Thanks @allenzhou101! - Fix transcript parsing for Codex and OpenCode agents
  Codex:
  - Added support for item.started and item.completed event types from OpenAI Responses API
  - Now properly parses reasoning items as thinking blocks
  - Now properly parses command_execution items as shell tool calls with exit codes
  - Now properly parses agent_message items as assistant messages
  - Fixed critical bug in command_execution success logic: changed from OR (||) to AND (&&) so commands with non-zero exit codes are correctly marked as failed even when status is "completed"
  - Transcript parsing now correctly reports turn counts, tool calls, thinking blocks, and shell command results
  OpenCode:
  - Fixed exit code checking for bash commands - now correctly marks commands with non-zero exit codes as failed
  - Shell commands with exit code 127 (command not found) now properly show success: false instead of success: true
  Playground:
  - Updated shell command display to check success field first, then fall back to exit code
  - Added tooltip showing exit code on hover for shell commands
  Both parsers are model-agnostic and work consistently across all model variants using their respective APIs.
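The OR-to-AND fix for command_execution items in 0.4.1 can be illustrated as follows. This is a sketch under assumptions rather than the parser's actual code: the `CommandExecutionItem` interface and its `status` and `exit_code` field names are hypothetical stand-ins for whatever the Codex event payload really contains.

```typescript
// Hypothetical shape of a Codex command_execution item.
interface CommandExecutionItem {
  status: string;           // e.g. "completed"
  exit_code: number | null; // null if the command never produced an exit code
}

// Buggy variant (OR): a "completed" status alone marks the command as
// successful, even when the exit code is non-zero.
function commandSucceededBuggy(item: CommandExecutionItem): boolean {
  return item.status === "completed" || item.exit_code === 0;
}

// Fixed variant (AND): the command must have completed and exited with 0.
function commandSucceeded(item: CommandExecutionItem): boolean {
  return item.status === "completed" && item.exit_code === 0;
}
```

For an item with status "completed" and exit code 127, the OR variant reports success while the AND variant reports failure, which matches the behavior change the entry describes.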
@vercel/agent-eval@0.4.0
@vercel/agent-eval-playground@0.1.3
Patch Changes
- #58 e42dbf7 Thanks @allenzhou101! - Fix shell command success/failure display
  - Updated shell command badges to check success field first, then fall back to exitCode === 0
  - Added tooltip showing exit code on hover
  - Commands with non-zero exit codes now correctly display in red (destructive variant)
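The badge logic described above (prefer the success field, otherwise fall back to exitCode === 0) might look roughly like this. The `ShellCommandResult` type, the function names, and the "default" variant name are hypothetical. Only the success and exitCode fields and the "destructive" variant are mentioned in the release notes, and treating a result with neither field as a failure is an assumption.

```typescript
// Hypothetical view model for a shell command shown in the playground.
interface ShellCommandResult {
  success?: boolean; // explicit flag from the transcript parser, when present
  exitCode?: number; // process exit code, when known
}

// Prefer the explicit success flag; fall back to the exit code otherwise.
// Assumption: with neither field present the command is treated as failed.
function shellCommandSucceeded(result: ShellCommandResult): boolean {
  if (typeof result.success === "boolean") {
    return result.success;
  }
  return result.exitCode === 0;
}

// Map the outcome to a badge variant; "default" is a placeholder name.
function badgeVariant(result: ShellCommandResult): "default" | "destructive" {
  return shellCommandSucceeded(result) ? "default" : "destructive";
}
```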
@vercel/agent-eval@0.3.2
Patch Changes
- #49 465fbac Thanks @paoloricciuti! - fix: allow VERCEL_OIDC_TOKEN if AI_GATEWAY_API_KEY is not set
@vercel/agent-eval@0.3.1
Patch Changes
- #47 e10e69b Thanks @gaojude! - Fix fingerprint reuse: fingerprints are now persisted to summary.json so results can actually be reused across runs. Also fixes --dry to check reusability and report what would run, --smoke to always run fresh and skip housekeeping, and housekeeping to dedupe by fingerprint so results from different configs coexist.
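Deduplicating by fingerprint, as this entry describes for housekeeping, can be sketched as below. The `StoredResult` shape, the `finishedAt` timestamp, and the rule of keeping the most recent result per fingerprint are all assumptions for illustration. The release notes do not specify the summary.json schema or which duplicate is kept.

```typescript
// Hypothetical stored result; the real summary.json schema is not shown here.
interface StoredResult {
  fingerprint: string; // identifies the config that produced the result
  evalName: string;
  finishedAt: number;  // assumed timestamp used to pick the newest duplicate
}

// Keep one result per fingerprint (assumption: the most recent one wins).
// Results with different fingerprints are never merged, so different
// configs coexist in the output.
function dedupeByFingerprint(results: StoredResult[]): StoredResult[] {
  const latest = new Map<string, StoredResult>();
  for (const result of results) {
    const previous = latest.get(result.fingerprint);
    if (previous === undefined || result.finishedAt > previous.finishedAt) {
      latest.set(result.fingerprint, result);
    }
  }
  return Array.from(latest.values());
}
```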