Releases · vercel-labs/agent-eval · GitHub

12 Feb 21:57

@vercel/agent-eval@0.7.0 (Latest)

Minor Changes

#73 be7ca15 Thanks @gaojude! - Add Cursor CLI agent with direct API and stream-json transcript support. Enables testing against Cursor models (default: composer-1.5) through direct API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.
#71 8f198d4 Thanks @gaojude! - Add Gemini CLI agent with direct API and stream-json transcript support. Enables testing against Gemini models (default: gemini-3-pro-preview) through direct Google API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.
#74 087415c Thanks @gaojude! - Add transcript parsers for Gemini and Cursor agents to the o11y module
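
As a rough illustration of what consuming a JSONL transcript involves, the sketch below parses one JSON object per line. The `TranscriptEvent` shape and the `parseJsonl` helper are hypothetical; these notes do not describe the actual stream-json schemas or the o11y parser API.

```typescript
// Hypothetical event shape: the real stream-json schemas for the Gemini and
// Cursor CLIs are not described in these release notes.
type TranscriptEvent = { type: string; [key: string]: unknown };

// Parse JSONL text: one JSON object per non-empty line.
// Malformed lines are recorded (1-based line numbers) instead of thrown,
// so a single bad line does not discard the rest of the transcript.
function parseJsonl(text: string): { events: TranscriptEvent[]; badLines: number[] } {
  const events: TranscriptEvent[] = [];
  const badLines: number[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return;
    try {
      const value: unknown = JSON.parse(trimmed);
      const hasType =
        typeof value === "object" &&
        value !== null &&
        typeof (value as { type?: unknown }).type === "string";
      if (hasType) {
        events.push(value as TranscriptEvent);
      } else {
        badLines.push(index + 1);
      }
    } catch {
      badLines.push(index + 1);
    }
  });
  return { events, badLines };
}
```

Collecting bad lines rather than throwing is a design choice for this sketch only; a real parser may prefer to fail fast.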

12 Feb 15:51

@vercel/agent-eval@0.6.2

Patch Changes

#69 93c1a63 Thanks @paoloricciuti! - fix: add all the files to track newly created files

12 Feb 08:44

@vercel/agent-eval@0.6.1

Patch Changes

#64 f7b663a Thanks @paoloricciuti! - feat: add option to save the updated project inside results

11 Feb 20:29

@vercel/agent-eval@0.6.0

Minor Changes

#65 cf50218 Thanks @gaojude! - Make classifier feature optional and add feature flag

Features:
- Added isClassifierEnabled() function to check if classifier is available (requires AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN)
- Classifier is now optional: if neither env var is set, classification is skipped and all results are preserved
- Warning message now displays when classifier is disabled, explaining why the keys are needed
- Updated README to document classifier behavior and environment variable requirements
Changes:
- CLI skips entire classification block when classifier is disabled
- Housekeeping no longer removes non-model failures when classifier is disabled (only removes incomplete/duplicate results)
- All tests updated to properly enable classifier for tests that require it
- Added test case for disabled classifier behavior
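
The gating described above can be sketched roughly as follows. Only the name `isClassifierEnabled()` and the two environment variable names come from these notes; the signature taking an `env` argument, the `EvalResult` type, the `classifyIfEnabled` helper, and the treatment of empty strings as unset are assumptions made for illustration.

```typescript
type Env = Record<string, string | undefined>;

// Sketch of the availability check. Assumption: an empty string counts as unset.
function isClassifierEnabledSketch(env: Env): boolean {
  return Boolean(env.AI_GATEWAY_API_KEY) || Boolean(env.VERCEL_OIDC_TOKEN);
}

// Hypothetical result shape, for illustration only.
type EvalResult = { id: string; status: "passed" | "failed" };

// Hypothetical caller: when the classifier is unavailable, warn and keep every
// result; otherwise keep passes and the failures the classifier attributes to
// the model (isModelFailure is a stand-in for the real classifier).
function classifyIfEnabled(
  results: EvalResult[],
  env: Env,
  isModelFailure: (r: EvalResult) => boolean,
  warn: (msg: string) => void
): EvalResult[] {
  if (!isClassifierEnabledSketch(env)) {
    warn("Classifier disabled: set AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN to enable it.");
    return results;
  }
  return results.filter((r) => r.status === "passed" || isModelFailure(r));
}
```

Passing the environment in as an argument, rather than reading a global, keeps the sketch easy to test with and without keys.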

10 Feb 23:07

@vercel/agent-eval@0.5.0

Minor Changes

#63 bc5114c Thanks @gaojude! - Add live terminal dashboard for parallel experiment runs

Patch Changes

#61 b846fc7 Thanks @paoloricciuti! - fix: allow user-defined tests in verifyNoTestFiles

09 Feb 22:12

@vercel/agent-eval@0.4.1

Patch Changes

#58 6cd92aa Thanks @allenzhou101! - Fix transcript parsing for Codex and OpenCode agents

Codex:
- Added support for item.started and item.completed event types from OpenAI Responses API
- Now properly parses reasoning items as thinking blocks
- Now properly parses command_execution items as shell tool calls with exit codes
- Now properly parses agent_message items as assistant messages
- Fixed critical bug in command_execution success logic: changed from OR (||) to AND (&&) so commands with non-zero exit codes are correctly marked as failed even when status is "completed"
- Transcript parsing now correctly reports turn counts, tool calls, thinking blocks, and shell command results
OpenCode:
- Fixed exit code checking for bash commands - now correctly marks commands with non-zero exit codes as failed
- Shell commands with exit code 127 (command not found) now properly show success: false instead of success: true
Playground:
- Updated shell command display to check success field first, then fall back to exit code
- Added tooltip showing exit code on hover for shell commands
Both parsers are model-agnostic and work consistently across all model variants using their respective APIs.
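
To see why the OR-to-AND change in the Codex parser matters, consider a command that finishes with a non-zero exit code. The sketch below contrasts the two conditions; the `CommandExecution` shape and its field names are assumptions, not the parser's actual types.

```typescript
// Hypothetical item shape; field names are assumed for illustration.
type CommandExecution = { status: string; exitCode: number | null };

// Old behavior as described in the notes: either condition counts as success,
// so a "completed" command with a non-zero exit code is wrongly marked successful.
function isSuccessWithOr(item: CommandExecution): boolean {
  return item.status === "completed" || item.exitCode === 0;
}

// Fixed behavior: the command must have completed AND exited with code 0.
function isSuccessWithAnd(item: CommandExecution): boolean {
  return item.status === "completed" && item.exitCode === 0;
}

// A command that ran to completion but was not found by the shell.
const commandNotFound: CommandExecution = { status: "completed", exitCode: 127 };
const oldVerdict = isSuccessWithOr(commandNotFound); // true (incorrect)
const newVerdict = isSuccessWithAnd(commandNotFound); // false (correct)
```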

09 Feb 19:29

@vercel/agent-eval@0.4.0

Minor Changes

#56 5e45159 Thanks @gaojude! - Support reasoning effort via model string query params for Codex (e.g. gpt-5.3-codex?reasoningEffort=high), install CA certificates in Docker sandbox, retry npm install once on failure, and exclude smoke test results from fingerprint-based reuse.
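
A model string such as `gpt-5.3-codex?reasoningEffort=high` can be split into a model name and its options along these lines. This is a minimal sketch: `parseModelString` is a hypothetical helper, and the framework's real parsing rules (for example, how it handles unknown or repeated keys) are not stated in these notes.

```typescript
// Hypothetical helper: split "name?key=value&..." into a name and a params map.
// Assumption: a repeated key keeps its last value.
function parseModelString(model: string): { name: string; params: Record<string, string> } {
  const queryStart = model.indexOf("?");
  if (queryStart === -1) {
    return { name: model, params: {} };
  }
  const name = model.slice(0, queryStart);
  const params: Record<string, string> = {};
  for (const pair of model.slice(queryStart + 1).split("&")) {
    if (pair === "") continue;
    const eq = pair.indexOf("=");
    const key = decodeURIComponent(eq === -1 ? pair : pair.slice(0, eq));
    const value = eq === -1 ? "" : decodeURIComponent(pair.slice(eq + 1));
    params[key] = value;
  }
  return { name, params };
}

const parsedModel = parseModelString("gpt-5.3-codex?reasoningEffort=high");
// parsedModel.name is "gpt-5.3-codex"; parsedModel.params.reasoningEffort is "high"
```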

09 Feb 22:12

@vercel/agent-eval-playground@0.1.3

Patch Changes

#58 e42dbf7 Thanks @allenzhou101! - Fix shell command success/failure display
- Updated shell command badges to check success field first, then fall back to exitCode === 0
- Added tooltip showing exit code on hover
- Commands with non-zero exit codes now correctly display in red (destructive variant)
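
The badge logic described above might look roughly like this. The `ShellResult` shape and both helper names are hypothetical, and `"default"` as the non-failure variant name is an assumption; only the `success` field, the `exitCode === 0` fallback, and the destructive variant come from these notes.

```typescript
// Hypothetical shape: either field may be absent in older transcripts.
type ShellResult = { success?: boolean; exitCode?: number };

// Prefer the explicit success field; fall back to exitCode === 0 when it is missing.
function shellSucceeded(result: ShellResult): boolean {
  if (typeof result.success === "boolean") {
    return result.success;
  }
  return result.exitCode === 0;
}

// Map the verdict to a badge variant ("default" is an assumed name).
function badgeVariant(result: ShellResult): "default" | "destructive" {
  return shellSucceeded(result) ? "default" : "destructive";
}
```

Note that in this sketch a result with neither field set is treated as a failure, since `undefined === 0` is false; the playground may handle that case differently.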

08 Feb 21:28

@vercel/agent-eval@0.3.2

Patch Changes

#49 465fbac Thanks @paoloricciuti! - fix: allow VERCEL_OIDC_TOKEN if AI_GATEWAY_API_KEY is not set

08 Feb 06:06

@vercel/agent-eval@0.3.1

Patch Changes

#47 e10e69b Thanks @gaojude! - Fix fingerprint reuse: fingerprints are now persisted to summary.json so results can actually be reused across runs. Also fixes --dry to check reusability and report what would run, --smoke to always run fresh and skip housekeeping, and housekeeping to dedupe by fingerprint so results from different configs coexist.
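
Deduplicating by fingerprint, so that results from different configs coexist, can be sketched as below. The `RunResult` shape, the `finishedAt` field, and the rule of keeping the most recent result per fingerprint are all assumptions; these notes do not say which duplicate housekeeping keeps or how summary.json is laid out.

```typescript
// Hypothetical result record; the real summary.json layout is not described here.
type RunResult = { fingerprint: string; finishedAt: number };

// Keep one result per fingerprint. Assumption: the most recent one wins.
// Results with different fingerprints (different configs) are all retained.
function dedupeByFingerprint(results: RunResult[]): RunResult[] {
  const latest: Record<string, RunResult> = {};
  for (const result of results) {
    const previous = latest[result.fingerprint];
    if (previous === undefined || result.finishedAt > previous.finishedAt) {
      latest[result.fingerprint] = result;
    }
  }
  return Object.keys(latest).map((fingerprint) => latest[fingerprint]);
}
```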
