Skip to content

Releases: vercel-labs/agent-eval

@vercel/agent-eval@0.7.0

12 Feb 21:57
a784dcf

Choose a tag to compare

Minor Changes

  • #73 be7ca15 Thanks @gaojude! - Add Cursor CLI agent with direct API and stream-json transcript support. Enables testing against Cursor models (default: composer-1.5) through direct API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.

  • #71 8f198d4 Thanks @gaojude! - Add Gemini CLI agent with direct API and stream-json transcript support. Enables testing against Gemini models (default: gemini-3-pro-preview) through direct Google API access. The agent captures detailed execution transcripts in JSONL format and is fully integrated with the eval framework sandbox infrastructure.

  • #74 087415c Thanks @gaojude! - Add transcript parsers for Gemini and Cursor agents to the o11y module

@vercel/agent-eval@0.6.2

12 Feb 15:51
0b7bb88

Choose a tag to compare

Patch Changes

@vercel/agent-eval@0.6.1

12 Feb 08:44
d37b219

Choose a tag to compare

Patch Changes

@vercel/agent-eval@0.6.0

11 Feb 20:29
68adf65

Choose a tag to compare

Minor Changes

  • #65 cf50218 Thanks @gaojude! - Make classifier feature optional and add feature flag

    Features:

    • Added isClassifierEnabled() function to check if classifier is available (requires AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN)
    • Classifier is now optional: if neither env var is set, classification is skipped and all results are preserved
    • Warning message now displays when classifier is disabled, explaining why the keys are needed
    • Updated README to document classifier behavior and environment variable requirements

    Changes:

    • CLI skips entire classification block when classifier is disabled
    • Housekeeping no longer removes non-model failures when classifier is disabled (only removes incomplete/duplicate results)
    • All tests updated to properly enable classifier for tests that require it
    • Added test case for disabled classifier behavior

@vercel/agent-eval@0.5.0

10 Feb 23:07
4f9d492

Choose a tag to compare

Minor Changes

Patch Changes

@vercel/agent-eval@0.4.1

09 Feb 22:12
a0cc76e

Choose a tag to compare

Patch Changes

  • #58 6cd92aa Thanks @allenzhou101! - Fix transcript parsing for Codex and OpenCode agents

    Codex:

    • Added support for item.started and item.completed event types from OpenAI Responses API
    • Now properly parses reasoning items as thinking blocks
    • Now properly parses command_execution items as shell tool calls with exit codes
    • Now properly parses agent_message items as assistant messages
    • Fixed critical bug in command_execution success logic: changed from OR (||) to AND (&&) so commands with non-zero exit codes are correctly marked as failed even when status is "completed"
    • Transcript parsing now correctly reports turn counts, tool calls, thinking blocks, and shell command results

    OpenCode:

    • Fixed exit code checking for bash commands - now correctly marks commands with non-zero exit codes as failed
    • Shell commands with exit code 127 (command not found) now properly show success: false instead of success: true

    Playground:

    • Updated shell command display to check success field first, then fall back to exit code
    • Added tooltip showing exit code on hover for shell commands

    Both parsers are model-agnostic and work consistently across all model variants using their respective APIs.

@vercel/agent-eval@0.4.0

09 Feb 19:29
bf99e42

Choose a tag to compare

Minor Changes

  • #56 5e45159 Thanks @gaojude! - Support reasoning effort via model string query params for Codex (e.g. gpt-5.3-codex?reasoningEffort=high), install CA certificates in Docker sandbox, retry npm install once on failure, and exclude smoke test results from fingerprint-based reuse.

@vercel/agent-eval-playground@0.1.3

09 Feb 22:12
a0cc76e

Choose a tag to compare

Patch Changes

  • #58 e42dbf7 Thanks @allenzhou101! - Fix shell command success/failure display

    • Updated shell command badges to check success field first, then fall back to exitCode === 0
    • Added tooltip showing exit code on hover
    • Commands with non-zero exit codes now correctly display in red (destructive variant)

@vercel/agent-eval@0.3.2

08 Feb 21:28
89ab4b7

Choose a tag to compare

Patch Changes

@vercel/agent-eval@0.3.1

08 Feb 06:06
cfa2555

Choose a tag to compare

Patch Changes

  • #47 e10e69b Thanks @gaojude! - Fix fingerprint reuse: fingerprints are now persisted to summary.json so results can actually be reused across runs. Also fixes --dry to check reusability and report what would run, --smoke to always run fresh and skip housekeeping, and housekeeping to dedupe by fingerprint so results from different configs coexist.