Fixing bug with multiple providers + stats for multiple runs #752

Open
jottakka wants to merge 7 commits into main from francisco/adding-stats-evals

@jottakka (Contributor) commented Jan 23, 2026

@EricGustin you can use this CLI command:

uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
    -p openai:gpt-4o,gpt-4o-mini \
    -p anthropic:claude-sonnet-4-20250514 \
    -k openai:$OPENAI_API_KEY \
    -k anthropic:$ANTHROPIC_API_KEY \
    -d \
    --num-runs 3 \
    --seed random \
    --multi-run-pass-rule majority \
    --max-concurrent 6 \
    -o mcp_building_evals_results/results


Note

Medium Risk
Touches core eval/capture execution and result aggregation, plus all formatter outputs; risk is mainly correctness/regression in scoring/serialization and the expanded HTML/JS rendering path.

Overview
Adds first-class multi-run support to arcade evals and capture mode, including new CLI flags --num-runs, --seed, and --multi-run-pass-rule, and fixes provider selection to accept repeated --use-provider entries (enabling multi-provider runs as documented).
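As a rough illustration of what accepting repeated provider entries can look like, the sketch below merges several `provider:model1,model2` strings into one mapping. The function name `parse_provider_specs` and the merge behavior are assumptions for illustration only; they are not taken from this PR's code.

```python
def parse_provider_specs(specs: list[str]) -> dict[str, list[str]]:
    """Merge repeated 'provider:model1,model2' entries into one mapping.

    Hypothetical helper: the real CLI parsing in this PR may differ.
    """
    providers: dict[str, list[str]] = {}
    for spec in specs:
        name, _, models = spec.partition(":")
        name = name.strip().lower()
        if not name:
            raise ValueError(f"empty provider name in {spec!r}")
        # Repeated entries for the same provider extend its model list.
        bucket = providers.setdefault(name, [])
        for model in (m.strip() for m in models.split(",")):
            if model and model not in bucket:
                bucket.append(model)
    return providers


parsed = parse_provider_specs(
    ["openai:gpt-4o,gpt-4o-mini", "anthropic:claude-sonnet-4-20250514"]
)
```

With the two entries from the command above, `parsed` maps `openai` to two models and `anthropic` to one, so each repeated flag contributes its own provider instead of the last one winning.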

Evaluation execution now reruns each case N times (with optional OpenAI seeding), aggregates outcomes via last/mean/majority, and emits per-run/aggregate statistics (run_stats) plus aggregated per-critic-field variance (critic_stats). All output formatters (text/markdown/html/json) are updated to display these stats, with HTML gaining per-run tabs and summary visuals.
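To make the last/mean/majority rules concrete, here is a minimal sketch of one way per-run outcomes could be combined. The function `aggregate_runs`, the `pass_threshold` parameter, the returned keys, and the exact semantics of each rule are assumptions; the PR's own implementation is not shown in this conversation and may differ.

```python
from statistics import mean, pstdev


def aggregate_runs(
    scores: list[float],
    passed: list[bool],
    rule: str = "last",
    pass_threshold: float = 0.8,  # hypothetical threshold for the "mean" rule
) -> dict[str, object]:
    """Combine per-run results into one verdict plus simple run statistics."""
    if not scores or len(scores) != len(passed):
        raise ValueError("scores and passed must be non-empty and equal length")

    if rule == "last":
        final_pass = passed[-1]  # only the final run decides
    elif rule == "mean":
        final_pass = mean(scores) >= pass_threshold  # average score decides
    elif rule == "majority":
        final_pass = sum(passed) > len(passed) / 2  # strict majority of runs
    else:
        raise ValueError(f"unknown rule: {rule!r}")

    return {
        "passed": final_pass,
        "mean_score": mean(scores),
        "std_dev": pstdev(scores),  # population standard deviation
        "min_score": min(scores),
        "max_score": max(scores),
    }


result = aggregate_runs([1.0, 0.25, 1.0], [True, False, True], rule="majority")
```

In this sketch the same three runs pass under `majority` and `last` but fail under `mean`, since the average score of 0.75 is below the assumed 0.8 threshold. That difference is why the rule is worth exposing as a flag.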

Capture results gain a CapturedRun/runs structure so multi-run tool-call recordings are preserved and rendered across formats, and tests are expanded/adjusted accordingly (including XSS expectations due to new embedded JS).
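A simplified picture of a per-run capture structure might look like the sketch below. Only the names `CapturedRun`, `runs`, `tool_calls`, `name`, and `args` appear in this PR page; the `ToolCall` class name, the field defaults, and the `count_tool_calls` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:  # hypothetical class name
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class CapturedRun:
    # One entry per tool call recorded during a single run.
    tool_calls: list[ToolCall] = field(default_factory=list)


def count_tool_calls(runs: list[CapturedRun]) -> int:
    """Total tool calls across all runs (hypothetical helper)."""
    return sum(len(run.tool_calls) for run in runs)


runs = [
    CapturedRun([ToolCall("search", {"query": "weather"})]),
    CapturedRun([ToolCall("search", {"query": "weather"}), ToolCall("summarize")]),
]
total_calls = count_tool_calls(runs)
```

Keeping each run as its own object lets a formatter show runs side by side rather than flattening every tool call into one list.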

Written by Cursor Bugbot for commit ef3a9d3. This will update automatically on new commits.

@jottakka jottakka self-assigned this Feb 5, 2026
codecov bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 79.33248% with 161 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| libs/arcade-cli/arcade_cli/formatters/markdown.py | 74.87% | 52 Missing ⚠️ |
| libs/arcade-cli/arcade_cli/formatters/html.py | 78.86% | 41 Missing ⚠️ |
| libs/arcade-cli/arcade_cli/formatters/text.py | 76.63% | 25 Missing ⚠️ |
| libs/arcade-evals/arcade_evals/eval.py | 86.82% | 17 Missing ⚠️ |
| libs/arcade-cli/arcade_cli/formatters/json.py | 70.58% | 10 Missing ⚠️ |
| libs/arcade-cli/arcade_cli/main.py | 75.00% | 5 Missing ⚠️ |
| .../arcade_evals/_evalsuite/_comparative_execution.py | 61.53% | 5 Missing ⚠️ |
| libs/arcade-cli/arcade_cli/formatters/base.py | 81.81% | 4 Missing ⚠️ |
| ...s/arcade-evals/arcade_evals/_evalsuite/_capture.py | 90.47% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| libs/arcade-cli/arcade_cli/evals_runner.py | 83.52% <ø> (ø) |
| libs/arcade-evals/arcade_evals/__init__.py | 100.00% <100.00%> (ø) |
| ...cade-evals/arcade_evals/_evalsuite/_comparative.py | 100.00% <100.00%> (ø) |
| ...ibs/arcade-evals/arcade_evals/_evalsuite/_types.py | 100.00% <100.00%> (ø) |
| libs/arcade-evals/arcade_evals/capture.py | 92.94% <100.00%> (+0.73%) ⬆️ |
| ...s/arcade-evals/arcade_evals/_evalsuite/_capture.py | 80.32% <90.47%> (+2.06%) ⬆️ |
| libs/arcade-cli/arcade_cli/formatters/base.py | 93.43% <81.81%> (-0.92%) ⬇️ |
| libs/arcade-cli/arcade_cli/main.py | 30.50% <75.00%> (+4.77%) ⬆️ |
| .../arcade_evals/_evalsuite/_comparative_execution.py | 93.15% <61.53%> (-6.85%) ⬇️ |
| libs/arcade-cli/arcade_cli/formatters/json.py | 78.45% <70.58%> (-0.40%) ⬇️ |

... and 4 more

... and 2 files with indirect coverage changes

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@jottakka jottakka requested a review from EricGustin February 5, 2026 21:55
@EricGustin (Member) left a comment

needs a version bump

Comment on lines +946 to +956
if runs:
    for run_index, run in enumerate(runs, start=1):
        lines.append(f" Run {run_index}:")
        if run.tool_calls:
            for tc in run.tool_calls:
                total_calls += 1
                lines.append(f" - {tc.name}")
                if tc.args:
                    for key, value in tc.args.items():
                        lines.append(
                            f" {key}: {self._format_value(value)}"

this file has some wild nesting. I'm scared to read it
