Fixing bug with multiple providers + stats for multiple runs by jottakka · Pull Request #752 · ArcadeAI/arcade-mcp

jottakka · 2026-01-23T18:58:32Z

@EricGustin you can use this CLI command:

uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
    -p openai:gpt-4o,gpt-4o-mini \
    -p anthropic:claude-sonnet-4-20250514 \
    -k openai:$OPENAI_API_KEY \
    -k anthropic:$ANTHROPIC_API_KEY \
    -d \
    --num-runs 3 \
    --seed random \
    --multi-run-pass-rule majority \
    --max-concurrent 6 \
    -o mcp_building_evals_results/results

Note

Medium Risk
Touches core eval/capture execution and result aggregation, plus all formatter outputs; risk is mainly correctness/regression in scoring/serialization and the expanded HTML/JS rendering path.

Overview
Adds first-class multi-run support to arcade evals and capture mode, including new CLI flags --num-runs, --seed, and --multi-run-pass-rule, and fixes provider selection to accept repeated --use-provider entries (enabling multi-provider runs as documented).
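As a rough illustration of what accepting repeated --use-provider entries involves, here is a minimal sketch using argparse. It is not the actual arcade CLI code: the parser, the helper names (`parse_provider_spec`, `collect_providers`), and the assumption that `-p` is the short form of `--use-provider` are all hypothetical. Only the `provider:model1,model2` spec shape is taken from the command above.

```python
import argparse


def parse_provider_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'provider:model1,model2' into (provider, [models])."""
    provider, _, models = spec.partition(":")
    provider = provider.strip()
    if not provider:
        raise ValueError(f"missing provider name in {spec!r}")
    return provider, [m.strip() for m in models.split(",") if m.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; the real CLI may be built differently.
    parser = argparse.ArgumentParser(prog="evals-sketch")
    # action="append" keeps every occurrence of the flag instead of only the last.
    parser.add_argument(
        "-p", "--use-provider", action="append", default=None, dest="providers"
    )
    return parser


def collect_providers(argv: list[str]) -> dict[str, list[str]]:
    """Merge all repeated provider flags into one provider -> models mapping."""
    args = build_parser().parse_args(argv)
    merged: dict[str, list[str]] = {}
    for spec in args.providers or []:
        provider, models = parse_provider_spec(spec)
        merged.setdefault(provider, []).extend(models)
    return merged
```

With a single-value option, a second `-p` would silently overwrite the first, which is the kind of bug a repeated-flag fix addresses.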

Evaluation execution now reruns each case N times (with optional OpenAI seeding), aggregates outcomes via last/mean/majority, and emits per-run/aggregate statistics (run_stats) plus aggregated per-critic-field variance (critic_stats). All output formatters (text/markdown/html/json) are updated to display these stats, with HTML gaining per-run tabs and summary visuals.
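The three pass rules can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `RunResult`, `aggregate_passed`, `summarize_runs`, and the `pass_threshold` parameter are hypothetical names, "mean" is assumed to compare the average score against a threshold, and "majority" is assumed to mean a strict majority of passing runs.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RunResult:
    """Hypothetical per-run outcome: a score in [0, 1] and a pass flag."""
    score: float
    passed: bool


def aggregate_passed(runs: list[RunResult], rule: str, pass_threshold: float = 0.8) -> bool:
    """Combine N run outcomes into one pass/fail verdict."""
    if not runs:
        raise ValueError("at least one run is required")
    if rule == "last":
        # Only the final run decides.
        return runs[-1].passed
    if rule == "mean":
        # Assumption: the mean score is compared against a pass threshold.
        return mean(r.score for r in runs) >= pass_threshold
    if rule == "majority":
        # Assumption: strict majority, so a tie counts as a failure.
        return 2 * sum(r.passed for r in runs) > len(runs)
    raise ValueError(f"unknown rule: {rule!r}")


def summarize_runs(runs: list[RunResult]) -> dict[str, float]:
    """Per-case statistics across runs (field names are illustrative)."""
    scores = [r.score for r in runs]
    return {
        "num_runs": len(scores),
        "mean_score": mean(scores),
        "std_dev": pstdev(scores),  # population standard deviation
        "pass_rate": sum(r.passed for r in runs) / len(runs),
    }
```

The choice of rule matters most for flaky cases: with two passes and one failure, "majority" passes, while "last" depends entirely on which run came last.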

Capture results gain a CapturedRun/runs structure so multi-run tool-call recordings are preserved and rendered across formats, and tests are expanded/adjusted accordingly (including XSS expectations due to new embedded JS).
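One way to picture the CapturedRun/runs structure is the dataclass sketch below. Only the names `CapturedRun`, `runs`, `tool_calls`, `name`, and `args` appear in this PR; `CapturedToolCall`, `CapturedCase`, `case_name`, and the methods are hypothetical and the real types may carry more fields.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class CapturedToolCall:
    """Hypothetical record of one tool call: its name and arguments."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class CapturedRun:
    """All tool calls recorded during a single run of a case."""
    tool_calls: list[CapturedToolCall] = field(default_factory=list)


@dataclass
class CapturedCase:
    """Hypothetical container: one case with one entry per run."""
    case_name: str
    runs: list[CapturedRun] = field(default_factory=list)

    def total_tool_calls(self) -> int:
        return sum(len(run.tool_calls) for run in self.runs)

    def to_dict(self) -> dict[str, Any]:
        # asdict recurses into nested dataclasses, lists, and dicts.
        return asdict(self)
```

Keeping each run as its own object, rather than flattening all tool calls into one list, is what lets formatters render per-run sections.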

Written by Cursor Bugbot for commit ef3a9d3. This will update automatically on new commits.

libs/arcade-evals/arcade_evals/_evalsuite/_capture.py

libs/arcade-cli/arcade_cli/formatters/markdown.py

libs/arcade-cli/arcade_cli/formatters/base.py

codecov · 2026-02-05T20:37:55Z

Codecov Report

❌ Patch coverage is 79.33248% with 161 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
libs/arcade-cli/arcade_cli/formatters/markdown.py	74.87%	52 Missing ⚠️
libs/arcade-cli/arcade_cli/formatters/html.py	78.86%	41 Missing ⚠️
libs/arcade-cli/arcade_cli/formatters/text.py	76.63%	25 Missing ⚠️
libs/arcade-evals/arcade_evals/eval.py	86.82%	17 Missing ⚠️
libs/arcade-cli/arcade_cli/formatters/json.py	70.58%	10 Missing ⚠️
libs/arcade-cli/arcade_cli/main.py	75.00%	5 Missing ⚠️
.../arcade_evals/_evalsuite/_comparative_execution.py	61.53%	5 Missing ⚠️
libs/arcade-cli/arcade_cli/formatters/base.py	81.81%	4 Missing ⚠️
...s/arcade-evals/arcade_evals/_evalsuite/_capture.py	90.47%	2 Missing ⚠️

Files with missing lines	Coverage Δ
libs/arcade-cli/arcade_cli/evals_runner.py	`83.52% <ø> (ø)`
libs/arcade-evals/arcade_evals/__init__.py	`100.00% <100.00%> (ø)`
...cade-evals/arcade_evals/_evalsuite/_comparative.py	`100.00% <100.00%> (ø)`
...ibs/arcade-evals/arcade_evals/_evalsuite/_types.py	`100.00% <100.00%> (ø)`
libs/arcade-evals/arcade_evals/capture.py	`92.94% <100.00%> (+0.73%)`	⬆️
...s/arcade-evals/arcade_evals/_evalsuite/_capture.py	`80.32% <90.47%> (+2.06%)`	⬆️
libs/arcade-cli/arcade_cli/formatters/base.py	`93.43% <81.81%> (-0.92%)`	⬇️
libs/arcade-cli/arcade_cli/main.py	`30.50% <75.00%> (+4.77%)`	⬆️
.../arcade_evals/_evalsuite/_comparative_execution.py	`93.15% <61.53%> (-6.85%)`	⬇️
libs/arcade-cli/arcade_cli/formatters/json.py	`78.45% <70.58%> (-0.40%)`	⬇️
... and 4 more

... and 2 files with indirect coverage changes


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


libs/arcade-cli/arcade_cli/formatters/markdown.py

EricGustin

needs a version bump

EricGustin · 2026-02-06T20:59:11Z

libs/arcade-cli/arcade_cli/formatters/text.py

+                if runs:
+                    for run_index, run in enumerate(runs, start=1):
+                        lines.append(f"    Run {run_index}:")
+                        if run.tool_calls:
+                            for tc in run.tool_calls:
+                                total_calls += 1
+                                lines.append(f"      - {tc.name}")
+                                if tc.args:
+                                    for key, value in tc.args.items():
+                                        lines.append(
+                                            f"          {key}: {self._format_value(value)}"


this file has some wild nesting. I'm scared to read it
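One possible way to reduce the nesting in the hunk above is to split it into small generator helpers, one per level. This is only a sketch: the helper names are hypothetical, `format_value` stands in for `self._format_value` (whose behavior is not shown here), and the indentation strings are copied from the diff.

```python
from typing import Any, Iterable, Iterator


def format_value(value: Any) -> str:
    # Stand-in for the formatter's own _format_value method.
    return repr(value)


def iter_tool_call_lines(tc: Any) -> Iterator[str]:
    """Yield the lines for one tool call and its arguments."""
    yield f"      - {tc.name}"
    for key, value in (tc.args or {}).items():
        yield f"          {key}: {format_value(value)}"


def iter_run_lines(run_index: int, run: Any) -> Iterator[str]:
    """Yield the header line for a run, then the lines for each tool call."""
    yield f"    Run {run_index}:"
    for tc in run.tool_calls or []:
        yield from iter_tool_call_lines(tc)


def format_runs(runs: Iterable[Any]) -> tuple[list[str], int]:
    """Return (lines, total_calls) for all runs, with at most two nesting levels."""
    lines: list[str] = []
    total_calls = 0
    for run_index, run in enumerate(runs, start=1):
        lines.extend(iter_run_lines(run_index, run))
        total_calls += len(run.tool_calls or [])
    return lines, total_calls
```

Each helper handles one level of the structure, so the empty-collection checks become `or []` / `or {}` fallbacks instead of extra `if` blocks.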

jottakka added 2 commits January 23, 2026 15:58

Fixing bug with multiple providers + stats for multiple runs

e929566

Fixing bug with multiple providers + stats for multiple runs

ab7ad3a

jottakka self-assigned this Feb 5, 2026

jottakka added 2 commits February 5, 2026 17:15

Fixing bug with multiple providers + stats for multiple runs

37a72f3

fixing failing checks

39d972f

cursor bot reviewed Feb 5, 2026


tryingfixing checks

dc1bfb4

cursor bot reviewed Feb 5, 2026


libs/arcade-cli/arcade_cli/formatters/markdown.py (resolved)

some changes after cursor review

8c27fbb

jottakka requested a review from EricGustin February 5, 2026 21:55

tfixing issue witnh checks

ef3a9d3

EricGustin approved these changes Feb 6, 2026



jottakka wants to merge 7 commits into main from francisco/adding-stats-evals
