Add areal-sea leaderboard submission #152

Open
GurrenLagann97 wants to merge 3 commits into sierra-research:main from GurrenLagann97:submission-areal-eigen

Conversation


@GurrenLagann97 GurrenLagann97 commented Feb 2, 2026

Summary

Adding evaluation results for areal-eigen model to the leaderboard.

Results (4 trials per domain)

| Domain  | Pass@1 | Pass@2 | Pass@3 | Pass@4 |
|---------|--------|--------|--------|--------|
| Airline | 71.0%  | 68.0%  | 66.5%  | 66.0%  |
| Retail  | 84.2%  | 75.2%  | 69.1%  | 64.9%  |
| Telecom | 94.1%  | 89.3%  | 85.5%  | 82.5%  |

Submission Details

  • Model: areal-sea
  • Model Organization: Tsinghua & eigen
  • Submitting Organization: Tsinghua & eigen
  • User Simulator: openai/gpt-4.1-2025-04-14
  • Trials per domain: 4
  • Framework version: tau2-bench 0.2.1-dev

Verification

  • No modifications to prompts
  • No tasks omitted
  • Standard tau2-bench evaluation protocol

@victorb-sierra
Collaborator

Thanks for submitting to the tau2-bench leaderboard! The A-ReaL paper is a really interesting approach to post-training agents with verifiable rewards.

Before we can merge, we have a couple of questions/requests:

1. Airline task configuration

During validation, we noticed that the airline trajectory has reward_basis: ['DB'] while retail has ['DB', 'COMMUNICATE'] and telecom has ['ENV_ASSERTION']. The standard airline configuration should include ['DB', 'COMMUNICATE'] to evaluate whether the agent properly communicates required information (e.g., confirmation numbers, prices).

Could you clarify:

  • Was the COMMUNICATE check intentionally disabled for airline?
  • Which version/fork of tau2-bench was used for the airline evaluation?

(We verified the scores are unaffected since all COMMUNICATE failures overlap with DB failures, but we'd like to understand the discrepancy.)
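For reference, the overlap argument above can be sketched as follows (the helper and task IDs are hypothetical, purely illustrative, and not tied to tau2-bench's actual result schema):

```python
# Sketch of the overlap check: if every task that fails the COMMUNICATE
# check also fails the DB check, then dropping COMMUNICATE from
# reward_basis cannot change which tasks count as failures.

def scores_unaffected(db_failures: set, communicate_failures: set) -> bool:
    """True if all COMMUNICATE failures overlap with DB failures,
    i.e. the failed-task set is identical under ['DB'] and
    ['DB', 'COMMUNICATE']."""
    return communicate_failures <= db_failures

# Illustrative task IDs (not from the actual run):
db_fail = {"airline_003", "airline_017", "airline_042"}
comm_fail = {"airline_017", "airline_042"}

# Under either reward basis, the failed set is db_fail | comm_fail == db_fail,
# so the reported scores are unchanged.
print(scores_unaffected(db_fail, comm_fail))  # True
```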

2. Submission type should be "custom"

Since A-ReaL-Eigen was specifically trained using RL on the tau2-bench domain (rather than being a general-purpose LLM evaluated on the benchmark), this should be submitted as submission_type: "custom" rather than "standard".

Please update your submission to:

  • Set "submission_type": "custom"
  • Add details to methodology.notes explaining the training approach (e.g., that the model was post-trained with RL on tau2 bench domain - although not on the specific tasks, etc.)

This helps users understand which models were specifically optimized for this benchmark vs. general-purpose models being evaluated.
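The requested change might look like the following fragment (the field layout is illustrative only; consult the leaderboard's actual submission schema):

```json
{
  "submission_type": "custom",
  "methodology": {
    "notes": "Model was post-trained with RL on the tau2-bench domains (though not on the specific evaluation tasks)."
  }
}
```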

Let us know if you have any questions!

(cc: @benshi34)

@GurrenLagann97
Author

Thanks for your comments! We've set submission_type to "custom" and added training details in methodology.notes. Regarding your concern about the reward_basis in the airline evaluation: during our RL training pipeline we used a modified evaluation configuration with reward_basis: ['DB'] only, since DB state changes served as the primary reward signal for policy optimization. For the final evaluation in this submission, we inadvertently reused that training-time configuration instead of reverting to the standard ['DB', 'COMMUNICATE'] setting. This was an oversight, and we sincerely apologize for any confusion it may have caused.

@benshi34
Collaborator

@victorb-sierra This looks good to merge on my end. Any other comments?

@GurrenLagann97 changed the title from "Add a-real-eigen leaderboard submission" to "Add areal-sea leaderboard submission" on Feb 28, 2026
@GurrenLagann97
Author

Hi @victorb-sierra, just a gentle follow-up on this PR. I understand you might be busy — whenever you have a chance, I'd really appreciate your review. Please let me know if there's anything I should update. Thanks a lot!
