Add areal-sea leaderboard submission #152

Open
GurrenLagann97 wants to merge 3 commits into sierra-research:main from GurrenLagann97:submission-areal-eigen

Conversation


@GurrenLagann97 GurrenLagann97 commented Feb 2, 2026

Summary

Adding evaluation results for areal-eigen model to the leaderboard.

Results (4 trials per domain)

| Domain  | Pass@1 | Pass@2 | Pass@3 | Pass@4 |
|---------|--------|--------|--------|--------|
| Airline | 71.0%  | 68.0%  | 66.5%  | 66.0%  |
| Retail  | 84.2%  | 75.2%  | 69.1%  | 64.9%  |
| Telecom | 94.1%  | 89.3%  | 85.5%  | 82.5%  |

Submission Details

  • Model: areal-sea
  • Model Organization: Tsinghua & eigen
  • Submitting Organization: Tsinghua & eigen
  • User Simulator: openai/gpt-4.1-2025-04-14
  • Trials per domain: 4
  • Framework version: tau2-bench 0.2.1-dev

Verification

  • No modifications to prompts
  • No tasks omitted
  • Standard tau2-bench evaluation protocol

@victorb-sierra
Collaborator

Thanks for submitting to the tau2-bench leaderboard! The A-ReaL paper is a really interesting approach to post-training agents with verifiable rewards.

Before we can merge, we have a couple of questions/requests:

1. Airline task configuration

During validation, we noticed that the airline trajectory has reward_basis: ['DB'] while retail has ['DB', 'COMMUNICATE'] and telecom has ['ENV_ASSERTION']. The standard airline configuration should include ['DB', 'COMMUNICATE'] to evaluate whether the agent properly communicates required information (e.g., confirmation numbers, prices).

Could you clarify:

  • Was the COMMUNICATE check intentionally disabled for airline?
  • Which version/fork of tau2-bench was used for the airline evaluation?

(We verified the scores are unaffected since all COMMUNICATE failures overlap with DB failures, but we'd like to understand the discrepancy.)
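For reference, the overlap argument above can be sketched as follows (the helper and task IDs are hypothetical, purely illustrative, and not tied to tau2-bench's actual result schema):

```python
# Sketch of the overlap check: if every task that fails the COMMUNICATE
# check also fails the DB check, then dropping COMMUNICATE from
# reward_basis cannot change which tasks count as failures.

def scores_unaffected(db_failures: set, communicate_failures: set) -> bool:
    """True if all COMMUNICATE failures overlap with DB failures,
    i.e. the failed-task set is identical under ['DB'] and
    ['DB', 'COMMUNICATE']."""
    return communicate_failures <= db_failures

# Illustrative task IDs (not from the actual run):
db_fail = {"airline_003", "airline_017", "airline_042"}
comm_fail = {"airline_017", "airline_042"}

# Under either reward basis, the failed set is db_fail | comm_fail == db_fail,
# so the reported scores are unchanged.
print(scores_unaffected(db_fail, comm_fail))  # True
```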

2. Submission type should be "custom"

Since A-ReaL-Eigen was specifically trained using RL on the tau2-bench domain (rather than being a general-purpose LLM evaluated on the benchmark), this should be submitted as submission_type: "custom" rather than "standard".

Please update your submission to:

  • Set "submission_type": "custom"
  • Add details to methodology.notes explaining the training approach (e.g., that the model was post-trained with RL on tau2 bench domain - although not on the specific tasks, etc.)

This helps users understand which models were specifically optimized for this benchmark vs. general-purpose models being evaluated.
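The requested change might look like the following fragment (the field layout is illustrative only; consult the leaderboard's actual submission schema):

```json
{
  "submission_type": "custom",
  "methodology": {
    "notes": "Model was post-trained with RL on the tau2-bench domains (though not on the specific evaluation tasks)."
  }
}
```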

Let us know if you have any questions!

(cc: @benshi34)

@GurrenLagann97
Author

Thanks for your comments! We've set submission_type to "custom" and added training details in methodology.notes. Regarding your concern about the reward_basis in the airline evaluation: during our RL training pipeline we used a modified evaluation configuration with reward_basis: ['DB'] only, since DB state changes served as the primary reward signal for policy optimization. For the final evaluation in this submission, we inadvertently reused that training-time configuration instead of reverting to the standard ['DB', 'COMMUNICATE'] setting. This was an oversight, and we sincerely apologize for any confusion it may have caused.

@benshi34
Collaborator

@victorb-sierra This looks good to merge on my end. Any other comments?

@GurrenLagann97 changed the title from "Add a-real-eigen leaderboard submission" to "Add areal-sea leaderboard submission" on Feb 28, 2026
@GurrenLagann97
Author

Hi @victorb-sierra, just a gentle follow-up on this PR. I understand you might be busy — whenever you have a chance, I'd really appreciate your review. Please let me know if there's anything I should update. Thanks a lot!
