Add areal-sea leaderboard submission #152
GurrenLagann97 wants to merge 3 commits into sierra-research:main from
Thanks for submitting to the tau2-bench leaderboard! The A-ReaL paper is a really interesting approach to post-training agents with verifiable rewards. Before we can merge, we have a couple of questions/requests:

1. Airline task configuration

During validation, we noticed that the airline trajectory has

Could you clarify:

(We verified the scores are unaffected since all COMMUNICATE failures overlap with DB failures, but we'd like to understand the discrepancy.)

2. Submission type should be "custom"

Since A-ReaL-Eigen was specifically trained using RL on the tau2-bench domain (rather than being a general-purpose LLM evaluated on the benchmark), this should be submitted as

Please update your submission to:

This helps users understand which models were specifically optimized for this benchmark vs. general-purpose models being evaluated.

Let us know if you have any questions! (cc: @benshi34)
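The overlap argument in point 1 (scores are unaffected because every COMMUNICATE failure is also a DB failure) can be illustrated with a small set check. This is only a sketch: the task IDs are hypothetical, `passing_tasks` is an illustrative helper, and nothing here reflects the actual tau2-bench result format or the submission's data.

```python
def passing_tasks(all_tasks, *failure_sets):
    """Return the tasks that appear in none of the given failure sets."""
    failed = set().union(*failure_sets)
    return set(all_tasks) - failed


# Hypothetical task IDs, for illustration only (not taken from the submission).
all_tasks = {0, 1, 2, 3, 4, 5}
db_failures = {1, 3, 4}
communicate_failures = {3, 4}

# Every COMMUNICATE failure is also a DB failure (subset check).
overlap_is_complete = communicate_failures <= db_failures

# Passing set with and without counting COMMUNICATE failures.
with_comm = passing_tasks(all_tasks, db_failures, communicate_failures)
without_comm = passing_tasks(all_tasks, db_failures)

# If the overlap is complete, the passing set cannot change.
scores_unaffected = with_comm == without_comm
```

When the subset check holds, dropping or keeping the COMMUNICATE criterion leaves the set of passing tasks unchanged, which is why the reported scores do not move.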
Thanks for your comments! We've already set submission type to
@victorb-sierra This looks good to merge on my end. Any other comments?
Hi @victorb-sierra, just a gentle follow-up on this PR. I understand you might be busy; whenever you have a chance, I'd really appreciate your review.
Summary
Adds evaluation results for the areal-eigen model to the leaderboard.
Results (4 trials per domain)
Submission Details
Verification