Add gpt-oss-120b leaderboard submission #134

Open
radiangle wants to merge 2 commits into sierra-research:main from radiangle:submission-gpt-oss-120b

Conversation

@radiangle

Summary

Adds evaluation results for the gpt-oss-120b model to the leaderboard.

Results (4 trials per domain)

Domain    Pass^1   Pass^2   Pass^3   Pass^4
Airline   46.0%    35.3%    29.5%    26.0%
Retail    57.5%    45.8%    39.9%    36.0%
Telecom   58.1%    44.0%    36.6%    31.6%
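
The per-k numbers above fall as k grows, which is consistent with the pass^k metric (a task counts only if all k sampled trials succeed) rather than the usual "at least one of k" metric. Below is a minimal sketch of that estimator, assuming each task's score is C(c, k) / C(n, k) for c successes out of n trials, averaged over tasks. The function name `pass_hat_k` and the sample data are hypothetical and are not taken from the tau2-bench code or from this submission's trajectories.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Average over tasks of C(c, k) / C(n, k).

    successes_per_task: successful trial counts, one entry per task.
    num_trials: trials run per task (n).
    k: number of trials that must all succeed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    denom = comb(num_trials, k)
    # comb(c, k) is 0 whenever c < k, so tasks with too few successes add nothing.
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (denom * len(successes_per_task))


# Hypothetical example: 4 tasks, 4 trials each (illustrative data only).
successes = [4, 2, 0, 3]
for k in range(1, 5):
    print(f"pass^{k} = {pass_hat_k(successes, 4, k):.4f}")
```

With 4 trials per task, the k = 4 value is simply the fraction of tasks solved in every trial, which is why it is the lowest column.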

Submission Details

  • Model: gpt-oss-120b
  • Model Organization: OpenAI
  • Submitting Organization: Nori
  • User Simulator: openai/gpt-oss-120b
  • Trials per domain: 4
  • Framework version: tau2-bench 0.2.1-dev

Agent Implementation

LangChain agent using LangGraph's create_react_agent. Implementation available at: https://github.com/radiangle/tau2-bench/blob/main/src/tau2/agent/langchain_agent.py

Verification

  • No modifications to prompts
  • No tasks omitted
  • Standard tau2-bench evaluation protocol

@victorb-sierra
Collaborator

Thank you for your submission.
cc: @benshi34

@benshi34
Collaborator

Hi, your trajectory file names are formatted incorrectly. Would you mind fixing those? Then I can merge, since everything else looks good. Thanks!

@radiangle force-pushed the submission-gpt-oss-120b branch from 26c10e4 to 625f537 on January 27, 2026 04:24
@radiangle
Author

@benshi34 we've renamed the trajectory files, could you review them again? thank you!
