Submission: Claude Sonnet 4.5 with extended(interleaved) thinking and trajectories by Hrithik2212 · Pull Request #73 · sierra-research/tau2-bench

Hrithik2212 · 2025-10-31T12:27:27Z

Hi,

This is a pull request for Claude Sonnet 4.5, with no changes to the base prompts and with full trajectories included.

victorb-sierra · 2025-11-04T17:40:15Z

Thank you! @benshi34 can you take a look at this?

benshi34 · 2025-11-04T17:45:04Z

Hi @Hrithik2212, thanks for submitting your trajectories! Quick question: Why are the telecom scores much lower than the reported scores? Are you affiliated with Anthropic?

Hrithik2212 · 2025-11-05T05:44:46Z

Hi @benshi34,

This run used the vanilla prompts; the only change was integrating interleaved thinking in the code. From what I understand, the Anthropic devs mentioned that they added prompt addenda to the telecom agent policy and the user prompt to avoid failure modes caused by the user ending the interaction incorrectly, which would explain the higher reported telecom scores.

Hrithik2212 requested a review from victorb-sierra as a code owner October 31, 2025 12:27

Hrithik2212 changed the title ~~Submission: Claude Sonnet 4.5 with extended(interleaved-thinking) and trajectories~~ Submission: Claude Sonnet 4.5 with extended(interleaved) thinking and trajectories Oct 31, 2025

victorb-sierra assigned benshi34 Nov 4, 2025

Submission Claude Sonnet 4.5 with extended(interleaved-thinking)

bfe508e

shivanibokadia-vl force-pushed the aide-sonnet-4-5 branch from 61f4304 to bfe508e November 9, 2025 14:13

Hrithik2212 wants to merge 1 commit into sierra-research:main from shivanibokadia-vl:aide-sonnet-4-5
