Problem
Current APIs in Tau2 tasks may not reflect real-world usage or complexity, limiting the benchmark’s usefulness for evaluating agents in realistic scenarios.
Proposal
- P0: Update APIs to be more realistic and reflective of tools used in production.
- P1: Introduce failure modes or constraints that agents must handle.
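To make the P1 item concrete, here is a minimal sketch of one possible approach: wrapping a tool so that it fails at a configurable rate, driven by a seeded random generator so that runs stay reproducible. Everything in it (`FlakyTool`, `ToolError`, `get_order_status`, `run_calls`, the 0.3 failure rate) is hypothetical and for illustration only; none of these names come from the Tau2 codebase.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable


class ToolError(Exception):
    """Hypothetical error an agent would see when a tool call fails."""


@dataclass
class FlakyTool:
    """Wraps a tool function and injects failures from a seeded RNG."""

    func: Callable[..., Any]
    failure_rate: float  # probability in [0, 1] that a call fails
    seed: int  # fixed seed so the failure pattern is reproducible

    def __post_init__(self) -> None:
        # A private RNG keeps the failure pattern independent of global state.
        self._rng = random.Random(self.seed)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self._rng.random() < self.failure_rate:
            raise ToolError("simulated transient failure (e.g. a timeout)")
        return self.func(*args, **kwargs)


def get_order_status(order_id: str) -> dict:
    # Hypothetical stand-in for a benchmark tool, not a real Tau2 API.
    return {"order_id": order_id, "status": "shipped"}


def run_calls(seed: int, n: int = 20) -> list[str]:
    """Call the flaky tool n times and record which calls failed."""
    tool = FlakyTool(get_order_status, failure_rate=0.3, seed=seed)
    outcomes = []
    for i in range(n):
        try:
            tool(f"order-{i}")
            outcomes.append("ok")
        except ToolError:
            outcomes.append("error")
    return outcomes
```

Because the failure pattern depends only on the seed, two runs with the same seed fail on the same calls. That is one way an evaluation could include failures and still be repeatable.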
Discussion Points
- Which APIs should be prioritized for realism improvements?
- How should failure modes be designed to balance realism with testability?
- Any compatibility concerns with existing tasks or evaluations?
- Is it easy to agree on what "realistic" APIs are?
Next Steps (after consensus)
- Break proposed changes into actionable Feature/Task issues.
- Assign issues to the v3.0 milestone and v3 branch for PRs.