Catch exceptions in validating agent_msg to handle failures gracefully by GaganM · Pull Request #43 · sierra-research/tau2-bench

GaganM · 2025-09-10T14:48:14Z

Catches exceptions raised by agent_msg.validate() so that validation failures are handled gracefully instead of propagating.
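A minimal sketch of the pattern described above, not the actual change in this PR. The `AgentMessage` class, its validation rule, and the `validate_or_error` helper are hypothetical stand-ins; only the idea of wrapping `agent_msg.validate()` in a try/except comes from the description.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class AgentMessage:
    """Hypothetical stand-in for the real agent message type."""

    content: Optional[str] = None
    tool_calls: Optional[list] = None

    def validate(self) -> None:
        # Assumed rule for illustration: a message needs text or tool calls.
        if not self.content and not self.tool_calls:
            raise ValueError("message has neither content nor tool calls")


def validate_or_error(agent_msg: AgentMessage) -> Optional[str]:
    """Return None if the message validates, otherwise the error text."""
    try:
        agent_msg.validate()
    except Exception as exc:
        # Log and report the failure instead of letting it propagate.
        logger.error("Unable to validate agent_msg: %s", exc)
        return str(exc)
    return None
```

The caller can then decide how to end or continue the run based on the returned error text rather than crashing on the exception.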

This fixes #42

…0616 update README, add num tasks cli, update python version requirement

…on to set a TAU_DATA_DIR to point to the data. Added a fallback to the local source if this is not set. Added a tau2 check-data CLI command for people to check their data install. Fixed the num-tasks flag. Fixed the display of task names in the CLI.
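The fallback behaviour described in this commit message could look roughly like the sketch below. Only the `TAU_DATA_DIR` variable name comes from the commit message; the `resolve_data_dir` function, its parameters, and the fallback path are hypothetical and are not taken from the tau2 code base.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(
    local_fallback: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Prefer TAU_DATA_DIR when it is set, else use the local source path."""
    env = os.environ if env is None else env
    configured = env.get("TAU_DATA_DIR")
    if configured:
        return Path(configured)
    # Assumed fallback: a data directory that ships with the local source.
    return local_fallback
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.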

Fixed issues where evaluations expected agents to cancel or modify reservations with already-flown flights, which violates airline policy stating "If any portion of the flight has already been flown, the agent cannot help and transfer is needed."

Changes:
- Updated NQNU5R reservation flights from May 13-14 to May 20-21 (Problems 9 & 37)
- Updated EUJUY6 reservation flights from May 14/16 to May 20/22 (Problem 36)
- Fixed assertion typo in Problem 9 (May 12 -> May 22 to match action check)

These changes ensure evaluations don't expect policy violations while preserving the original test intent.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 23 expected bookings for "Aarav Sanchez" and "Evelyn Wilson" but the user's saved passengers are actually "Raj Sanchez" and "Liam Wilson".

Updated:
- User instructions: Changed "Aarav" to "Raj" and "Evelyn" to "Liam"
- Action arguments: Updated passenger first names in booking actions
- NL assertions: Updated expected passenger names to match

This ensures evaluations correctly check for the passengers that actually exist in the user's account.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 35 expected DOB "1985-04-04" but user aarav_ahmed_6699's actual DOB in the database is "1981-05-26".

Updated the expected DOB in the booking action to match the user's profile, ensuring the evaluation checks for the correct date of birth.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 33 incorrectly expected agents to charge for 2 checked bags when the airline policy grants Gold members 3 free checked bags in economy class.

Changes:
- Updated nonfree_baggages from 2 to 0 (since user wants 2 bags, both are free)
- Updated NL assertion from "2 non-free baggages" to "2 free baggages"

Policy reference:
- Gold members get 3 free checked bags in economy class
- User Yara Garcia is a Gold member
- She only wants 2 bags, so both should be free

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
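The baggage reasoning above can be checked with a small sketch. The `nonfree_bags` helper is hypothetical; the only policy figure used is the one quoted in the commit message (3 free checked bags for a Gold member in economy).

```python
def nonfree_bags(total_bags: int, free_allowance: int) -> int:
    """Number of checked bags that must be paid for."""
    return max(0, total_bags - free_allowance)


# Figure from the commit message: Gold member in economy gets 3 free bags.
GOLD_ECONOMY_FREE_BAGS = 3

# Yara Garcia wants 2 bags, so none of them is chargeable.
problem_33_nonfree = nonfree_bags(total_bags=2, free_allowance=GOLD_ECONOMY_FREE_BAGS)
```

This matches the updated expectation of `nonfree_baggages` equal to 0.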

Problem 21 had ambiguous instructions about what "fastest return trip" means, which could be interpreted as either shortest total travel time or earliest arrival time. This led to confusion about payment methods since different flight combinations have different costs.

Changes:
- Clarified user instruction from "fastest return trip possible" to "return option with the shortest total travel time (including layover)"
- This makes it unambiguous that HAT290 + HAT175 is the correct choice (1-hour layover vs 7-hour for the cheaper option)

The evaluation already expects:
- Flights HAT290 + HAT175 (shortest total time)
- Payment via gift_card_6276644 ($113 - smallest viable for $59 charge)
- This is correct since gift_card_7480005 ($6) would be insufficient

This fix ensures agents understand they should optimize for total travel time, not just arrival time or cost, making the evaluation internally consistent.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
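The "smallest viable" payment choice can be illustrated as follows. This is only a sketch: the `smallest_viable_payment` helper is hypothetical, and the balances and the $59 charge are the figures quoted in the commit message.

```python
from typing import Dict, Optional


def smallest_viable_payment(balances: Dict[str, int], charge: int) -> Optional[str]:
    """Pick the payment method with the smallest balance that covers the charge."""
    viable = {pid: bal for pid, bal in balances.items() if bal >= charge}
    if not viable:
        return None
    return min(viable, key=viable.get)


# Balances and charge as stated in the commit message.
gift_cards = {"gift_card_6276644": 113, "gift_card_7480005": 6}
chosen = smallest_viable_payment(gift_cards, charge=59)
```

With a $59 charge, the $6 card is filtered out and `gift_card_6276644` is selected.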

Problem 14 had incorrect cost information in the description that didn't match the expected actions. The description claimed $871 total cost with $44 mastercard charge, but the mathematically correct amounts are much higher.

Root cause: Description showed per-passenger costs ($871) but the booking is for 3 passengers, making the actual total $2613.

Changes:
- Updated problem description: $871 → $2613 (total for 3 passengers)
- Updated problem description: $44 → $1786 (correct mastercard charge)
- Updated communicate_info: "44" → "1786"
- Updated NL assertion: "$44" → "$1786"

Math verification:
- Business class cost: $871 × 3 passengers = $2613 total
- Payment: $500 (max 1 certificate) + $327 (gift cards) = $827
- Mastercard charge: $2613 - $827 = $1786

The expected actions were already correct - this fixes the inconsistent description.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
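The math verification above can be reproduced directly. The variable names are illustrative only; every number comes from the commit message.

```python
# Figures from the commit message for Problem 14.
PER_PASSENGER_FARE = 871   # business class, per passenger
PASSENGERS = 3
CERTIFICATE = 500          # at most one certificate per reservation
GIFT_CARDS = 327           # combined gift card balance applied

total_cost = PER_PASSENGER_FARE * PASSENGERS
non_card_payment = CERTIFICATE + GIFT_CARDS
mastercard_charge = total_cost - non_card_payment

print(total_cost, non_card_payment, mastercard_charge)  # 2613 827 1786
```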

Problem 32 failed when agents used a single API call to upgrade cabin and change flights simultaneously, even though this achieves the same result more efficiently than the expected two-step process.

Modified user instructions to explicitly request step-by-step changes:
- User now specifically asks to upgrade cabin first and confirm
- Then separately change flights to nonstop
- Added justification: user wants to verify each change individually

This makes the two-step process user-driven rather than an arbitrary constraint, while preserving the evaluation's intent to test sequential API operations. The change maintains realism - a stressed passenger wanting confirmation at each step during a family emergency is believable behavior.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 42 had a logical contradiction where the user needed to "be in Boston on May 22" but the evaluation expected keeping a flight LEAVING Boston that same day, causing all agents to correctly cancel it.

Changes:
- Updated user constraint: "will be in Boston on May 22" → "will be leaving Boston on May 22"
- Updated problem description to match the constraint change

This minimal change (3 words) resolves the contradiction by making the PUNERT flight (BOS → DEN on May 22) logically consistent with the user's travel needs.

Now the evaluation correctly tests whether agents can reason that:
- User needs to leave Boston on May 22 (keep PUNERT flight)
- User cannot be in multiple cities simultaneously (cancel conflicting flights)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 39 had a misleading description claiming "tool does not allow to cancel without a refund" when the tool actually always processes refunds regardless of policy eligibility.

Changed description to clarify that:
- Agents must follow airline policy about refund eligibility
- Tool itself would process any cancellation (with refund)
- Agents are responsible for policy enforcement before using the tool

This aligns with the airline policy statement: "The API does not check that cancellation rules are met, so the agent must make sure the rules apply before calling the API!"

The evaluation correctly tests whether agents understand they should only cancel flights that are policy-eligible for refunds, not rely on tool validation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 14 had an impossible constraint where users would "only book if mastercard charges are less than original" but original was $0 (paid with gift card), making any positive charge fail the constraint. This caused all agents to correctly refuse the booking, but the evaluation expected them to proceed anyway.

Changes:
- Updated constraint from "less than original charges" to "less than $2000"
- Since expected mastercard charge is $1786, this allows the booking to proceed
- Creates objective, measurable criteria instead of impossible constraint

This preserves the test's intent (cost-conscious decision making) while making the constraint achievable and realistic.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 29 expected agents to change destination from LGA to JFK, which violates airline policy: "reservations can be modified without changing the origin, destination, and trip type." This caused agents following policy to receive score 0 when they correctly refused the prohibited destination change.

Changes:
- Updated VA5SGQ initial state: DTW↔LGA connecting → DTW↔JFK direct
- Used afternoon flights initially: HAT240 (4pm) and HAT088 ($358 total)
- Updated reason_for_call: request early morning flights instead of destination change
- Agent now changes to HAT169 (4am-6am) and HAT033 ($282 total)

The scenario now tests flight time optimization (afternoon→morning) rather than policy violations. The agent saves $76 while meeting the "arrive before 7am" constraint, preserving test complexity without requiring policy violations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 15 used ambiguous phrasing "cheapest economy flight" which could mean:
1. The cheapest flight overall (basic economy at $124)
2. The cheapest flight in Economy cabin class specifically ($207)

This caused well-functioning agents to reasonably choose basic economy ($124 via ATL→DFW→EWR) instead of the expected economy class answer ($207 via ATL→LGA→PHL).

Changes:
- Updated reason_for_call: "cheapest economy flight" → "cheapest flight in Economy cabin class (not Basic Economy)"
- Updated purpose descriptions to explicitly mention "Economy cabin class (not Basic Economy)"

This clarification ensures agents understand they should find the cheapest flight specifically in the Economy cabin class, not just the overall cheapest option which would be Basic Economy. The policy states "basic economy is its own class, completely distinct from economy", so this distinction needs to be explicit in user requests.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 23's description was correct but could be clearer about WHY multiple bookings are needed. The scenario tests whether agents understand the one-certificate-per-reservation policy constraint.

Changes:
- Updated purpose from: "Complex transaction where multiple bookings need to be made with payment efficiently split across them to minimize charges to a Mastercard."
- To: "Complex transaction testing understanding of one-certificate-per-reservation policy. Multiple bookings must be made to use three certificates, with payment efficiently split across them to minimize charges to a Mastercard."

This clarification helps developers understand that the multiple bookings are not arbitrary but specifically required due to the policy limiting each reservation to one certificate. The user has three certificates to use, necessitating three separate bookings.

Policy reference: "Each reservation can use at most one travel certificate, at most one credit card, and at most three gift cards."

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem 34 had ambiguous instructions about what to do when requested changes exceed the $200 budget. Agents reasonably offered economy alternatives that saved money, but were marked wrong.

The ambiguity:
- User requests: nonstop flight next day, return delay, BUSINESS class, 2 bags
- Business class option costs $222 extra (exceeds $200 budget)
- Economy class option actually SAVES $186 (well within budget)
- Original instruction: "If the total costs for all your changes is above your budget of $200, don't make any changes"

It was unclear whether to:
A) Reject ALL changes if business class exceeds budget (intended)
B) Offer economy alternatives that stay within budget (reasonable)

Changes:
- Clarified that user wants "all these changes as a complete package"
- Explicitly states "do not accept partial changes, alternatives, or downgrades like economy class"
- Makes it clear this is an all-or-nothing decision based on the exact requested combination

Cost validation:
- Current reservation: $503 (economy, no bags)
- Business option: $725 ($222 extra - over budget)
- Economy option: $317 ($186 refund - under budget)
- Bags: FREE for Gold member in both classes

The evaluation expects NO changes (empty actions array), which now aligns with the clarified all-or-nothing instruction.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
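The cost validation and the all-or-nothing rule can be sketched as below. The `package_within_budget` helper is hypothetical and only mirrors the user instruction quoted in the commit message; the dollar amounts are the ones listed under "Cost validation".

```python
CURRENT_COST = 503    # economy, no bags
BUSINESS_COST = 725   # the exact package the user asked for
ECONOMY_COST = 317    # the alternative some agents offered
BUDGET = 200


def package_within_budget(current: int, requested: int, budget: int) -> bool:
    """All-or-nothing: accept only if the requested package fits the budget."""
    return (requested - current) <= budget


business_delta = BUSINESS_COST - CURRENT_COST   # extra cost of the requested package
economy_delta = ECONOMY_COST - CURRENT_COST     # negative value means a refund
accept_requested = package_within_budget(CURRENT_COST, BUSINESS_COST, BUDGET)
```

Because only the requested business-class package is evaluated, the cheaper economy option never enters the decision, which is why the expected actions array is empty.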

HYZ17 · 2025-11-14T03:20:22Z

Nice fix, but the call logging.error('Unable to validate agent_msg: ', e) should use logger instead of logging, since logging is not imported or defined.
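A sketch of what the corrected call might look like, assuming a standard-library logger; whether the project's `logger` comes from the standard library or from another logging package is not stated in this thread, and the logger name and the `log_validation_failure` helper are hypothetical. With the standard library, extra positional arguments are interpolated with %-style formatting, so the exception needs a `%s` placeholder to appear in the message.

```python
import logging

# Hypothetical module-level logger; the real module may define it differently.
logger = logging.getLogger("agent_msg_validation")


def log_validation_failure(exc: Exception) -> None:
    """Log a validation failure with the exception text included."""
    # The %s placeholder is required for stdlib logging to format `exc`.
    logger.error("Unable to validate agent_msg: %s", exc)
```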

Policy changes:
- Basic economy reservations cannot have flight segments modified regardless of subsequent cabin changes
- Cabin upgrades do not remove flight modification restrictions
- Basic economy cancellation limited to: 24hr window, airline cancellation, or insurance with covered reason (health/weather)
- Basic economy refunds limited to airline credits unless user has insurance

Task fixes:
- Task 7: Accept basic economy cancellation refusal, expected total $1,004
- Task 39: Add health reason for cancellation
- Task 44: Add health reason and explicit credit card payment preference

DB updates:
- K1NW8N: basic_economy -> economy, insurance: yes
- OWZ4XL: basic_economy -> economy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

victorb-sierra · 2025-11-24T01:45:53Z

Sorry, just getting to this! We now have user and agent TerminationReason (tau2-bench/src/tau2/data_model/simulation.py, line 317 in a969a0c: `class TerminationReason(str, Enum):`). Added comments.

…h/tau2-bench

dhh1995 and others added 19 commits June 16, 2025 15:40

update README, update type in fig, add num tasks cli

8ca29bd

Merge pull request sierra-research#4 from sierra-research/fix_readme_…

15194be

…0616 update README, add num tasks cli, update python version requirement

Merge branch 'main' into dev

b19f077

Catch exceptions in validating agent_msg to handle failures gracefully

6c81cf4

GaganM changed the title ~~Catch exceptions in validating agent_msg to handle failures gracefully~~ Catch exceptions in validating agent_msg to handle failures gracefully CLOSES #42 Sep 10, 2025

GaganM changed the title ~~Catch exceptions in validating agent_msg to handle failures gracefully CLOSES #42~~ Catch exceptions in validating agent_msg to handle failures gracefully Sep 10, 2025

GaganM marked this pull request as ready for review September 10, 2025 14:53

GaganM requested a review from victorb-sierra as a code owner September 10, 2025 14:53

GaganM added 4 commits December 4, 2025 12:25

Merge commit 'refs/pull/26/head' of https://github.com/sierra-researc…

ca9949e

…h/tau2-bench

Merge commit 'refs/pull/27/head' of https://github.com/sierra-researc…

8489b09

…h/tau2-bench

Merge commit 'refs/pull/28/head' of https://github.com/sierra-researc…

6ee979f

…h/tau2-bench

Merge commit 'refs/pull/29/head' of https://github.com/sierra-researc…

86385f7

…h/tau2-bench

GaganM added 10 commits December 4, 2025 12:26

Merge commit 'refs/pull/30/head' of https://github.com/sierra-researc…

53e95bf

…h/tau2-bench

Merge commit 'refs/pull/31/head' of https://github.com/sierra-researc…

e2e5ac6

…h/tau2-bench

Merge commit 'refs/pull/33/head' of https://github.com/sierra-researc…

4cbf78a

…h/tau2-bench

Merge commit 'refs/pull/34/head' of https://github.com/sierra-researc…

1e5776c

…h/tau2-bench

Merge commit 'refs/pull/35/head' of https://github.com/sierra-researc…

ddb80d3

…h/tau2-bench

Merge commit 'refs/pull/36/head' of https://github.com/sierra-researc…

d46e4a3

…h/tau2-bench

Merge commit 'refs/pull/37/head' of https://github.com/sierra-researc…

7e027db

…h/tau2-bench

Merge commit 'refs/pull/38/head' of https://github.com/sierra-researc…

89620a4

…h/tau2-bench

Merge commit 'refs/pull/39/head' of https://github.com/sierra-researc…

309bdb2

…h/tau2-bench

Merge commit 'refs/pull/94/head' of https://github.com/sierra-researc…

81b4ea5

…h/tau2-bench

GaganM wants to merge 34 commits into sierra-research:main from GaganM:main

5 participants
