Catch exceptions in validating agent_msg to handle failures gracefully #43

Open
GaganM wants to merge 34 commits into sierra-research:main from GaganM:main
GaganM commented Sep 10, 2025

Catches exceptions raised by agent_msg.validate() so that validation failures are handled gracefully instead of propagating.

This fixes #42

dhh1995 and others added 19 commits June 16, 2025 15:40
…0616

update README, add num tasks cli, update python version requirement
…on to set a TAU_DATA_DIR to point to the data. Added a fall back to local source if this is not set. Added tau2 check-data cli for people to check data install

fixed num-tasks flag. Fix display of tasks name in cli.
Fixed issues where evaluations expected agents to cancel or modify reservations
with already-flown flights, which violates airline policy stating "If any portion
of the flight has already been flown, the agent cannot help and transfer is needed."

Changes:
- Updated NQNU5R reservation flights from May 13-14 to May 20-21 (Problems 9 & 37)
- Updated EUJUY6 reservation flights from May 14/16 to May 20/22 (Problem 36)
- Fixed assertion typo in Problem 9 (May 12 -> May 22 to match action check)

These changes ensure evaluations don't expect policy violations while preserving
the original test intent.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 23 expected bookings for "Aarav Sanchez" and "Evelyn Wilson" but
the user's saved passengers are actually "Raj Sanchez" and "Liam Wilson".

Updated:
- User instructions: Changed "Aarav" to "Raj" and "Evelyn" to "Liam"
- Action arguments: Updated passenger first names in booking actions
- NL assertions: Updated expected passenger names to match

This ensures evaluations correctly check for the passengers that actually
exist in the user's account.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 35 expected DOB "1985-04-04" but user aarav_ahmed_6699's actual
DOB in the database is "1981-05-26".

Updated the expected DOB in the booking action to match the user's profile,
ensuring the evaluation checks for the correct date of birth.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 33 incorrectly expected agents to charge for 2 checked bags when the
airline policy grants Gold members 3 free checked bags in economy class.

Changes:
- Updated nonfree_baggages from 2 to 0 (since user wants 2 bags, both are free)
- Updated NL assertion from "2 non-free baggages" to "2 free baggages"

Policy reference:
- Gold members get 3 free checked bags in economy class
- User Yara Garcia is a Gold member
- She only wants 2 bags, so both should be free
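
The free-bag reasoning above can be checked with a small sketch. The helper name and its parameters are hypothetical and not taken from the repository; only the Gold economy allowance of 3 bags and the request for 2 bags come from the commit message.

```python
def nonfree_bag_count(total_bags: int, free_allowance: int) -> int:
    """Hypothetical helper: number of checked bags that must be paid for."""
    # Bags beyond the free allowance are charged; never go below zero.
    return max(0, total_bags - free_allowance)


# Gold member in economy: 3 free checked bags (from the policy reference above).
nonfree_baggages = nonfree_bag_count(total_bags=2, free_allowance=3)
print(nonfree_baggages)  # 0
```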

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 21 had ambiguous instructions about what "fastest return trip" means,
which could be interpreted as either shortest total travel time or earliest
arrival time. This led to confusion about payment methods since different
flight combinations have different costs.

Changes:
- Clarified user instruction from "fastest return trip possible" to
  "return option with the shortest total travel time (including layover)"
- This makes it unambiguous that HAT290 + HAT175 is the correct choice
  (1-hour layover vs 7-hour for the cheaper option)

The evaluation already expects:
- Flights HAT290 + HAT175 (shortest total time)
- Payment via gift_card_6276644 ($113 - smallest viable for $59 charge)
- This is correct since gift_card_7480005 ($6) would be insufficient

This fix ensures agents understand they should optimize for total travel time,
not just arrival time or cost, making the evaluation internally consistent.
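
The "smallest viable" gift card choice described above can be sketched as follows. The function is a hypothetical illustration, not code from the repository; the card ids, balances, and the $59 charge are the ones quoted in the commit message.

```python
from typing import Dict, Optional


def smallest_viable_card(balances: Dict[str, int], charge: int) -> Optional[str]:
    """Hypothetical helper: id of the lowest-balance card that covers the charge."""
    viable = {card_id: bal for card_id, bal in balances.items() if bal >= charge}
    if not viable:
        return None  # no single card can cover the charge
    return min(viable, key=viable.get)


balances = {"gift_card_6276644": 113, "gift_card_7480005": 6}
chosen = smallest_viable_card(balances, charge=59)
print(chosen)  # gift_card_6276644
```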

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 14 had incorrect cost information in the description that didn't match
the expected actions. The description claimed $871 total cost with $44 mastercard
charge, but the mathematically correct amounts are much higher.

Root cause: Description showed per-passenger costs ($871) but the booking is
for 3 passengers, making the actual total $2613.

Changes:
- Updated problem description: $871 → $2613 (total for 3 passengers)
- Updated problem description: $44 → $1786 (correct mastercard charge)
- Updated communicate_info: "44" → "1786"
- Updated NL assertion: "$44" → "$1786"

Math verification:
- Business class cost: $871 × 3 passengers = $2613 total
- Payment: $500 (max 1 certificate) + $327 (gift cards) = $827
- Mastercard charge: $2613 - $827 = $1786

The expected actions were already correct - this fixes the inconsistent description.
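
The math verification above can be reproduced in a few lines. The variable names are illustrative; all amounts are taken from the commit message.

```python
per_passenger_fare = 871   # business class cost per passenger
passengers = 3
certificate = 500          # at most one certificate per reservation
gift_cards = 327           # combined gift card balance applied

total_cost = per_passenger_fare * passengers       # 2613
covered = certificate + gift_cards                 # 827
mastercard_charge = total_cost - covered           # 1786

print(total_cost, covered, mastercard_charge)  # 2613 827 1786
```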

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 32 failed when agents used a single API call to upgrade cabin and
change flights simultaneously, even though this achieves the same result
more efficiently than the expected two-step process.

Modified user instructions to explicitly request step-by-step changes:
- User now specifically asks to upgrade cabin first and confirm
- Then separately change flights to nonstop
- Added justification: user wants to verify each change individually

This makes the two-step process user-driven rather than an arbitrary constraint,
while preserving the evaluation's intent to test sequential API operations.

The change maintains realism - a stressed passenger wanting confirmation at
each step during a family emergency is believable behavior.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 42 had a logical contradiction where the user needed to "be in Boston
on May 22" but the evaluation expected keeping a flight LEAVING Boston that
same day, causing all agents to correctly cancel it.

Changes:
- Updated user constraint: "will be in Boston on May 22" → "will be leaving Boston on May 22"
- Updated problem description to match the constraint change

This minimal change (3 words) resolves the contradiction by making the PUNERT
flight (BOS → DEN on May 22) logically consistent with the user's travel needs.

Now the evaluation correctly tests whether agents can reason that:
- User needs to leave Boston on May 22 (keep PUNERT flight)
- User cannot be in multiple cities simultaneously (cancel conflicting flights)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 39 had a misleading description claiming "tool does not allow to cancel
without a refund" when the tool actually always processes refunds regardless
of policy eligibility.

Changed description to clarify that:
- Agents must follow airline policy about refund eligibility
- Tool itself would process any cancellation (with refund)
- Agents are responsible for policy enforcement before using the tool

This aligns with the airline policy statement: "The API does not check that
cancellation rules are met, so the agent must make sure the rules apply
before calling the API!"

The evaluation correctly tests whether agents understand they should only
cancel flights that are policy-eligible for refunds, not rely on tool
validation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 14 had an impossible constraint where users would "only book if
mastercard charges are less than original" but original was $0 (paid with
gift card), making any positive charge fail the constraint.

This caused all agents to correctly refuse the booking, but the evaluation
expected them to proceed anyway.

Changes:
- Updated constraint from "less than original charges" to "less than $2000"
- Since expected mastercard charge is $1786, this allows the booking to proceed
- Creates objective, measurable criteria instead of impossible constraint

This preserves the test's intent (cost-conscious decision making) while
making the constraint achievable and realistic.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 29 expected agents to change destination from LGA to JFK, which
violates airline policy: "reservations can be modified without changing
the origin, destination, and trip type."

This caused agents following policy to receive score 0 when they correctly
refused the prohibited destination change.

Changes:
- Updated VA5SGQ initial state: DTW↔LGA connecting → DTW↔JFK direct
- Used afternoon flights initially: HAT240 (4pm) and HAT088 ($358 total)
- Updated reason_for_call: request early morning flights instead of destination change
- Agent now changes to HAT169 (4am-6am) and HAT033 ($282 total)

The scenario now tests flight time optimization (afternoon→morning) rather
than policy violations. The agent saves $76 while meeting the "arrive
before 7am" constraint, preserving test complexity without requiring
policy violations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 15 used ambiguous phrasing "cheapest economy flight" which could mean:
1. The cheapest flight overall (basic economy at $124)
2. The cheapest flight in Economy cabin class specifically ($207)

This caused well-functioning agents to reasonably choose basic economy
($124 via ATL→DFW→EWR) instead of the expected economy class answer
($207 via ATL→LGA→PHL).

Changes:
- Updated reason_for_call: "cheapest economy flight" → "cheapest flight in Economy cabin class (not Basic Economy)"
- Updated purpose descriptions to explicitly mention "Economy cabin class (not Basic Economy)"

This clarification ensures agents understand they should find the cheapest
flight specifically in the Economy cabin class, not just the overall
cheapest option which would be Basic Economy.

The policy states "basic economy is its own class, completely distinct
from economy", so this distinction needs to be explicit in user requests.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 23's description was correct but could be clearer about WHY
multiple bookings are needed. The scenario tests whether agents understand
the one-certificate-per-reservation policy constraint.

Changes:
- Updated purpose from: "Complex transaction where multiple bookings need to be made with payment efficiently split across them to minimize charges to a Mastercard."
- To: "Complex transaction testing understanding of one-certificate-per-reservation policy. Multiple bookings must be made to use three certificates, with payment efficiently split across them to minimize charges to a Mastercard."

This clarification helps developers understand that the multiple bookings
are not arbitrary but specifically required due to the policy limiting
each reservation to one certificate. The user has three certificates to
use, necessitating three separate bookings.

Policy reference: "Each reservation can use at most one travel certificate,
at most one credit card, and at most three gift cards."
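
As a rough sketch of why three certificates force three bookings, the quoted limits can be expressed as a check on the payment methods of a single reservation. The function and the string labels for payment kinds are hypothetical and not part of the repository; only the limits come from the policy reference.

```python
from collections import Counter
from typing import List

# Per-reservation limits from the policy reference above.
LIMITS = {"certificate": 1, "credit_card": 1, "gift_card": 3}


def exceeded_limits(payment_kinds: List[str]) -> List[str]:
    """Hypothetical helper: payment kinds used more often than the policy allows."""
    counts = Counter(payment_kinds)
    return [kind for kind, limit in LIMITS.items() if counts[kind] > limit]


# Three certificates on one reservation break the limit ...
print(exceeded_limits(["certificate", "certificate", "certificate"]))  # ['certificate']
# ... while one certificate plus a credit card per reservation is fine.
print(exceeded_limits(["certificate", "credit_card"]))  # []
```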

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem 34 had ambiguous instructions about what to do when requested
changes exceed the $200 budget. Agents reasonably offered economy
alternatives that saved money, but were marked wrong.

The ambiguity:
- User requests: nonstop flight next day, return delay, BUSINESS class, 2 bags
- Business class option costs $222 extra (exceeds $200 budget)
- Economy class option actually SAVES $186 (well within budget)
- Original instruction: "If the total costs for all your changes is above your budget of $200, don't make any changes"

It was unclear whether the agent should:
A) Reject ALL changes if business class exceeds budget (intended)
B) Offer economy alternatives that stay within budget (reasonable)

Changes:
- Clarified that user wants "all these changes as a complete package"
- Explicitly states "do not accept partial changes, alternatives, or downgrades like economy class"
- Makes it clear this is an all-or-nothing decision based on the exact requested combination

Cost validation:
- Current reservation: $503 (economy, no bags)
- Business option: $725 ($222 extra - over budget)
- Economy option: $317 ($186 refund - under budget)
- Bags: FREE for Gold member in both classes

The evaluation expects NO changes (empty actions array), which now
aligns with the clarified all-or-nothing instruction.
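
The cost validation above can be checked with a short sketch. The variable names are illustrative; the prices and the $200 budget are the figures from the commit message.

```python
current_price = 503    # current reservation (economy, no bags)
business_price = 725   # requested package in business class
economy_price = 317    # economy alternative
budget = 200

business_extra = business_price - current_price   # 222, over budget
economy_refund = current_price - economy_price    # 186 refunded

# All-or-nothing rule: only the exact requested (business) package counts.
make_changes = business_extra <= budget

print(business_extra, economy_refund, make_changes)  # 222 186 False
```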

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
GaganM changed the title from "Catch exceptions in validating agent_msg to handle failures gracefully" to "Catch exceptions in validating agent_msg to handle failures gracefully CLOSES #42" Sep 10, 2025
GaganM changed the title from "Catch exceptions in validating agent_msg to handle failures gracefully CLOSES #42" to "Catch exceptions in validating agent_msg to handle failures gracefully" Sep 10, 2025
GaganM marked this pull request as ready for review September 10, 2025 14:53
HYZ17 commented Nov 14, 2025

Nice fix, but the logging.error('Unable to validate agent_msg: ', e) call should use logger instead, since logging is not imported or defined.
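
A minimal sketch of the pattern under discussion, assuming a module-level logger created with the standard logging module (the project's actual logger may be configured differently). FakeAgentMessage and safe_validate are hypothetical stand-ins for illustration and are not the code in this PR.

```python
import logging

# Assumption: a module-level logger; the real module may obtain it another way.
logger = logging.getLogger(__name__)


class FakeAgentMessage:
    """Hypothetical stand-in for the real agent message type."""

    def __init__(self, content: str) -> None:
        self.content = content

    def validate(self) -> None:
        if not self.content:
            raise ValueError("agent message has no content")


def safe_validate(agent_msg) -> bool:
    """Return True if validation passes; log and return False otherwise."""
    try:
        agent_msg.validate()
    except Exception as e:
        # With the standard library, the exception needs a %s placeholder;
        # passing it as an extra argument without one does not format it in.
        logger.error("Unable to validate agent_msg: %s", e)
        return False
    return True


print(safe_validate(FakeAgentMessage("hello")))  # True
print(safe_validate(FakeAgentMessage("")))       # False
```

How the caller reacts to a False result (for example, ending the run with a termination reason) is left open here, since the source does not show it.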

Policy changes:
- Basic economy reservations cannot have flight segments modified regardless
  of subsequent cabin changes
- Cabin upgrades do not remove flight modification restrictions
- Basic economy cancellation limited to: 24hr window, airline cancellation,
  or insurance with covered reason (health/weather)
- Basic economy refunds limited to airline credits unless user has insurance

Task fixes:
- Task 7: Accept basic economy cancellation refusal, expected total $1,004
- Task 39: Add health reason for cancellation
- Task 44: Add health reason and explicit credit card payment preference

DB updates:
- K1NW8N: basic_economy -> economy, insurance: yes
- OWZ4XL: basic_economy -> economy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@victorb-sierra
Collaborator

Sorry, just getting to this! We now have user and agent TerminationReason (class TerminationReason(str, Enum):). Added comments.

Development

Successfully merging this pull request may close these issues.

Error handling for agent_msg.validate()

5 participants