Clarify telecom data refueling rule in manual by carol-xrl · Pull Request #105 · sierra-research/tau2-bench

carol-xrl · 2025-12-04T14:10:34Z

Summary

This PR resolves the reward logic flaw described in issue #104 by updating the telecom manual, replacing the previous wording:

“The maximum amount of data that can be refueled is 2GB.”

with the more explicit rule:

“Whenever refueling is considered, the amount of data to be refueled must be exactly 2GB.”

Only this line in the manual is modified.

Why This Change Is Needed

The previous manual described a maximum limit, suggesting that the assistant may refuel any amount up to 2GB.
However, the reward function uses a strict assertion:


assert_data_refueling_amount == 2.0

meaning the evaluation framework requires exactly 2GB whenever refueling occurs.

This mismatch causes reasonable assistant behavior (e.g., refueling less than 2GB) to be incorrectly judged as a failure, resulting in an inaccurate and unfair evaluation score.
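
The mismatch can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the tau2-bench code base; the lower bound of 0 for the old rule is also an assumption. The sketch only models the two manual wordings and the strict `== 2.0` check quoted above.

```python
MAX_REFUEL_GB = 2.0  # the 2GB figure quoted in both manual wordings


def old_manual_allows(amount_gb: float) -> bool:
    # Old wording: "The maximum amount of data that can be refueled is 2GB."
    # Assumption: any positive amount up to the maximum is acceptable.
    return 0.0 < amount_gb <= MAX_REFUEL_GB


def new_manual_allows(amount_gb: float) -> bool:
    # New wording: the refueled amount "must be exactly 2GB".
    return amount_gb == MAX_REFUEL_GB


def reward_check_passes(amount_gb: float) -> bool:
    # Hypothetical stand-in for the strict equality check in the reward function.
    return amount_gb == 2.0


def disagreements(rule, amounts):
    # Amounts where the manual rule and the reward check give different verdicts.
    return [a for a in amounts if rule(a) != reward_check_passes(a)]


candidates = [0.5, 1.0, 1.5, 2.0]
old_conflicts = disagreements(old_manual_allows, candidates)
new_conflicts = disagreements(new_manual_allows, candidates)

print("old wording conflicts:", old_conflicts)
print("new wording conflicts:", new_conflicts)
```

Under these assumptions, every amount below 2GB is permitted by the old wording but fails the reward check, while the new wording and the reward check agree on all candidates.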

[fix issue sierra-research#104: Reward Logic Flaw: Hard-Coded 2GB Refuel Requirement Causes Incorrect Evaluation] by replacing "The maximum amount of data that can be refueled is 2GB" with a clearer rule: "Whenever refueling is considered, the amount of data to be refueled must be exactly 2GB."

[fix issue#104: Reward Logic Flaw: Hard-Coded 2GB Refuel Requirement Causes Incorrect Evaluation] by replacing "The maximum amount of data that can be refueled is 2GB" with a clearer rule: "Whenever refueling is considered, the amount of data to be refueled must be exactly 2GB."

carol-xrl added 2 commits December 4, 2025 21:56

carol-xrl requested a review from victorb-sierra as a code owner December 4, 2025 14:10

carol-xrl wants to merge 2 commits into sierra-research:main from carol-xrl:main
