Feature/tool call error tracking and config by songwangnlp · Pull Request #85 · sierra-research/tau2-bench

songwangnlp · 2025-11-17T23:51:13Z

No description provided.

This commit adds robust error tracking for tool call parsing and fixes a critical bug where empty tool_calls lists caused IndexError crashes in the orchestrator.

Changes:

1. Fixed is_tool_call() method (message.py)
   - Now checks for non-empty list, not just non-None
   - Prevents treating empty tool_calls as valid tool calls
   - Fixes "list index out of range" error in orchestrator

2. Added tool call error tracking system (llm_utils.py)
   - Global counters: TOOL_CALL_ERROR_COUNTS, TOOL_CALL_ERROR_DETAILS
   - Tracks 6 error types:
     * JSON_DECODE_ERROR: Initial JSON parsing failures
     * UNEXPECTED_PARSE_ERROR: Unexpected exceptions during parsing
     * DOUBLE_ENCODED_STRING: Arguments are double-encoded strings
     * DOUBLE_DECODE_FAILURE: Second decode attempt fails
     * NULL_ARGUMENTS: Arguments are null/None
     * UNEXPECTED_TYPE: Arguments have unexpected type
   - Enhanced error logging with detailed context (tool name, raw args)
   - Converts empty tool_calls list to None to maintain consistency

3. Added utility functions for error analysis (llm_utils.py)
   - save_tool_call_error_analysis(): Export error stats to file
   - get_tool_call_error_stats(): Get current error counts
   - reset_tool_call_error_tracking(): Clear error counters

4. Improved error handling
   - Tool calls with parsing errors now use empty dict instead of crashing
   - Preserves backward compatibility
   - All errors are logged and tracked for analysis

Bug Fix:

The orchestrator was crashing with "list index out of range" when tool_calls was an empty list. This occurred because is_tool_call() returned True for empty lists, causing the orchestrator to attempt accessing tool_msgs[0] on an empty list.

Benefits:

- Eliminates crashes from empty tool calls
- Provides visibility into tool call parsing issues
- Enables analysis of model output quality
- Maintains backward compatibility with existing code
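For readers without the diff at hand, the empty-list fix and the per-type error counting can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Message` dataclass and the `parse_tool_arguments` helper are hypothetical stand-ins, only the counter name and the error-type names come from the commit message, and UNEXPECTED_PARSE_ERROR is left out for brevity.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional

# Name taken from the commit message; the Counter type is an assumption.
TOOL_CALL_ERROR_COUNTS: Counter = Counter()


@dataclass
class Message:
    """Hypothetical, simplified stand-in for the class in message.py."""

    content: Optional[str] = None
    tool_calls: Optional[list] = None

    def is_tool_call(self) -> bool:
        # bool([]) is False, so both None and an empty list are rejected.
        return bool(self.tool_calls)


def parse_tool_arguments(raw: Any) -> dict:
    """Hypothetical helper: parse raw arguments, counting failures by type.

    Returns an empty dict instead of raising when parsing fails.
    """
    if raw is None:
        TOOL_CALL_ERROR_COUNTS["NULL_ARGUMENTS"] += 1
        return {}
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        TOOL_CALL_ERROR_COUNTS["UNEXPECTED_TYPE"] += 1
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        TOOL_CALL_ERROR_COUNTS["JSON_DECODE_ERROR"] += 1
        return {}
    if isinstance(parsed, str):
        # The arguments were JSON-encoded twice; try decoding once more.
        TOOL_CALL_ERROR_COUNTS["DOUBLE_ENCODED_STRING"] += 1
        try:
            parsed = json.loads(parsed)
        except json.JSONDecodeError:
            TOOL_CALL_ERROR_COUNTS["DOUBLE_DECODE_FAILURE"] += 1
            return {}
    if parsed is None:
        TOOL_CALL_ERROR_COUNTS["NULL_ARGUMENTS"] += 1
        return {}
    if not isinstance(parsed, dict):
        TOOL_CALL_ERROR_COUNTS["UNEXPECTED_TYPE"] += 1
        return {}
    return parsed
```

With a check like this, a caller that indexes the first tool message only after `is_tool_call()` returns True never touches an empty list.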

Changes:

- Update DEFAULT_MAX_CONCURRENCY from 3 to 10
- Make API_PORT configurable via environment variable
- Add debug logging for tool calls in interface_agent
- Improve halt signal handling in orchestrator
- Remove .env.example file
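One common way to make a port configurable through the environment is sketched below. The commit message does not name the environment variable or the default port, so the variable name `API_PORT`, the default of 8000, and the `get_api_port` helper are all assumptions made for illustration.

```python
import os
from typing import Mapping

# Hypothetical default; the actual default port is not stated in the PR.
DEFAULT_API_PORT = 8000


def get_api_port(environ: Mapping[str, str] = os.environ) -> int:
    """Read the port from the (assumed) API_PORT environment variable."""
    raw = environ.get("API_PORT")
    if raw is None:
        return DEFAULT_API_PORT
    port = int(raw)  # raises ValueError if the value is not an integer
    if not 0 < port < 65536:
        raise ValueError(f"API_PORT out of range: {port}")
    return port
```

Validating the value at startup surfaces a mistyped port immediately rather than when the server tries to bind.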

This update ensures tool call errors are automatically saved to timestamped files on program exit, preventing overwrites between different evaluation runs.

Changes:

- Automatic error logging on exit using atexit handler
- Timestamped filenames: error_call_analysis_{timestamp}.txt
- Only saves if errors occurred (avoids empty files)
- Add example script demonstrating new error tracking features
- Functions: save_tool_call_error_analysis(), get_tool_call_error_stats(), reset_tool_call_error_tracking()

Prevents data loss by preserving error logs from each run with unique timestamps.
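The save-on-exit behaviour described above could look something like the sketch below. The function names and the filename pattern come from the commit message; the signatures, the timestamp format, the file layout, and the use of a `Counter` are assumptions, not the PR's actual implementation.

```python
import atexit
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Optional

# Name from the commit message; the Counter type is an assumption.
TOOL_CALL_ERROR_COUNTS: Counter = Counter()


def get_tool_call_error_stats() -> dict:
    """Return a copy of the current error counts."""
    return dict(TOOL_CALL_ERROR_COUNTS)


def reset_tool_call_error_tracking() -> None:
    """Clear all recorded error counts."""
    TOOL_CALL_ERROR_COUNTS.clear()


def save_tool_call_error_analysis(directory: str = ".") -> Optional[Path]:
    """Write the counts to a timestamped file; skip if nothing was recorded."""
    if not TOOL_CALL_ERROR_COUNTS:
        return None  # avoid creating empty files
    # Assumed timestamp format; the PR only says the name is timestamped.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"error_call_analysis_{timestamp}.txt"
    lines = [
        f"{name}: {count}"
        for name, count in sorted(TOOL_CALL_ERROR_COUNTS.items())
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Run the save automatically when the interpreter exits normally.
atexit.register(save_tool_call_error_analysis)
```

Note that `atexit` handlers run on normal interpreter shutdown only; they are skipped if the process is killed by an unhandled signal or calls `os._exit()`.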

This commit includes two enhancements:

1. Add total finished conversations metric to agent metrics display
   - Added total_finished field to AgentMetrics model
   - Display shows count of completed task runs at top of metrics panel
   - Provides visibility into evaluation progress and completion

2. Add ast.literal_eval fallback for Python dict syntax in tool calls
   - Handles LLM responses using single quotes (Python dict) instead of JSON
   - Fallback parsing when json.loads fails on single-quoted arguments
   - Prevents validation errors for arguments like {'id': 'L1001'}
   - Adds recovery logging for successful literal_eval parsing

Files modified:

- src/tau2/metrics/agent_metrics.py: Add total_finished field and calculation
- src/tau2/utils/display.py: Display total finished conversations
- src/tau2/utils/llm_utils.py: Add ast.literal_eval fallback for tool args
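The literal_eval fallback in item 2 can be illustrated with a short sketch. The helper name `parse_arguments_with_fallback` is hypothetical, and the empty-dict return on failure is an assumption carried over from the earlier commit's description; the real code in llm_utils.py may differ.

```python
import ast
import json


def parse_arguments_with_fallback(raw: str) -> dict:
    """Hypothetical helper: try JSON first, then Python literal syntax."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Handles single-quoted Python dicts such as {'id': 'L1001'}.
            # literal_eval evaluates literals only, never arbitrary code.
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
            return {}
    # Anything other than a dict is treated as a parsing failure.
    return parsed if isinstance(parsed, dict) else {}
```

`ast.literal_eval` is a reasonable choice here because, unlike `eval`, it accepts only Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None).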

victorb-sierra · 2025-11-24T01:32:23Z

Before we can review, please make sure to follow https://github.com/sierra-research/tau2-bench/blob/main/CONTRIBUTING.md

Song Wang added 2 commits November 17, 2025 15:35

Update configuration and add debug logging

a873b60


songwangnlp requested a review from victorb-sierra as a code owner November 17, 2025 23:51

Song Wang added 2 commits November 17, 2025 16:54

songwangnlp wants to merge 4 commits into sierra-research:main from songwangnlp:feature/tool-call-error-tracking-and-config
