Feature/tool call error tracking and config#85

Open
songwangnlp wants to merge 4 commits into sierra-research:main from songwangnlp:feature/tool-call-error-tracking-and-config
Conversation

@songwangnlp

No description provided.

Song Wang added 2 commits November 17, 2025 15:35
This commit adds robust error tracking for tool call parsing and fixes
a critical bug where empty tool_calls lists caused IndexError crashes
in the orchestrator.

Changes:

1. Fixed is_tool_call() method (message.py)
   - Now checks for non-empty list, not just non-None
   - Prevents treating empty tool_calls as valid tool calls
   - Fixes "list index out of range" error in orchestrator

2. Added tool call error tracking system (llm_utils.py)
   - Global counters: TOOL_CALL_ERROR_COUNTS, TOOL_CALL_ERROR_DETAILS
   - Tracks 6 error types:
     * JSON_DECODE_ERROR: Initial JSON parsing failures
     * UNEXPECTED_PARSE_ERROR: Unexpected exceptions during parsing
     * DOUBLE_ENCODED_STRING: Arguments are double-encoded strings
     * DOUBLE_DECODE_FAILURE: Second decode attempt fails
     * NULL_ARGUMENTS: Arguments are null/None
     * UNEXPECTED_TYPE: Arguments have unexpected type
   - Enhanced error logging with detailed context (tool name, raw args)
   - Converts empty tool_calls list to None to maintain consistency

3. Added utility functions for error analysis (llm_utils.py)
   - save_tool_call_error_analysis(): Export error stats to file
   - get_tool_call_error_stats(): Get current error counts
   - reset_tool_call_error_tracking(): Clear error counters

4. Improved error handling
   - Tool calls with parsing errors now use empty dict instead of crashing
   - Preserves backward compatibility
   - All errors are logged and tracked for analysis

Bug Fix: The orchestrator was crashing with "list index out of range"
when tool_calls was an empty list. This occurred because is_tool_call()
returned True for empty lists, causing the orchestrator to attempt
accessing tool_msgs[0] on an empty list.
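The fix can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the `is_tool_call()` change; the actual `Message` class in tau2 has more fields and may differ in detail.

```python
# Hypothetical sketch of the is_tool_call() fix; the real Message
# class in src/tau2 may differ.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    content: Optional[str] = None
    tool_calls: Optional[list] = None

    def is_tool_call(self) -> bool:
        # Before the fix: `return self.tool_calls is not None`, which
        # treated an empty list as a valid tool call and caused the
        # orchestrator to hit IndexError on tool_msgs[0].
        # After: require a non-empty list.
        return bool(self.tool_calls)


assert not Message(tool_calls=[]).is_tool_call()    # empty list is not a tool call
assert not Message(tool_calls=None).is_tool_call()  # None is not a tool call
assert Message(tool_calls=[{"name": "get_user"}]).is_tool_call()
```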

Benefits:
- Eliminates crashes from empty tool calls
- Provides visibility into tool call parsing issues
- Enables analysis of model output quality
- Maintains backward compatibility with existing code
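The counters and helper functions above can be sketched as follows. Names mirror the commit message, but this is an assumed shape, not the exact llm_utils.py implementation.

```python
# Sketch of the global error-tracking counters described above;
# record_tool_call_error() is a hypothetical helper name.
from collections import Counter, defaultdict

TOOL_CALL_ERROR_COUNTS = Counter()
TOOL_CALL_ERROR_DETAILS = defaultdict(list)


def record_tool_call_error(error_type: str, tool_name: str, raw_args: str) -> None:
    """Count the error and keep enough context (tool name, raw args) to debug it."""
    TOOL_CALL_ERROR_COUNTS[error_type] += 1
    TOOL_CALL_ERROR_DETAILS[error_type].append(
        {"tool_name": tool_name, "raw_args": raw_args}
    )


def get_tool_call_error_stats() -> dict:
    """Return a snapshot of current error counts."""
    return dict(TOOL_CALL_ERROR_COUNTS)


def reset_tool_call_error_tracking() -> None:
    """Clear all counters and details between runs."""
    TOOL_CALL_ERROR_COUNTS.clear()
    TOOL_CALL_ERROR_DETAILS.clear()


record_tool_call_error("JSON_DECODE_ERROR", "get_user", "{'id': 'L1001'}")
assert get_tool_call_error_stats()["JSON_DECODE_ERROR"] == 1
```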

Changes:
- Update DEFAULT_MAX_CONCURRENCY from 3 to 10
- Make API_PORT configurable via environment variable
- Add debug logging for tool calls in interface_agent
- Improve halt signal handling in orchestrator
- Remove .env.example file
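Making a setting configurable via environment variable typically looks like the sketch below; the helper name and default port value here are assumptions for illustration, not the exact tau2 code.

```python
# Sketch of reading API_PORT from the environment with a fallback;
# the default value 8000 is a placeholder, not tau2's actual default.
import os


def get_api_port(default: int = 8000) -> int:
    """Read API_PORT from the environment, falling back to a default."""
    return int(os.environ.get("API_PORT", default))


os.environ["API_PORT"] = "9001"
assert get_api_port() == 9001

del os.environ["API_PORT"]
assert get_api_port() == 8000
```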
Song Wang added 2 commits November 17, 2025 16:54
This update ensures tool call errors are automatically saved to timestamped files
on program exit, preventing overwrites between different evaluation runs.

Changes:
- Automatic error logging on exit using atexit handler
- Timestamped filenames: error_call_analysis_{timestamp}.txt
- Only saves if errors occurred (avoids empty files)
- Add example script demonstrating new error tracking features
- Functions: save_tool_call_error_analysis(), get_tool_call_error_stats(), reset_tool_call_error_tracking()

Prevents data loss by preserving error logs from each run with unique timestamps.

This commit includes two enhancements:

1. Add total finished conversations metric to agent metrics display
   - Added total_finished field to AgentMetrics model
   - Display shows count of completed task runs at top of metrics panel
   - Provides visibility into evaluation progress and completion

2. Add ast.literal_eval fallback for Python dict syntax in tool calls
   - Handles LLM responses using single quotes (Python dict) instead of JSON
   - Fallback parsing when json.loads fails on single-quoted arguments
   - Prevents validation errors for arguments like {'id': 'L1001'}
   - Adds recovery logging for successful literal_eval parsing
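The fallback described in item 2 can be sketched as below. The function name is an assumption for illustration; the actual parsing code lives in src/tau2/utils/llm_utils.py.

```python
# Sketch of the json.loads -> ast.literal_eval fallback described
# above; parse_tool_args is a hypothetical name, not the tau2 API.
import ast
import json


def parse_tool_args(raw: str) -> dict:
    """Parse tool-call arguments, tolerating Python dict syntax."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Some LLM responses emit single-quoted Python dicts such as
        # {'id': 'L1001'}; ast.literal_eval parses these safely
        # without executing arbitrary code.
        parsed = ast.literal_eval(raw)
        if not isinstance(parsed, dict):
            raise
        return parsed


assert parse_tool_args('{"id": "L1001"}') == {"id": "L1001"}   # valid JSON
assert parse_tool_args("{'id': 'L1001'}") == {"id": "L1001"}   # Python dict syntax
```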

Files modified:
- src/tau2/metrics/agent_metrics.py: Add total_finished field and calculation
- src/tau2/utils/display.py: Display total finished conversations
- src/tau2/utils/llm_utils.py: Add ast.literal_eval fallback for tool args
@victorb-sierra
Collaborator

Before we can review, please make sure to follow https://github.com/sierra-research/tau2-bench/blob/main/CONTRIBUTING.md

