Feature/tool call error tracking and config #85
Open
songwangnlp wants to merge 4 commits into sierra-research:main from
added 2 commits
November 17, 2025 15:35
This commit adds robust error tracking for tool call parsing and fixes
a critical bug where empty tool_calls lists caused IndexError crashes
in the orchestrator.
Changes:
1. Fixed is_tool_call() method (message.py)
- Now checks for non-empty list, not just non-None
- Prevents treating empty tool_calls as valid tool calls
- Fixes "list index out of range" error in orchestrator
2. Added tool call error tracking system (llm_utils.py)
- Global counters: TOOL_CALL_ERROR_COUNTS, TOOL_CALL_ERROR_DETAILS
- Tracks 6 error types:
* JSON_DECODE_ERROR: Initial JSON parsing failures
* UNEXPECTED_PARSE_ERROR: Unexpected exceptions during parsing
* DOUBLE_ENCODED_STRING: Arguments are double-encoded strings
* DOUBLE_DECODE_FAILURE: Second decode attempt fails
* NULL_ARGUMENTS: Arguments are null/None
* UNEXPECTED_TYPE: Arguments have unexpected type
- Enhanced error logging with detailed context (tool name, raw args)
- Converts empty tool_calls list to None to maintain consistency
3. Added utility functions for error analysis (llm_utils.py)
- save_tool_call_error_analysis(): Export error stats to file
- get_tool_call_error_stats(): Get current error counts
- reset_tool_call_error_tracking(): Clear error counters
4. Improved error handling
- Tool calls with parsing errors now use empty dict instead of crashing
- Preserves backward compatibility
- All errors are logged and tracked for analysis
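The tracking scheme in items 2 and 4 could look roughly like the sketch below. It is not the code from llm_utils.py: the helper names `_record` and `parse_tool_arguments` are hypothetical, and the order in which the six error types are checked is an assumption. Only the counter names and error-type labels come from the commit message.

```python
import json
from collections import Counter
from typing import Any

# Stand-ins for the global counters named in the commit message.
TOOL_CALL_ERROR_COUNTS: Counter = Counter()
TOOL_CALL_ERROR_DETAILS: list[dict[str, Any]] = []


def _record(error_type: str, tool_name: str, raw_args: Any) -> None:
    """Hypothetical helper: count the error and keep context for later analysis."""
    TOOL_CALL_ERROR_COUNTS[error_type] += 1
    TOOL_CALL_ERROR_DETAILS.append(
        {"type": error_type, "tool": tool_name, "raw": repr(raw_args)}
    )


def parse_tool_arguments(tool_name: str, raw_args: Any) -> dict:
    """Hypothetical parser: return a dict, falling back to {} instead of raising."""
    if raw_args is None:
        _record("NULL_ARGUMENTS", tool_name, raw_args)
        return {}
    if isinstance(raw_args, dict):
        return raw_args  # already parsed, nothing to do
    if not isinstance(raw_args, str):
        _record("UNEXPECTED_TYPE", tool_name, raw_args)
        return {}
    try:
        parsed = json.loads(raw_args)
    except json.JSONDecodeError:
        _record("JSON_DECODE_ERROR", tool_name, raw_args)
        return {}
    except Exception:
        _record("UNEXPECTED_PARSE_ERROR", tool_name, raw_args)
        return {}
    if isinstance(parsed, str):
        # The arguments were JSON-encoded twice; try one more decode.
        _record("DOUBLE_ENCODED_STRING", tool_name, raw_args)
        try:
            parsed = json.loads(parsed)
        except json.JSONDecodeError:
            _record("DOUBLE_DECODE_FAILURE", tool_name, raw_args)
            return {}
    if parsed is None:
        _record("NULL_ARGUMENTS", tool_name, raw_args)
        return {}
    if not isinstance(parsed, dict):
        _record("UNEXPECTED_TYPE", tool_name, raw_args)
        return {}
    return parsed
```

In this sketch a double-encoded string that decodes successfully is still counted, so the statistics reflect how often the model produced it even when recovery worked.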
Bug Fix: The orchestrator was crashing with "list index out of range"
when tool_calls was an empty list. This occurred because is_tool_call()
returned True for empty lists, causing the orchestrator to attempt
accessing tool_msgs[0] on an empty list.
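A minimal sketch of the bug and the fix, assuming a simplified `Message` class (the real class in message.py has more fields, and `is_tool_call_buggy` is a hypothetical name used only to show the old behavior side by side):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Simplified stand-in for the message class touched by this commit."""

    content: Optional[str] = None
    tool_calls: Optional[list] = None

    def is_tool_call_buggy(self) -> bool:
        # Old behavior as described: an empty list is not None, so this is True.
        return self.tool_calls is not None

    def is_tool_call(self) -> bool:
        # Fixed behavior: both None and [] count as "no tool call".
        return bool(self.tool_calls)


empty = Message(tool_calls=[])
print(empty.is_tool_call_buggy())  # True: caller would go on to index tool_msgs[0]
print(empty.is_tool_call())        # False: caller skips the tool-call branch
```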
Benefits:
- Eliminates crashes from empty tool calls
- Provides visibility into tool call parsing issues
- Enables analysis of model output quality
- Maintains backward compatibility with existing code
Changes:
- Update DEFAULT_MAX_CONCURRENCY from 3 to 10
- Make API_PORT configurable via environment variable
- Add debug logging for tool calls in interface_agent
- Improve halt signal handling in orchestrator
- Remove .env.example file
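One way to read API_PORT from the environment is sketched below. The default port 8000, the helper name `get_api_port`, and the validation rules are assumptions for illustration; the commit message only states that the port becomes configurable and that DEFAULT_MAX_CONCURRENCY is now 10.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_CONCURRENCY = 10  # value stated in the commit message
DEFAULT_API_PORT = 8000       # assumed default; the commit does not state one


def get_api_port(environ: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical helper: read API_PORT, falling back to the default."""
    env = os.environ if environ is None else environ
    raw = env.get("API_PORT")
    if raw is None or raw.strip() == "":
        return DEFAULT_API_PORT
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"API_PORT must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"API_PORT out of range: {port}")
    return port
```

Accepting an optional mapping keeps the helper testable without mutating the real process environment.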
added 2 commits
November 17, 2025 16:54
This update ensures tool call errors are automatically saved to timestamped files
on program exit, preventing overwrites between different evaluation runs.
Changes:
- Automatic error logging on exit using atexit handler
- Timestamped filenames: error_call_analysis_{timestamp}.txt
- Only saves if errors occurred (avoids empty files)
- Add example script demonstrating new error tracking features
- Functions: save_tool_call_error_analysis(), get_tool_call_error_stats(), reset_tool_call_error_tracking()
Prevents data loss by preserving error logs from each run with unique timestamps.
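The save-on-exit behavior could be wired up as in the following sketch. The function name and the filename pattern come from the commit message; the `directory` parameter, the timestamp format, and the file contents are assumptions, and the counter here is a local stand-in for the one in llm_utils.py.

```python
import atexit
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Optional

# Stand-in for the global counter described in the earlier commit.
TOOL_CALL_ERROR_COUNTS: Counter = Counter()


def save_tool_call_error_analysis(directory: str = ".") -> Optional[Path]:
    """Write error counts to a timestamped file; skip the file if nothing failed."""
    if sum(TOOL_CALL_ERROR_COUNTS.values()) == 0:
        return None  # avoids creating empty files
    # Assumed timestamp format; it only needs to differ between runs.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"error_call_analysis_{timestamp}.txt"
    lines = [
        f"{name}: {count}"
        for name, count in sorted(TOOL_CALL_ERROR_COUNTS.items())
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Run the save once when the interpreter shuts down normally.
atexit.register(save_tool_call_error_analysis)
```

Note that atexit handlers do not run if the process is killed by an unhandled signal or exits through os._exit(), so a run that ends that way would still lose its error log.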
This commit includes two enhancements:
1. Add total finished conversations metric to agent metrics display
- Added total_finished field to AgentMetrics model
- Display shows count of completed task runs at top of metrics panel
- Provides visibility into evaluation progress and completion
2. Add ast.literal_eval fallback for Python dict syntax in tool calls
- Handles LLM responses using single quotes (Python dict) instead of JSON
- Fallback parsing when json.loads fails on single-quoted arguments
- Prevents validation errors for arguments like {'id': 'L1001'}
- Adds recovery logging for successful literal_eval parsing
Files modified:
- src/tau2/metrics/agent_metrics.py: Add total_finished field and calculation
- src/tau2/utils/display.py: Display total finished conversations
- src/tau2/utils/llm_utils.py: Add ast.literal_eval fallback for tool args
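The fallback in item 2 can be illustrated with the sketch below. The function name `parse_arguments_with_fallback` and the choice to return an empty dict on failure are assumptions; the actual change in llm_utils.py may be structured differently.

```python
import ast
import json
import logging

logger = logging.getLogger(__name__)


def parse_arguments_with_fallback(raw_args: str) -> dict:
    """Hypothetical helper: try JSON first, then Python literal syntax."""
    try:
        parsed = json.loads(raw_args)
    except json.JSONDecodeError:
        try:
            # literal_eval only accepts literals, so it cannot run arbitrary code.
            parsed = ast.literal_eval(raw_args)
        except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
            return {}
        logger.info("Recovered tool arguments via ast.literal_eval: %r", raw_args)
    # Anything other than a dict (list, string, number) is not usable as arguments.
    return parsed if isinstance(parsed, dict) else {}


print(parse_arguments_with_fallback("{'id': 'L1001'}"))  # {'id': 'L1001'}
```

JSON is tried first because ast.literal_eval rejects the JSON literals true, false, and null, so reversing the order would break valid JSON arguments that contain them.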
Collaborator
Before we can review, please make sure to follow https://github.com/sierra-research/tau2-bench/blob/main/CONTRIBUTING.md
No description provided.