
Tau2 partial - introduce partial credit scoring for tool calling, enabling Process Reward Models (PRM) over traditional Outcome Reward Models (ORM)#151

Open
sulbhajain wants to merge 3 commits into sierra-research:main from sulbhajain:tau2-partial-green
Conversation

@sulbhajain

Partial Rewards for Tool Calling: PRM vs ORM Support

Overview

This PR introduces partial credit scoring for tool-calling evaluation in tau2, enabling Process Reward Model (PRM) style evaluation in addition to traditional Outcome Reward Model (ORM) scoring. This enhancement provides the granular feedback signals essential for effective agent training, reinforcement learning, and reward model development.

Motivation

The Problem with Binary Evaluation (ORM)

Traditional benchmarks use Outcome Reward Models (ORM) that apply all-or-nothing scoring:

  • ✅ Perfect execution → Score: 1.0
  • ❌ Any deviation → Score: 0.0

This binary approach creates significant limitations:

Example Scenario:

Task: Book a flight from NYC to LAX on March 15th
Agent A: Calls search_flights(origin="NYC", destination="LAX", date="March 15")
Agent B: Calls search_flights(origin="NYC", destination="LAX") // missing date
Agent C: Calls get_weather(location="NYC")

Traditional ORM scoring: All three agents → 0.0 (indistinguishable)

Why Partial Credit Matters (PRM)

Process Reward Models (PRM) evaluate intermediate steps, providing:

  1. Meaningful Learning Signals: Agents receive feedback on what they got right, even when imperfect
  2. Gradient for Optimization: Creates smoother optimization landscape for RL and fine-tuning
  3. Realistic Performance Assessment: Reflects real-world value where partial success often helps users
  4. Accelerated Training: Models learn incrementally rather than through trial-and-error

Same scenario with PRM:

Agent A: Score 1.0 (perfect execution)
Agent B: Score 0.5 (correct function, incomplete parameters)  
Agent C: Score 0.0 (wrong function entirely)

Clear performance differentiation enables targeted improvements

Implementation

Scoring Framework

The partial credit system uses a three-tier scoring approach:

| Score | Criteria | Example |
|-------|----------|---------|
| 1.0 | Exact function name + all required parameters correct | `search_flights(origin="NYC", destination="LAX", date="2024-03-15")` |
| 0.5 | Correct function name + incomplete/incorrect parameters | `search_flights(origin="NYC", destination="LAX")` |
| 0.0 | Wrong function or completely irrelevant call | `get_weather(location="NYC")` |
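
A minimal sketch of how the three tiers apply to the earlier flight-booking example. The `score_call` helper here is illustrative only, not part of the tau2 API:

```python
# Hypothetical helper showing the three-tier assignment; not the tau2 implementation.
def score_call(function, params, expected_function, expected_params):
    if function != expected_function:
        return 0.0  # wrong function entirely
    if params == expected_params:
        return 1.0  # exact match: function + all parameters
    return 0.5      # correct function, incomplete/incorrect parameters

expected = {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"}

agent_a = score_call("search_flights", expected, "search_flights", expected)
agent_b = score_call("search_flights",
                     {"origin": "NYC", "destination": "LAX"},  # missing date
                     "search_flights", expected)
agent_c = score_call("get_weather", {"location": "NYC"},
                     "search_flights", expected)

print(agent_a, agent_b, agent_c)  # 1.0 0.5 0.0
```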

Usage

Enable partial credit scoring with the allow_partial flag:

# Test case configuration
test_case = {
    "id": 50,
    "task": "Search for flights from NYC to LAX",
    "allow_partial": True,  # Enable partial credit
    "expected_tools": [
        {
            "function": "search_flights",
            "parameters": {
                "origin": "NYC",
                "destination": "LAX",
                "date": "2024-03-15"
            }
        }
    ]
}
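
Under this configuration, an evaluator would behave roughly as sketched below. The `evaluate_call` name is hypothetical and the real tau2 entry points may differ; it only illustrates how `allow_partial` toggles binary vs. partial scoring:

```python
def evaluate_call(test_case, predicted_function, predicted_params):
    """Score one predicted tool call against the test case's first expected tool.

    Hypothetical sketch; the actual tau2 evaluator API may differ.
    """
    expected = test_case["expected_tools"][0]
    if predicted_function != expected["function"]:
        return 0.0
    if predicted_params == expected["parameters"]:
        return 1.0
    # Partial credit only when the test case opts in; otherwise binary.
    return 0.5 if test_case.get("allow_partial") else 0.0

test_case = {
    "id": 50,
    "task": "Search for flights from NYC to LAX",
    "allow_partial": True,
    "expected_tools": [{
        "function": "search_flights",
        "parameters": {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"},
    }],
}

print(evaluate_call(test_case, "search_flights",
                    {"origin": "NYC", "destination": "LAX"}))  # 0.5
```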

Test Coverage

New test cases added in the airline domain:

  • Test Cases 50-52: Comprehensive partial credit scenarios with allow_partial: true
  • Coverage includes: missing parameters, incorrect types, parameter order variations

Impact on Benchmarking

Score Differences: ORM vs PRM

| Scenario | Traditional ORM | Our PRM | Delta |
|----------|-----------------|---------|-------|
| Perfect execution | 1.0 | 1.0 | 0.0 |
| Right function, 1 missing param | 0.0 | 0.5 | +0.5 |
| Right function, wrong param type | 0.0 | 0.5 | +0.5 |
| Wrong function | 0.0 | 0.0 | 0.0 |
| Average improvement | - | - | ~0.25-0.35 |

Key Improvements

  1. Higher Overall Scores: Reflects actual agent capability rather than penalizing minor mistakes
  2. Granular Differentiation: Reveals meaningful performance gaps between agents
  3. Actionable Insights: Distinguishes function selection errors from parameter extraction errors
  4. Training Efficiency: Provides richer gradient signals for RLHF and fine-tuning

Real-World Alignment

Partial credit better reflects production scenarios where:

  • Users benefit from partial task completion
  • Incremental progress has value (e.g., finding flights with flexible dates)
  • Perfect execution on first try is unrealistic
  • Agent recovery from partial success is possible

Use Cases

1. Reinforcement Learning with Human Feedback (RLHF)

# Reward model training benefits from partial credit
reward = calculate_partial_score(agent_action, ground_truth)
# 0.5 reward signals "right direction but incomplete"
# Enables policy gradient optimization

2. Process Reward Model Training

# PRM evaluates each step in multi-step tasks
for step in agent_trajectory:
    step_reward = evaluate_with_partial_credit(step)
    # Accumulate granular feedback for each decision point

3. Agent Debugging & Analysis

# Identify specific failure modes
if score == 0.5:
    # Agent understands intent but struggles with parameters
    # → Focus training on parameter extraction
elif score == 0.0:
    # Agent misunderstands task
    # → Focus on function selection and task interpretation


Migration Guide

For Existing Benchmarks

# Before (ORM - binary scoring)
result = evaluate_agent(task, strict_mode=True)
# Returns: 0.0 or 1.0

# After (PRM - partial credit)
result = evaluate_agent(task, allow_partial=True)
# Returns: 0.0, 0.5, or 1.0

Backward Compatibility

  • Default behavior remains ORM (binary) for backward compatibility
  • Explicit allow_partial=True required to enable PRM
  • All existing test cases continue to work without modification

Technical Details

Evaluation Logic

def calculate_partial_score(predicted, expected):
    """
    Calculate partial credit score for tool call
    
    Returns:
        1.0: Exact match (function + all params)
        0.5: Function match, params incomplete/incorrect
        0.0: Function mismatch or completely wrong
    """
    if predicted.function != expected.function:
        return 0.0
    
    if params_exact_match(predicted.params, expected.params):
        return 1.0
    
    if params_partial_match(predicted.params, expected.params):
        return 0.5
    
    return 0.0
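
The `params_exact_match` and `params_partial_match` predicates are not shown in this PR description; one plausible sketch (an assumption, not the actual implementation) is:

```python
# Hypothetical parameter-matching predicates for calculate_partial_score.
def params_exact_match(predicted, expected):
    """Every expected parameter is present with exactly the expected value."""
    return all(predicted.get(k) == v for k, v in expected.items())

def params_partial_match(predicted, expected):
    """At least one expected parameter is matched correctly."""
    return any(predicted.get(k) == v for k, v in expected.items())

expected = {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"}
print(params_exact_match({"origin": "NYC", "destination": "LAX"}, expected))   # False
print(params_partial_match({"origin": "NYC", "destination": "LAX"}, expected))  # True
```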

Aggregation for Multi-Tool Tasks

For tasks requiring multiple tool calls:

# Average partial scores across all required tools
total_score = sum(scores) / len(expected_tools)

# Example: 4 tools required, agent gets [1.0, 0.5, 0.5, 0.0]
# Final score: 2.0 / 4 = 0.5
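
The averaging step above can be written as a small self-contained helper (the function name is illustrative, not the tau2 API):

```python
def aggregate_tool_scores(scores):
    """Average per-tool partial scores into a single task-level score."""
    return sum(scores) / len(scores) if scores else 0.0

# 4 tools required, agent scores [1.0, 0.5, 0.5, 0.0] -> 2.0 / 4 = 0.5
print(aggregate_tool_scores([1.0, 0.5, 0.5, 0.0]))  # 0.5
```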

Future Work

  • Fine-grained partial credit: 0.25 increments for parameter subsets
  • Weighted parameters: Critical params worth more than optional ones
  • Temporal credit: Earlier correct attempts weighted higher
  • Multi-step PRM: Track credit across conversation turns

Contributing

We welcome contributions to expand partial credit evaluation:

  • Additional scoring granularity
  • Domain-specific credit rules
  • New test cases with edge cases
  • Integration with other benchmarks

@sulbhajain changed the title from "Tau2 partial green" to "Tau2 partial - introduce partial credit scoring for tool calling, enabling Process Reward Models (PRM) over traditional Outcome Reward Models (ORM)" on Feb 3, 2026