
Tau2 partial - introduce partial credit scoring for tool calling, enabling Process Reward Models (PRM) over traditional Outcome Reward Models (ORM)#151

Open
sulbhajain wants to merge 3 commits into sierra-research:main from sulbhajain:tau2-partial-green
Conversation

@sulbhajain

Partial Rewards for Tool Calling: PRM vs ORM Support

Overview

This PR introduces partial credit scoring for tool-calling evaluation in tau2, enabling Process Reward Model (PRM) style evaluation in addition to traditional Outcome Reward Model (ORM) scoring. This enhancement provides the granular feedback signals essential for effective agent training, reinforcement learning, and reward model development.

Motivation

The Problem with Binary Evaluation (ORM)

Traditional benchmarks use Outcome Reward Models (ORM) that apply all-or-nothing scoring:

  • ✅ Perfect execution → Score: 1.0
  • ❌ Any deviation → Score: 0.0

This binary approach creates significant limitations:

Example Scenario:

Task: Book a flight from NYC to LAX on March 15th
Agent A: Calls search_flights(origin="NYC", destination="LAX", date="March 15")
Agent B: Calls search_flights(origin="NYC", destination="LAX") // missing date
Agent C: Calls get_weather(location="NYC")

Traditional ORM scoring: All three agents → 0.0 (indistinguishable)

Why Partial Credit Matters (PRM)

Process Reward Models (PRM) evaluate intermediate steps, providing:

  1. Meaningful Learning Signals: Agents receive feedback on what they got right, even when imperfect
  2. Gradient for Optimization: Creates smoother optimization landscape for RL and fine-tuning
  3. Realistic Performance Assessment: Reflects real-world value where partial success often helps users
  4. Accelerated Training: Models learn incrementally rather than through trial-and-error

Same scenario with PRM:

Agent A: Score 1.0 (perfect execution)
Agent B: Score 0.5 (correct function, incomplete parameters)  
Agent C: Score 0.0 (wrong function entirely)

Clear performance differentiation enables targeted improvements

Implementation

Scoring Framework

The partial credit system uses a three-tier scoring approach:

| Score | Criteria | Example |
|-------|----------|---------|
| 1.0 | Exact function name + all required parameters correct | `search_flights(origin="NYC", destination="LAX", date="2024-03-15")` |
| 0.5 | Correct function name + incomplete/incorrect parameters | `search_flights(origin="NYC", destination="LAX")` |
| 0.0 | Wrong function or completely irrelevant call | `get_weather(location="NYC")` |
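
A minimal sketch of how the three tiers apply to the earlier flight-booking example. The `score_call` helper here is illustrative only, not part of the tau2 API:

```python
# Hypothetical helper showing the three-tier assignment; not the tau2 implementation.
def score_call(function, params, expected_function, expected_params):
    if function != expected_function:
        return 0.0  # wrong function entirely
    if params == expected_params:
        return 1.0  # exact match: function + all parameters
    return 0.5      # correct function, incomplete/incorrect parameters

expected = {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"}

agent_a = score_call("search_flights", expected, "search_flights", expected)
agent_b = score_call("search_flights",
                     {"origin": "NYC", "destination": "LAX"},  # missing date
                     "search_flights", expected)
agent_c = score_call("get_weather", {"location": "NYC"},
                     "search_flights", expected)

print(agent_a, agent_b, agent_c)  # 1.0 0.5 0.0
```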

Usage

Enable partial credit scoring with the allow_partial flag:

# Test case configuration
test_case = {
    "id": 50,
    "task": "Search for flights from NYC to LAX",
    "allow_partial": True,  # Enable partial credit
    "expected_tools": [
        {
            "function": "search_flights",
            "parameters": {
                "origin": "NYC",
                "destination": "LAX",
                "date": "2024-03-15"
            }
        }
    ]
}
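
Under this configuration, an evaluator would behave roughly as sketched below. The `evaluate_call` name is hypothetical and the real tau2 entry points may differ; it only illustrates how `allow_partial` toggles binary vs. partial scoring:

```python
def evaluate_call(test_case, predicted_function, predicted_params):
    """Score one predicted tool call against the test case's first expected tool.

    Hypothetical sketch; the actual tau2 evaluator API may differ.
    """
    expected = test_case["expected_tools"][0]
    if predicted_function != expected["function"]:
        return 0.0
    if predicted_params == expected["parameters"]:
        return 1.0
    # Partial credit only when the test case opts in; otherwise binary.
    return 0.5 if test_case.get("allow_partial") else 0.0

test_case = {
    "id": 50,
    "task": "Search for flights from NYC to LAX",
    "allow_partial": True,
    "expected_tools": [{
        "function": "search_flights",
        "parameters": {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"},
    }],
}

print(evaluate_call(test_case, "search_flights",
                    {"origin": "NYC", "destination": "LAX"}))  # 0.5
```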

Test Coverage

New test cases added in the airline domain:

  • Test Cases 50-52: Comprehensive partial credit scenarios with allow_partial: true
  • Coverage includes: missing parameters, incorrect types, parameter order variations

Impact on Benchmarking

Score Differences: ORM vs PRM

| Scenario | Traditional ORM | Our PRM | Delta |
|----------|-----------------|---------|-------|
| Perfect execution | 1.0 | 1.0 | 0.0 |
| Right function, 1 missing param | 0.0 | 0.5 | +0.5 |
| Right function, wrong param type | 0.0 | 0.5 | +0.5 |
| Wrong function | 0.0 | 0.0 | 0.0 |
| Average improvement | - | - | ~0.25-0.35 |

Key Improvements

  1. Higher Overall Scores: Reflects actual agent capability rather than penalizing minor mistakes
  2. Granular Differentiation: Reveals meaningful performance gaps between agents
  3. Actionable Insights: Distinguishes function selection errors from parameter extraction errors
  4. Training Efficiency: Provides richer gradient signals for RLHF and fine-tuning

Real-World Alignment

Partial credit better reflects production scenarios where:

  • Users benefit from partial task completion
  • Incremental progress has value (e.g., finding flights with flexible dates)
  • Perfect execution on first try is unrealistic
  • Agent recovery from partial success is possible

Use Cases

1. Reinforcement Learning with Human Feedback (RLHF)

# Reward model training benefits from partial credit
reward = calculate_partial_score(agent_action, ground_truth)
# 0.5 reward signals "right direction but incomplete"
# Enables policy gradient optimization

2. Process Reward Model Training

# PRM evaluates each step in multi-step tasks
for step in agent_trajectory:
    step_reward = evaluate_with_partial_credit(step)
    # Accumulate granular feedback for each decision point

3. Agent Debugging & Analysis

# Identify specific failure modes
if score == 0.5:
    # Agent understands intent but struggles with parameters
    # → Focus training on parameter extraction
elif score == 0.0:
    # Agent misunderstands task
    # → Focus on function selection and task interpretation


Migration Guide

For Existing Benchmarks

# Before (ORM - binary scoring)
result = evaluate_agent(task, strict_mode=True)
# Returns: 0.0 or 1.0

# After (PRM - partial credit)
result = evaluate_agent(task, allow_partial=True)
# Returns: 0.0, 0.5, or 1.0

Backward Compatibility

  • Default behavior remains ORM (binary) for backward compatibility
  • Explicit allow_partial=True required to enable PRM
  • All existing test cases continue to work without modification

Technical Details

Evaluation Logic

def calculate_partial_score(predicted, expected):
    """
    Calculate partial credit score for tool call
    
    Returns:
        1.0: Exact match (function + all params)
        0.5: Function match, params incomplete/incorrect
        0.0: Function mismatch or completely wrong
    """
    if predicted.function != expected.function:
        return 0.0
    
    if params_exact_match(predicted.params, expected.params):
        return 1.0
    
    if params_partial_match(predicted.params, expected.params):
        return 0.5
    
    return 0.0
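
The `params_exact_match` and `params_partial_match` predicates are not shown in this PR description; one plausible sketch (an assumption, not the actual implementation) is:

```python
# Hypothetical parameter-matching predicates for calculate_partial_score.
def params_exact_match(predicted, expected):
    """Every expected parameter is present with exactly the expected value."""
    return all(predicted.get(k) == v for k, v in expected.items())

def params_partial_match(predicted, expected):
    """At least one expected parameter is matched correctly."""
    return any(predicted.get(k) == v for k, v in expected.items())

expected = {"origin": "NYC", "destination": "LAX", "date": "2024-03-15"}
print(params_exact_match({"origin": "NYC", "destination": "LAX"}, expected))   # False
print(params_partial_match({"origin": "NYC", "destination": "LAX"}, expected))  # True
```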

Aggregation for Multi-Tool Tasks

For tasks requiring multiple tool calls:

# Average partial scores across all required tools
total_score = sum(scores) / len(expected_tools)

# Example: 4 tools required, agent gets [1.0, 0.5, 0.5, 0.0]
# Final score: 2.0 / 4 = 0.5
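
The averaging step above can be written as a small self-contained helper (the function name is illustrative, not the tau2 API):

```python
def aggregate_tool_scores(scores):
    """Average per-tool partial scores into a single task-level score."""
    return sum(scores) / len(scores) if scores else 0.0

# 4 tools required, agent scores [1.0, 0.5, 0.5, 0.0] -> 2.0 / 4 = 0.5
print(aggregate_tool_scores([1.0, 0.5, 0.5, 0.0]))  # 0.5
```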

Future Work

  • Fine-grained partial credit: 0.25 increments for parameter subsets
  • Weighted parameters: Critical params worth more than optional ones
  • Temporal credit: Earlier correct attempts weighted higher
  • Multi-step PRM: Track credit across conversation turns

Contributing

We welcome contributions to expand partial credit evaluation:

  • Additional scoring granularity
  • Domain-specific credit rules
  • New test cases with edge cases
  • Integration with other benchmarks

@sulbhajain changed the title from "Tau2 partial green" to "Tau2 partial - introduce partial credit scoring for tool calling, enabling Process Reward Models (PRM) over traditional Outcome Reward Models (ORM)" on Feb 3, 2026