feat(a2a): Add A2A protocol support for remote agent evaluation by wuTims · Pull Request #143 · sierra-research/tau2-bench

wuTims · 2026-01-15T11:30:17Z

Summary

This PR introduces Agent-to-Agent (A2A) protocol support to tau2-bench, enabling evaluation of remote agents via the A2A protocol. This allows benchmarking of agents deployed as services without requiring local code integration.

Key additions:

A2A protocol client with full JSON-RPC 2.0 support
A2AAgent implementation for remote agent evaluation
CLI integration with --agent-a2a-endpoint flag
Comprehensive test suite (103 new tests)

Closes #111
Note: The scope of this initial PR was reduced by removing the tau2-agent implementation. Since there is an existing implementation of "agentified-tau-bench", I thought that adding an a2a-agent in the CLI would be more impactful. The forked tau2-agent implementation using ADK can be added in experiments/ if needed. The full implementation is available at my fork: https://github.com/wuTims/tau2-bench-agent

Design Decisions

1. Protocol Translation Layer

The A2A protocol uses a different message format than tau2's internal representation. Rather than modifying core tau2 data models, we implemented a dedicated translation layer (src/tau2/a2a/translation.py) that:

Converts tau2 UserMessage/AssistantMessage to A2A MessageSendParams
Extracts tool calls from A2A responses (supports both JSON and markdown code blocks)
Formats system context with domain policy and available tools in the first message

2. Stateless Agent Design

A2AAgent maintains conversation state via A2A's context_id mechanism:

First message creates a new context on the remote agent
Subsequent messages reuse the same context_id
State is tracked in A2AAgentState (immutable, returned with each response)

3. Tool Description Injection

Since A2A agents don't receive tools via function calling, we inject tool descriptions as structured text in the system context:

<available_tools>
<tool name="get_user_info">
  <description>Get user information by user ID</description>
  <parameters>{"type": "object", "properties": {"user_id": {"type": "string"}}}</parameters>
</tool>
</available_tools>

4. Backward Compatibility

All existing functionality is preserved:

Default agent remains llm_agent
No changes to existing CLI workflows
A2A-specific arguments are only required when --agent a2a_agent

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         tau2 CLI                                     │
│  tau2 run --agent a2a_agent --agent-a2a-endpoint <url>              │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       A2AAgent                                       │
│  src/tau2/agent/a2a_agent.py                                        │
│  - Implements BaseAgent interface                                    │
│  - Manages conversation state via context_id                         │
│  - Delegates to A2AClient for protocol communication                 │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       A2AClient                                      │
│  src/tau2/a2a/client.py                                             │
│  - HTTP client with retry logic                                      │
│  - JSON-RPC 2.0 request/response handling                            │
│  - Agent card discovery                                              │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Translation Layer                                 │
│  src/tau2/a2a/translation.py                                        │
│  - tau2 messages → A2A MessageSendParams                            │
│  - A2A responses → tau2 AssistantMessage                            │
│  - Tool call extraction (JSON + markdown)                            │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Remote A2A Agent                                  │
│  (External service implementing A2A protocol)                        │
└─────────────────────────────────────────────────────────────────────┘

New Files

File	Purpose
`src/tau2/a2a/__init__.py`	Package exports
`src/tau2/a2a/client.py`	A2A HTTP client with retry logic
`src/tau2/a2a/models.py`	Pydantic models for A2A config and state
`src/tau2/a2a/translation.py`	Message format translation
`src/tau2/a2a/exceptions.py`	A2A-specific exceptions
`src/tau2/a2a/metrics.py`	Performance metrics collection
`src/tau2/agent/a2a_agent.py`	BaseAgent implementation for A2A

Example Usage

Basic A2A Agent Evaluation

tau2 run \
  --domain telecom \
  --task-set-name telecom \
  --agent a2a_agent \
  --agent-a2a-endpoint "https://my-agent.example.com/a2a/agent" \
  --user-llm "gpt-4.1" \
  --num-tasks 5

With Authentication

tau2 run \
  --domain airline \
  --task-set-name airline \
  --agent a2a_agent \
  --agent-a2a-endpoint "https://my-agent.example.com/a2a/agent" \
  --agent-a2a-auth-token "Bearer my-secret-token" \
  --agent-a2a-timeout 600 \
  --num-tasks 10

Full Evaluation Script

#!/bin/bash
# evaluate_a2a_agent.sh

ENDPOINT="https://my-agent.example.com/a2a/agent"
DOMAINS=("telecom" "airline" "retail")

for domain in "${DOMAINS[@]}"; do
  echo "Evaluating $domain domain..."
  tau2 run \
    --domain "$domain" \
    --task-set-name "$domain" \
    --agent a2a_agent \
    --agent-a2a-endpoint "$ENDPOINT" \
    --user-llm "gpt-4.1" \
    --num-tasks 50 \
    --max-concurrency 5 \
    --save-to "a2a_${domain}_eval"
done

echo "View results:"
tau2 view

Test Coverage

Test Suite	Tests	Coverage
`test_a2a_client/test_a2a_agent.py`	21	Agent lifecycle, state management
`test_a2a_client/test_agent_discovery.py`	8	Agent card fetching, fallbacks
`test_a2a_client/test_debug_logging.py`	4	TRACE-level logging
`test_a2a_client/test_message_translation.py`	25	Message conversion, tool extraction
`test_a2a_client/test_metrics.py`	11	Latency/throughput tracking
`test_a2a_client/test_metrics_export.py`	16	Prometheus/JSON export
`test_a2a_client/test_performance.py`	10	Concurrent requests, timeouts
`test_backward_compatibility/test_cli.py`	23	CLI arg parsing, defaults
`test_backward_compatibility/test_llm_agent.py`	5	LLMAgent unchanged
`test_backward_compatibility/test_regression.py`	11	Import paths, registry
Total	103

Running Tests

# All A2A tests (mock-based, no network)
pytest tests/test_a2a_client -v -m a2a_mock

# Backward compatibility tests
pytest tests/test_backward_compatibility -v

# All tests
pytest tests/test_a2a_client tests/test_backward_compatibility -v

Verified Evaluation

Successfully tested against a deployed A2A agent:

Domain: telecom
Task: mobile_data_issue (airplane_mode + roaming)
Endpoint: https://simple-gemini-agent-*.run.app/a2a/simple_gemini_agent
Result: Reward 1.0 ✅

- Add A2AClient for HTTP communication with A2A agents - Add message translation between tau2 and A2A formats - Add protocol metrics collection for latency/token tracking - Add custom exceptions for A2A error handling - Support markdown code block extraction for tool calls - Include system context injection on first message

- Implement A2AAgent extending LocalAgent interface - Add async/sync bridge for HTTP operations - Support context persistence across conversation turns - Add protocol metrics export methods - Include from_cli_args factory for CLI integration

- Add --agent-a2a-endpoint, --agent-a2a-auth-token, --agent-a2a-timeout flags - Register a2a_agent in agent registry - Add A2AAgent construction in run_task() - Fix is_tool_call() to check for non-empty tool_calls list - Add httpx and pytest-asyncio dependencies

- Add mock-based A2A client tests (agent, discovery, translation) - Add protocol metrics and performance tests - Add backward compatibility tests for CLI and LLMAgent - Add regression tests for agent interface stability

victorb-sierra · 2026-01-21T08:33:16Z

Thank you for your PR. Added as a potential enhancement to the tau3 milestone.

wuTims added 4 commits January 15, 2026 10:56

test(a2a): add A2A client and backward compatibility tests

72414f6

- Add mock-based A2A client tests (agent, discovery, translation) - Add protocol metrics and performance tests - Add backward compatibility tests for CLI and LLMAgent - Add regression tests for agent interface stability

wuTims requested a review from victorb-sierra as a code owner January 15, 2026 11:30

victorb-sierra assigned victorb-sierra and unassigned victorb-sierra Jan 21, 2026

victorb-sierra added the enhancement New feature or request label Jan 21, 2026

victorb-sierra added this to the v3.0 - Tau3 Upgrade milestone Jan 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(a2a): Add A2A protocol support for remote agent evaluation#143

feat(a2a): Add A2A protocol support for remote agent evaluation#143
wuTims wants to merge 4 commits intosierra-research:mainfrom
wuTims:feature/a2a-agent

wuTims commented Jan 15, 2026 •

edited

Loading

Uh oh!

victorb-sierra commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wuTims commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design Decisions

1. Protocol Translation Layer

2. Stateless Agent Design

3. Tool Description Injection

4. Backward Compatibility

Architecture

New Files

Example Usage

Basic A2A Agent Evaluation

With Authentication

Full Evaluation Script

Test Coverage

Running Tests

Verified Evaluation

Uh oh!

victorb-sierra commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wuTims commented Jan 15, 2026 •

edited

Loading