Skip to content

feat(a2a): Add A2A protocol support for remote agent evaluation#143

Open
wuTims wants to merge 4 commits intosierra-research:mainfrom
wuTims:feature/a2a-agent
Open

feat(a2a): Add A2A protocol support for remote agent evaluation#143
wuTims wants to merge 4 commits intosierra-research:mainfrom
wuTims:feature/a2a-agent

Conversation

@wuTims
Copy link

@wuTims wuTims commented Jan 15, 2026

Summary

This PR introduces Agent-to-Agent (A2A) protocol support to tau2-bench, enabling evaluation of remote agents via the A2A protocol. This allows benchmarking of agents deployed as services without requiring local code integration.

Key additions:

  • A2A protocol client with full JSON-RPC 2.0 support
  • A2AAgent implementation for remote agent evaluation
  • CLI integration with --agent-a2a-endpoint flag
  • Comprehensive test suite (103 new tests)

Closes #111
Note: The scope of this initial PR was reduced by removing the tau2-agent implementation. Since there is an existing implementation of "agentified-tau-bench", I thought that adding an a2a-agent in the CLI would be more impactful. The forked tau2-agent implementation using ADK can be added in experiments/ if needed. The full implementation is available at my fork: https://github.com/wuTims/tau2-bench-agent

Design Decisions

1. Protocol Translation Layer

The A2A protocol uses a different message format than tau2's internal representation. Rather than modifying core tau2 data models, we implemented a dedicated translation layer (src/tau2/a2a/translation.py) that:

  • Converts tau2 UserMessage/AssistantMessage to A2A MessageSendParams
  • Extracts tool calls from A2A responses (supports both JSON and markdown code blocks)
  • Formats system context with domain policy and available tools in the first message

2. Stateless Agent Design

A2AAgent maintains conversation state via A2A's context_id mechanism:

  • First message creates a new context on the remote agent
  • Subsequent messages reuse the same context_id
  • State is tracked in A2AAgentState (immutable, returned with each response)

3. Tool Description Injection

Since A2A agents don't receive tools via function calling, we inject tool descriptions as structured text in the system context:

<available_tools>
<tool name="get_user_info">
  <description>Get user information by user ID</description>
  <parameters>{"type": "object", "properties": {"user_id": {"type": "string"}}}</parameters>
</tool>
</available_tools>

4. Backward Compatibility

All existing functionality is preserved:

  • Default agent remains llm_agent
  • No changes to existing CLI workflows
  • A2A-specific arguments are only required when --agent a2a_agent

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         tau2 CLI                                     │
│  tau2 run --agent a2a_agent --agent-a2a-endpoint <url>              │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       A2AAgent                                       │
│  src/tau2/agent/a2a_agent.py                                        │
│  - Implements BaseAgent interface                                    │
│  - Manages conversation state via context_id                         │
│  - Delegates to A2AClient for protocol communication                 │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       A2AClient                                      │
│  src/tau2/a2a/client.py                                             │
│  - HTTP client with retry logic                                      │
│  - JSON-RPC 2.0 request/response handling                            │
│  - Agent card discovery                                              │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Translation Layer                                 │
│  src/tau2/a2a/translation.py                                        │
│  - tau2 messages → A2A MessageSendParams                            │
│  - A2A responses → tau2 AssistantMessage                            │
│  - Tool call extraction (JSON + markdown)                            │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Remote A2A Agent                                  │
│  (External service implementing A2A protocol)                        │
└─────────────────────────────────────────────────────────────────────┘

New Files

File Purpose
src/tau2/a2a/__init__.py Package exports
src/tau2/a2a/client.py A2A HTTP client with retry logic
src/tau2/a2a/models.py Pydantic models for A2A config and state
src/tau2/a2a/translation.py Message format translation
src/tau2/a2a/exceptions.py A2A-specific exceptions
src/tau2/a2a/metrics.py Performance metrics collection
src/tau2/agent/a2a_agent.py BaseAgent implementation for A2A

Example Usage

Basic A2A Agent Evaluation

tau2 run \
  --domain telecom \
  --task-set-name telecom \
  --agent a2a_agent \
  --agent-a2a-endpoint "https://my-agent.example.com/a2a/agent" \
  --user-llm "gpt-4.1" \
  --num-tasks 5

With Authentication

tau2 run \
  --domain airline \
  --task-set-name airline \
  --agent a2a_agent \
  --agent-a2a-endpoint "https://my-agent.example.com/a2a/agent" \
  --agent-a2a-auth-token "Bearer my-secret-token" \
  --agent-a2a-timeout 600 \
  --num-tasks 10

Full Evaluation Script

#!/bin/bash
# evaluate_a2a_agent.sh

ENDPOINT="https://my-agent.example.com/a2a/agent"
DOMAINS=("telecom" "airline" "retail")

for domain in "${DOMAINS[@]}"; do
  echo "Evaluating $domain domain..."
  tau2 run \
    --domain "$domain" \
    --task-set-name "$domain" \
    --agent a2a_agent \
    --agent-a2a-endpoint "$ENDPOINT" \
    --user-llm "gpt-4.1" \
    --num-tasks 50 \
    --max-concurrency 5 \
    --save-to "a2a_${domain}_eval"
done

echo "View results:"
tau2 view

Test Coverage

Test Suite Tests Coverage
test_a2a_client/test_a2a_agent.py 21 Agent lifecycle, state management
test_a2a_client/test_agent_discovery.py 8 Agent card fetching, fallbacks
test_a2a_client/test_debug_logging.py 4 TRACE-level logging
test_a2a_client/test_message_translation.py 25 Message conversion, tool extraction
test_a2a_client/test_metrics.py 11 Latency/throughput tracking
test_a2a_client/test_metrics_export.py 16 Prometheus/JSON export
test_a2a_client/test_performance.py 10 Concurrent requests, timeouts
test_backward_compatibility/test_cli.py 23 CLI arg parsing, defaults
test_backward_compatibility/test_llm_agent.py 5 LLMAgent unchanged
test_backward_compatibility/test_regression.py 11 Import paths, registry
Total 103

Running Tests

# All A2A tests (mock-based, no network)
pytest tests/test_a2a_client -v -m a2a_mock

# Backward compatibility tests
pytest tests/test_backward_compatibility -v

# All tests
pytest tests/test_a2a_client tests/test_backward_compatibility -v

Verified Evaluation

Successfully tested against a deployed A2A agent:

Domain: telecom
Task: mobile_data_issue (airplane_mode + roaming)
Endpoint: https://simple-gemini-agent-*.run.app/a2a/simple_gemini_agent
Result: Reward 1.0 ✅

- Add A2AClient for HTTP communication with A2A agents
- Add message translation between tau2 and A2A formats
- Add protocol metrics collection for latency/token tracking
- Add custom exceptions for A2A error handling
- Support markdown code block extraction for tool calls
- Include system context injection on first message
- Implement A2AAgent extending LocalAgent interface
- Add async/sync bridge for HTTP operations
- Support context persistence across conversation turns
- Add protocol metrics export methods
- Include from_cli_args factory for CLI integration
- Add --agent-a2a-endpoint, --agent-a2a-auth-token, --agent-a2a-timeout flags
- Register a2a_agent in agent registry
- Add A2AAgent construction in run_task()
- Fix is_tool_call() to check for non-empty tool_calls list
- Add httpx and pytest-asyncio dependencies
- Add mock-based A2A client tests (agent, discovery, translation)
- Add protocol metrics and performance tests
- Add backward compatibility tests for CLI and LLMAgent
- Add regression tests for agent interface stability
@victorb-sierra victorb-sierra added the enhancement New feature or request label Jan 21, 2026
@victorb-sierra victorb-sierra added this to the v3.0 - Tau3 Upgrade milestone Jan 21, 2026
@victorb-sierra
Copy link
Collaborator

Thank you for your PR. Added as a potential enhancement to the tau3 milestone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improving A2A Agent Integration for tau2-bench

2 participants