feat: add new healthcare domain by eliot-gtn · Pull Request #136 · sierra-research/tau2-bench

eliot-gtn · 2026-01-11T14:34:06Z

Summary

This PR introduces a healthcare domain for tau2-bench that tests agents' ability to handle medical workflows requiring strict privacy compliance, clinical safety awareness, and multi-step coordination. The domain evaluates 8 healthcare scenarios across 152 tasks: appointment scheduling, chronic condition monitoring, telehealth setup, prescription refills, test results access, urgent triage, critical triage, and patient mistake handling. Focus areas include workflow adherence, bidirectional agent-patient coordination, and safety-critical decision making.

Changes Made

Core Domain Implementation

New Files:

src/tau2/domains/healthcare/ - Complete domain implementation
- data_model.py, user_data_model.py, environment.py - Core models and environment
- tools.py (18 agent tools), user_tools.py (20 patient tools) - Bidirectional coordination
- README.md - Comprehensive domain documentation

Task Generation:

src/tau2/domains/healthcare/tasks/ - Task generation infrastructure
- evaluation_functions.py - Centralized evaluation logic
- create_tasks.py, manager.py - Task generation pipeline with composition support
- 8 intent-specific task files covering appointments (27), prescriptions (21), chronic monitoring (15), telehealth (18), test results (12), urgent triage (3), critical triage (3), patient mistakes (4)

Data & Policy:

data/tau2/domains/healthcare/ - Domain data and configuration
- db.json - 3 patient records with comprehensive medical histories
- policy.md - Unified agent policy
- tasks.json (70 sampled), tasks_full.json (152 complete), tasks_small.json (37 single-intent)
- split_tasks.json - Train/dev/test splits

Testing:

tests/test_domains/test_healthcare/ - 66 tests, all passing (100% pass rate)

Registry Integration:

src/tau2/registry.py - Healthcare domain and task registration

Key Healthcare Features

HIPAA-Compliant Privacy - Mandatory identity verification before PHI disclosure, protected health information safeguards
Appointment Scheduling Workflows - Extensive coverage of medical scheduling including booking, insurance verification, availability coordination, and calendar management
Clinical Safety Thresholds - Evidence-based vital sign thresholds (fever ≥103°F, BP ≥180/120, glucose <70 or >250, O2 saturation <90%) with clear escalation rules
Multi-Step Workflow Patterns - Hierarchical workflow: identity → assessment → verification → action. Tests workflow adherence, not just outcomes.
Bidirectional Tool Coordination - Agent tools (system operations) + patient tools (real-world actions like measuring vitals, checking insurance card)
Mixed Evaluation Strategy - ENV_ASSERTION (outcome validation) + ACTION (workflow verification via tool call history tracking). Both must pass for full credit, so procedures are validated, not just outcomes.
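As a rough illustration of the clinical safety thresholds listed above, the check could look like the sketch below. This is not the code in the PR: the `Vitals` class, its field names, and `escalation_reasons` are hypothetical, and the sketch assumes that a blood pressure reading escalates when either the systolic or the diastolic value reaches its limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vitals:
    """Hypothetical container for patient-reported vitals (not the PR's data model)."""
    temperature_f: Optional[float] = None
    systolic: Optional[int] = None
    diastolic: Optional[int] = None
    glucose: Optional[float] = None
    oxygen_saturation: Optional[float] = None


def escalation_reasons(v: Vitals) -> list[str]:
    """Return the vitals that cross the thresholds quoted in the PR description."""
    reasons = []
    if v.temperature_f is not None and v.temperature_f >= 103.0:
        reasons.append("fever")
    # Assumption: either component reaching its limit is enough to escalate.
    high_systolic = v.systolic is not None and v.systolic >= 180
    high_diastolic = v.diastolic is not None and v.diastolic >= 120
    if high_systolic or high_diastolic:
        reasons.append("blood_pressure")
    if v.glucose is not None and (v.glucose < 70 or v.glucose > 250):
        reasons.append("glucose")
    if v.oxygen_saturation is not None and v.oxygen_saturation < 90:
        reasons.append("oxygen_saturation")
    return reasons


print(escalation_reasons(Vitals(temperature_f=103.0, glucose=65)))
```

Missing measurements are skipped rather than treated as normal or abnormal, which keeps the sketch neutral about how the real policy handles incomplete data.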

Tested Capabilities:

Aspect	How Healthcare Tests It
Privacy/Security	Identity verification, consent management, authorization checks
Clinical Safety	Vital sign interpretation, urgency assessment, appropriate escalation
Workflow Compliance	Multi-step sequences, prerequisite validation, no shortcuts allowed
Coordination	Bidirectional agent-patient tool usage, calendar synchronization
Error Handling	Patient confusion, missing data, invalid requests
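The "Workflow Compliance" row can be pictured as an ordering check over the recorded tool call history. The sketch below is only an illustration of the identity → assessment → verification → action sequence described earlier; the tool names and the `follows_workflow` helper are hypothetical and do not come from the PR.

```python
STAGES = ["identity", "assessment", "verification", "action"]

# Hypothetical mapping from tool name to workflow stage (names invented for this sketch).
TOOL_STAGE = {
    "verify_identity": "identity",
    "assess_request": "assessment",
    "verify_insurance": "verification",
    "book_appointment": "action",
}


def follows_workflow(tool_calls: list[str]) -> bool:
    """True if every stage occurs and no stage runs before all earlier stages have run."""
    seen = set()
    for name in tool_calls:
        stage = TOOL_STAGE.get(name)
        if stage is None:
            continue  # tools outside the workflow are ignored in this sketch
        earlier = STAGES[: STAGES.index(stage)]
        if any(s not in seen for s in earlier):
            return False  # a prerequisite stage was skipped
        seen.add(stage)
    return seen == set(STAGES)


history = ["verify_identity", "assess_request", "verify_insurance", "book_appointment"]
print(follows_workflow(history))
```

A history that reaches the correct final state but skips identity verification fails this check, which is the "no shortcuts allowed" idea in the table.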

Domain Statistics:

Metric	Value
Tasks	152 full (70 sampled, 37 small)
Intents	8 healthcare scenarios
Patient Personas	3 (Easy, None, Hard)
Tools	18 agent-side, 20 patient-side
Policy	512 lines unified document
Evaluation	Mixed (ENV_ASSERTION + ACTION)
Tests	66 tests (100% pass rate)

Base Framework Enhancement

Behavioral Workflow Verification:
Added compare_args field to ToolCall in src/tau2/data_model/message.py to enable partial argument matching for ACTION evaluation.

Purpose: The existing Action.compare_args mechanism in tasks.py can now match against actual tool calls. This verifies specific arguments (e.g., patient_id, procedure_type) while ignoring non-deterministic ones (e.g., appointment dates/times). Useful for healthcare workflows where exact times vary but procedure sequence matters.
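A minimal sketch of what partial argument matching means here, written as a standalone function rather than the framework's actual code. The function name `args_match`, the argument values, and the exact semantics (compare everything when `compare_args` is `None`) are assumptions made for illustration.

```python
from typing import Any, Optional


def args_match(
    expected: dict[str, Any],
    actual: dict[str, Any],
    compare_args: Optional[list[str]] = None,
) -> bool:
    """Compare tool call arguments, optionally restricted to the keys in compare_args."""
    if compare_args is None:
        # Assumption: with no restriction, all arguments must match exactly.
        return expected == actual
    expected_subset = {k: v for k, v in expected.items() if k in compare_args}
    actual_subset = {k: v for k, v in actual.items() if k in compare_args}
    return expected_subset == actual_subset


# Hypothetical expected action and observed tool call; only the date differs.
expected = {"patient_id": "P001", "procedure_type": "annual_physical", "date": "2026-02-01"}
actual = {"patient_id": "P001", "procedure_type": "annual_physical", "date": "2026-02-09"}

print(args_match(expected, actual))
print(args_match(expected, actual, compare_args=["patient_id", "procedure_type"]))
```

With the restriction in place, the differing appointment date no longer causes a mismatch, while a wrong `patient_id` or `procedure_type` still would.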

Testing

All 66 tests pass (100% pass rate):

pytest tests/test_domains/test_healthcare/ -v

Manual testing with Claude models verified policy compliance and workflow adherence.

Documentation

src/tau2/domains/healthcare/README.md - Domain overview, architecture, examples, testing guide
data/tau2/domains/healthcare/policy.md - Complete agent policy with clinical thresholds and workflows

Checklist

Tests pass (66/66, 100% pass rate)
Code follows style guidelines
Documentation complete
No breaking changes (compare_args is optional, backward compatible)
Integration verified

Future Development Ideas

Expand Coverage:

New healthcare intents (preventive care, specialist referrals, lab orders)
Complete lifecycle workflows (book → reschedule → cancel)
Multi-step insurance workflows (pre-authorization, claims, appeals)

Diverse Patient Personas: (TauTrait?)

Challenging personas (anxious, non-compliant, health-illiterate)
Cultural/linguistic diversity
Age-specific (pediatric with guardian, elderly with caregiver)
Adversarial personas (privacy testers, malicious users)

Enhanced Patient Data:

Expand patient database beyond current 3 records for greater test diversity
More varied medical histories (multiple chronic conditions, complex medication regimens)
Diverse demographic profiles (age ranges, insurance types, medical backgrounds)

Enhanced Realism:

COMMUNICATE_INFO assertions (verify critical medical information disclosure)
Adversarial security testing (social engineering, impersonation attempts, HIPAA resistance)
Advanced workflows (out-of-network coverage, care team collaboration, multi-patient coordination)

Developed for the AgentBeats competition.
Related to Issue #127
Ideas and feedback welcome!

victorb-sierra · 2026-01-21T08:40:48Z

Thank you for your PR. Do you have more information on how models are performing on this new domain?

feat: add new healthcare domain

969b928

eliot-gtn requested a review from victorb-sierra as a code owner January 11, 2026 14:34

victorb-sierra added the enhancement New feature or request label Jan 21, 2026

victorb-sierra added the new domain Proposal for a new domain label Jan 21, 2026

eliot-gtn wants to merge 1 commit into sierra-research:main from eliot-gtn:domain/healthcare/add-new-healthcare-domain
