
feat: add new healthcare domain #136

Open
eliot-gtn wants to merge 1 commit into sierra-research:main from eliot-gtn:domain/healthcare/add-new-healthcare-domain


eliot-gtn commented Jan 11, 2026

Summary

This PR introduces a healthcare domain for tau2-bench that tests agents' ability to handle medical workflows requiring strict privacy compliance, clinical safety awareness, and multi-step coordination. The domain evaluates 8 healthcare scenarios across 152 tasks: appointment scheduling, chronic condition monitoring, telehealth setup, prescription refills, test results access, urgent triage, critical triage, and patient mistake handling. Focus areas include workflow adherence, bidirectional agent-patient coordination, and safety-critical decision making.

Changes Made

Core Domain Implementation

New Files:

  • src/tau2/domains/healthcare/ - Complete domain implementation
    • data_model.py, user_data_model.py, environment.py - Core models and environment
    • tools.py (18 agent tools), user_tools.py (20 patient tools) - Bidirectional coordination
    • README.md - Comprehensive domain documentation

Task Generation:

  • src/tau2/domains/healthcare/tasks/ - Task generation infrastructure
    • evaluation_functions.py - Centralized evaluation logic
    • create_tasks.py, manager.py - Task generation pipeline with composition support
    • 8 intent-specific task files covering appointments (27), prescriptions (21), chronic monitoring (15), telehealth (18), test results (12), urgent triage (3), critical triage (3), patient mistakes (4)

Data & Policy:

  • data/tau2/domains/healthcare/ - Domain data and configuration
    • db.json - 3 patient records with comprehensive medical histories
    • policy.md - Unified agent policy
    • tasks.json (70 sampled), tasks_full.json (152 complete), tasks_small.json (37 single-intent)
    • split_tasks.json - Train/dev/test splits

Testing:

  • tests/test_domains/test_healthcare/ - 66 tests, all passing

Registry Integration:

  • src/tau2/registry.py - Healthcare domain and task registration

Key Healthcare Features

  1. HIPAA-Compliant Privacy - Mandatory identity verification before PHI disclosure, protected health information safeguards

  2. Appointment Scheduling Workflows - Extensive coverage of medical scheduling including booking, insurance verification, availability coordination, and calendar management

  3. Clinical Safety Thresholds - Evidence-based vital sign thresholds (fever ≥103°F, BP ≥180/120, glucose <70/>250, O2 <90%) with clear escalation rules

  4. Multi-Step Workflow Patterns - Hierarchical workflow: identity → assessment → verification → action. Tests workflow adherence, not just outcomes.

  5. Bidirectional Tool Coordination - Agent tools (system operations) + patient tools (real-world actions like measuring vitals, checking insurance card)

  6. Mixed Evaluation Strategy - ENV_ASSERTION (outcome validation) + ACTION (workflow verification against the tool call history). Both must pass for full credit, so a task checks the procedure followed as well as the final state.
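
The thresholds in item 3 can be read as a simple escalation check. The following is a minimal sketch, not the PR's implementation: the `Vitals` class, the `escalation_reasons` function, and the field names are hypothetical, the glucose unit (mg/dL) is assumed because the description does not state one, and "BP ≥180/120" is interpreted as systolic ≥180 or diastolic ≥120.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vitals:
    """Hypothetical container for patient-reported vitals (not the PR's data model)."""
    temperature_f: Optional[float] = None
    systolic: Optional[int] = None
    diastolic: Optional[int] = None
    glucose_mg_dl: Optional[float] = None  # unit is an assumption
    spo2_percent: Optional[float] = None


def escalation_reasons(v: Vitals) -> List[str]:
    """Return the names of the readings that cross a threshold from item 3."""
    reasons: List[str] = []
    # Fever: 103°F or higher.
    if v.temperature_f is not None and v.temperature_f >= 103.0:
        reasons.append("fever")
    # Blood pressure: either component at or above its limit (assumed reading of "≥180/120").
    bp_high = (v.systolic is not None and v.systolic >= 180) or (
        v.diastolic is not None and v.diastolic >= 120
    )
    if bp_high:
        reasons.append("blood_pressure")
    # Glucose: below 70 or above 250.
    if v.glucose_mg_dl is not None and (v.glucose_mg_dl < 70 or v.glucose_mg_dl > 250):
        reasons.append("glucose")
    # Oxygen saturation: below 90%.
    if v.spo2_percent is not None and v.spo2_percent < 90:
        reasons.append("oxygen_saturation")
    return reasons
```

Missing readings are skipped rather than treated as normal or abnormal; whether the actual policy requires the agent to request them first is not covered by this sketch.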

Tested Capabilities:

| Aspect | How Healthcare Tests It |
|---|---|
| Privacy/Security | Identity verification, consent management, authorization checks |
| Clinical Safety | Vital sign interpretation, urgency assessment, appropriate escalation |
| Workflow Compliance | Multi-step sequences, prerequisite validation, no shortcuts allowed |
| Coordination | Bidirectional agent-patient tool usage, calendar synchronization |
| Error Handling | Patient confusion, missing data, invalid requests |
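
One way to picture the Workflow Compliance check is as an ordered-subsequence test over the tool call history: the required steps must all appear, in order, though other calls may be interleaved. This is only an illustrative sketch; the function `follows_order` and the tool names in the usage example are hypothetical and are not taken from the PR's `tools.py` or evaluation code.

```python
from typing import Iterable, Sequence


def follows_order(called: Iterable[str], required: Sequence[str]) -> bool:
    """True if every name in `required` occurs in `called`, in the same order.

    Unrelated calls in between are allowed; a required step that is missing
    or out of order makes the check fail.
    """
    history = iter(called)
    # `step in history` advances the iterator until it finds `step`,
    # so each later step must be found after the earlier ones.
    return all(step in history for step in required)


# Hypothetical tool names following identity → assessment → verification → action.
REQUIRED = ["verify_identity", "assess_symptoms", "verify_insurance", "book_appointment"]

compliant = ["verify_identity", "get_patient_record", "assess_symptoms",
             "verify_insurance", "book_appointment"]
shortcut = ["book_appointment", "verify_identity", "assess_symptoms", "verify_insurance"]
```

With these lists, `follows_order(compliant, REQUIRED)` holds while `follows_order(shortcut, REQUIRED)` does not, because the booking call precedes identity verification.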

Domain Statistics:

| Metric | Value |
|---|---|
| Tasks | 152 full (70 sampled, 37 small) |
| Intents | 8 healthcare scenarios |
| Patient Personas | 3 (Easy, None, Hard) |
| Tools | 18 agent-side, 20 patient-side |
| Policy | 512-line unified document |
| Evaluation | Mixed (ENV_ASSERTION + ACTION) |
| Tests | 66 tests (100% pass rate) |

Base Framework Enhancement

Behavioral Workflow Verification:
Added compare_args field to ToolCall in src/tau2/data_model/message.py to enable partial argument matching for ACTION evaluation.

Purpose: The existing Action.compare_args mechanism in tasks.py can now match against actual tool calls. This verifies specific arguments (e.g., patient_id, procedure_type) while ignoring non-deterministic ones (e.g., appointment dates/times). Useful for healthcare workflows where exact times vary but procedure sequence matters.
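
The idea of partial argument matching can be sketched as follows. This is a standalone illustration under stated assumptions, not the tau2 code: the function `args_match` and the argument names in the example are hypothetical, and the real `compare_args` handling in `tasks.py` and `message.py` may differ.

```python
from typing import Any, Mapping, Optional, Sequence

_MISSING = object()  # sentinel so an absent key never equals a real value


def args_match(
    expected: Mapping[str, Any],
    actual: Mapping[str, Any],
    compare_args: Optional[Sequence[str]] = None,
) -> bool:
    """Compare tool call arguments, optionally restricted to `compare_args`.

    With compare_args=None every argument must match exactly; otherwise only
    the listed keys are compared and all other arguments are ignored.
    """
    if compare_args is None:
        return dict(expected) == dict(actual)
    return all(
        expected.get(key, _MISSING) == actual.get(key, _MISSING)
        for key in compare_args
    )


# Hypothetical expected action and observed tool call: same patient and
# procedure, but the agent picked a different appointment slot.
expected_args = {"patient_id": "P001", "procedure_type": "checkup", "date": "2026-02-01"}
actual_args = {"patient_id": "P001", "procedure_type": "checkup", "date": "2026-02-03"}
```

Here `args_match(expected_args, actual_args)` fails on the date, while `args_match(expected_args, actual_args, ["patient_id", "procedure_type"])` succeeds, which is the behavior the description attributes to `compare_args`.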

Testing

All 66 tests pass:

pytest tests/test_domains/test_healthcare/ -v

Manual testing with Claude models verified policy compliance and workflow adherence.

Documentation

  • src/tau2/domains/healthcare/README.md - Domain overview, architecture, examples, testing guide
  • data/tau2/domains/healthcare/policy.md - Complete agent policy with clinical thresholds and workflows

Checklist

  • Tests pass (66/66, 100% pass rate)
  • Code follows style guidelines
  • Documentation complete
  • No breaking changes (compare_args is optional, backward compatible)
  • Integration verified

Future Development Ideas

Expand Coverage:

  • New healthcare intents (preventive care, specialist referrals, lab orders)
  • Complete lifecycle workflows (book → reschedule → cancel)
  • Multi-step insurance workflows (pre-authorization, claims, appeals)

Diverse Patient Personas: (TauTrait?)

  • Challenging personas (anxious, non-compliant, health-illiterate)
  • Cultural/linguistic diversity
  • Age-specific (pediatric with guardian, elderly with caregiver)
  • Adversarial personas (privacy testers, malicious users)

Enhanced Patient Data:

  • Expand patient database beyond current 3 records for greater test diversity
  • More varied medical histories (multiple chronic conditions, complex medication regimens)
  • Diverse demographic profiles (age ranges, insurance types, medical backgrounds)

Enhanced Realism:

  • COMMUNICATE_INFO assertions (verify critical medical information disclosure)
  • Adversarial security testing (social engineering, impersonation attempts, HIPAA resistance)
  • Advanced workflows (out-of-network coverage, care team collaboration, multi-patient coordination)

Developed for the AgentBeats competition.
Related to Issue #127
Ideas and feedback welcome!

victorb-sierra added the enhancement label Jan 21, 2026

victorb-sierra (Collaborator) commented:

Thank you for your PR. Do you have more information on how models are performing on this new domain?

victorb-sierra added the new domain label Jan 21, 2026
