feat: Add hospitality domain - Berkeley Hot Pot restaurant simulation by binleiwang · Pull Request #144 · sierra-research/tau2-bench

binleiwang · 2026-01-16T07:28:14Z

Hospitality Domain

A full-service restaurant simulation for evaluating conversational AI agents in high-stakes service environments.

Motivation

The AI in restaurants market was valued at USD 6.1 billion in 2024 and is projected to reach USD 48.3 billion by 2033 (CAGR 23.5%) [1]. Voice AI adoption is accelerating, with companies like SoundHound AI now powering over 10,000 restaurant locations for drive-thru and phone ordering [2]. However, current deployments remain limited to low-complexity, transactional interactions: drive-thru orders, phone reservations, and basic FAQs.

Full-service dining presents fundamentally different challenges:

Stateful, multi-turn conversations with emotionally varied customers
Liability-sensitive decisions (food allergies, safety incidents)
Hierarchical authority constraints requiring appropriate escalation
Conflicting objectives (customer satisfaction vs. policy compliance)

This domain addresses the gap between current benchmark coverage and the requirements of full-service hospitality AI.

Grounded in Real Operations: The entire domain is modeled after an actual restaurant operation. This includes:

Seating configuration: Table types, capacities, and expansion limits based on real floor plans
Staff hierarchy: Role definitions, authority levels, and escalation paths reflecting actual restaurant org charts
Menu and pricing: Soup bases, food items, and pricing structures from operational menus
Policies: Discount limits, reservation rules, and service recovery procedures from real staff handbooks
Tasks: All 116 scenarios are derived from actual customer interactions and operational incidents

References:

[1] Dataintelo. "AI in Restaurants Market." 2024. https://dataintelo.com/report/ai-in-restaurants-market
[2] SoundHound AI. "Next-Generation AI Platform for Restaurants." Feb 2025. https://investors.soundhound.com/news-releases/

Overview

This domain simulates a Chinese hot pot restaurant (Berkeley Hot Pot), testing agents on complex interactions that require balancing customer satisfaction with strict operational policies.

Key characteristics:

Multi-turn conversations with emotionally varied customers
Role-based authority limits (Server vs. Manager)
Food safety and liability considerations
Temporal and capacity constraints

Domain Features

High-Stakes Decision Making

Agents must prioritize safety and liability over customer satisfaction. For example, an agent must strictly enforce allergy protocols (e.g., verifying hidden allergens) even when the customer insists otherwise.

Hierarchical Reasoning & Escalation

The domain simulates real-world organizational constraints in which agents must judge the limits of their own authority. An agent decides whether to resolve an issue independently (e.g., a small discount) or escalate to management, mirroring role-based access control.
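The resolve-or-escalate decision can be pictured with a minimal sketch. The role names follow the Server vs. Manager split mentioned in the overview, but the `decide` helper and the discount ceilings are hypothetical placeholders, not values taken from `policy.md` or from the domain's tools.

```python
from enum import Enum


class Role(Enum):
    SERVER = "server"
    MANAGER = "manager"


# Hypothetical per-role discount ceilings, in percent. The real limits are
# defined by the domain policy and are NOT reproduced here.
MAX_DISCOUNT_PCT = {Role.SERVER: 10, Role.MANAGER: 50}


def decide(role: Role, requested_pct: int) -> str:
    """Return 'apply' if the role may grant the discount, else 'escalate'."""
    if requested_pct < 0:
        raise ValueError("discount must be non-negative")
    return "apply" if requested_pct <= MAX_DISCOUNT_PCT[role] else "escalate"


print(decide(Role.SERVER, 5))    # within the server's assumed limit
print(decide(Role.SERVER, 25))   # above it, so the request goes to a manager
```

Keeping the limits in a single table makes the authority boundary explicit, which is the property the escalation tasks are meant to probe.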

Investigation-First SOPs

Guards against hallucination by requiring agents to verify facts via tools before taking action. For instance, an agent must check the kitchen status or bill details before offering apologies or compensation.
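One way to picture an investigation-first rule is as an ordering constraint on the agent's tool calls. The sketch below is illustrative only: `verified_before_acting` and the tool names `get_kitchen_status` and `offer_compensation` are hypothetical and may not match the domain's actual tools or its evaluator.

```python
def verified_before_acting(calls: list[str], verify: str, act: str) -> bool:
    """True if every `act` call is preceded by at least one `verify` call.

    A trajectory that never calls `act` trivially satisfies the rule.
    """
    seen_verify = False
    for name in calls:
        if name == verify:
            seen_verify = True
        elif name == act and not seen_verify:
            return False
    return True


# Hypothetical tool names, used only for illustration.
good = ["get_kitchen_status", "offer_compensation"]
bad = ["offer_compensation", "get_kitchen_status"]
print(verified_before_acting(good, "get_kitchen_status", "offer_compensation"))
print(verified_before_acting(bad, "get_kitchen_status", "offer_compensation"))
```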

Deterministic Evaluation

All 116 tasks are scored via verifiable actions (tool calls) and database state assertions, ensuring reproducible results without relying on unstable LLM-as-judge metrics.

Task Categories

116 tasks organized by staff role:

Category	Count	Description
`host_phone`	13	Phone reservations, inquiries, complaint calls
`host_seating`	6	Table assignment, party size changes
`host_walkin`	1	Walk-in customer handling
`server_food_safety`	11	Allergy and dietary restriction handling
`server_promotion`	16	Discounts, loyalty points, secret codes
`server_food_issue`	7	Order accuracy, out-of-stock items
`server_billing`	6	Payment and billing inquiries
`server_celebration`	4	Birthday, anniversary coordination
`server_incident`	13	Complaints, accidents, escalations
`server_special_policy`	6	Special amenities and policies
`server_misc`	33	Menu knowledge, seating preferences, misc
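As a quick consistency check, the per-category counts in the table can be summed to confirm they add up to the stated 116 tasks. The dictionary simply transcribes the table; nothing here reads from `tasks.json`.

```python
# Category counts transcribed from the table above.
counts = {
    "host_phone": 13,
    "host_seating": 6,
    "host_walkin": 1,
    "server_food_safety": 11,
    "server_promotion": 16,
    "server_food_issue": 7,
    "server_billing": 6,
    "server_celebration": 4,
    "server_incident": 13,
    "server_special_policy": 6,
    "server_misc": 33,
}

total = sum(counts.values())
print(total)  # 116
```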

Kitchen Coordination Variants

Additional task variants simulate internal operational challenges:

_overload: Kitchen overwhelmed
_understaffed: Short-staffed scenarios
_equipment: Equipment malfunctions
_attitude: Difficult colleague interactions

Agents must handle customer-facing issues without exposing internal problems.

Usage

# Run specific tasks
tau2 run --domain hospitality --task-ids hospitality_007_hidden_allergy --agent-llm gpt-4o

# Run full benchmark
tau2 run --domain hospitality --task-split base --agent-llm gpt-4o

Evaluation

Tasks are evaluated using:

ACTION: Required tool calls (e.g., check_allergy_safety, create_reservation)
ENV_ASSERTION: Database state verification (e.g., assert_escalated_to_manager)

No LLM-as-judge; all evaluations are deterministic.
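The two criteria can be combined into a binary reward along the following lines. This is a simplified sketch under stated assumptions, not the tau2 evaluator: it matches tool calls by name only, and the `evaluate` function and the `escalated_to_manager` database field are hypothetical. Only the names `check_allergy_safety` and `assert_escalated_to_manager` come from the description above.

```python
from typing import Callable


def assert_escalated_to_manager(db: dict) -> bool:
    """Environment assertion over a hypothetical `escalated_to_manager` flag."""
    return db.get("escalated_to_manager", False) is True


def evaluate(
    tool_calls: list[str],
    db: dict,
    required_actions: list[str],
    env_assertions: list[Callable[[dict], bool]],
) -> float:
    """Return 1.0 only if every required action was called and every
    environment assertion holds on the final database state."""
    actions_ok = all(action in tool_calls for action in required_actions)
    state_ok = all(check(db) for check in env_assertions)
    return 1.0 if actions_ok and state_ok else 0.0


reward = evaluate(
    tool_calls=["check_allergy_safety"],
    db={"escalated_to_manager": True},
    required_actions=["check_allergy_safety"],
    env_assertions=[assert_escalated_to_manager],
)
print(reward)  # 1.0
```

Because both checks are pure functions of the recorded tool calls and the final state, rerunning the evaluation on the same trajectory always yields the same reward.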

Model Performance

Baseline results on the 11-task base split:

Model	Pass Rate	Avg Reward	Avg Cost/Conv
GPT-4o-mini	63.6%	0.636	$0.004

Evaluated with --max-concurrency 1 on 2026-01-31.
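The reported figures are consistent with 7 of the 11 tasks passing. That count is an inference, assuming one trial per task and binary rewards; it is not stated in the results.

```python
# Assumption: 7 passed tasks, inferred from the reported 63.6% on 11 tasks.
passed, total = 7, 11

pass_rate = passed / total
print(f"{pass_rate:.1%}")   # 63.6%
print(round(pass_rate, 3))  # 0.636
```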

Files

data/tau2/domains/hospitality/
├── db.json          # Restaurant database (menu, tables, customers)
├── policy.md        # Operational policies (466 lines)
└── tasks.json       # Task definitions (116 tasks)

src/tau2/domains/hospitality/
├── data_model.py    # Pydantic models
├── environment.py   # Environment setup
├── tools.py         # Tool implementations (50+ tools)
└── utils.py         # Helper functions

- 101 deterministic evaluation tasks
- Safety protocols (allergy handling, gluten-free policy)
- RBAC (Server/Host/Manager authority levels)
- Complex policy constraints (promotions, reservations, incidents)
- 100% code-based evaluation (no LLM-as-Judge)
- Docker support for AgentBeats integration

victorb-sierra · 2026-01-21T08:36:20Z

Thank you for your PR. Do you have more detailed information on how models are performing on this new domain?

binleiwang · 2026-02-01T07:59:11Z

> Thank you for your PR. Do you have more detailed information on how models are performing on this new domain?

Thanks for the feedback! I've added model performance data to the domain README. GPT-4o-mini achieves a 63.6% pass rate on the 11-task base split. Let me know if you need additional details.

binleiwang requested a review from victorb-sierra as a code owner January 16, 2026 07:28

binleiwang added 3 commits January 15, 2026 23:45

docs: Add Docker build instructions to agentify README

b472410

docs: Add Docker instructions to hospitality README

3706ffe

Fix Docker CMD and bind to 0.0.0.0

173a09b

victorb-sierra added the labels enhancement (New feature or request) and new domain (Proposal for a new domain) Jan 21, 2026

binleiwang added 3 commits January 31, 2026 23:16

Fix infinite loop, add conversation efficiency guidelines, improve tools and tasks

bb7e477

Add pre-built Docker image URL to README

09487ab

Add model performance to hospitality README

c46a470

Refine README features to focus on AI capabilities

43da4c0

binleiwang wants to merge 8 commits into sierra-research:main from binleiwang:main
