feat: Add hospitality domain - Berkeley Hot Pot restaurant simulation #144
Open
binleiwang wants to merge 8 commits into sierra-research:main from
- 101 deterministic evaluation tasks
- Safety protocols (allergy handling, gluten-free policy)
- RBAC (Server/Host/Manager authority levels)
- Complex policy constraints (promotions, reservations, incidents)
- 100% code-based evaluation (no LLM-as-Judge)
- Docker support for AgentBeats integration
Collaborator
Thank you for your PR. Do you have more detailed information on how models are performing on this new domain?
Author
Thanks for the feedback! I've added model performance data to the domain README. GPT-4o-mini achieves a 63.6% pass rate on the 11-task base split. Let me know if you need additional details.
Hospitality Domain
A full-service restaurant simulation for evaluating conversational AI agents in high-stakes service environments.
Motivation
The AI in restaurants market is valued at USD 6.1 billion in 2024 and projected to reach USD 48.3 billion by 2033 (CAGR 23.5%) [1]. Voice AI adoption is accelerating, with companies like SoundHound AI now powering over 10,000 restaurant locations for drive-thru and phone ordering [2]. However, current deployments remain limited to low-complexity, transactional interactions — drive-thru orders, phone reservations, and basic FAQs.
Full-service dining presents fundamentally different challenges:
This domain addresses the gap between current benchmark coverage and the requirements of full-service hospitality AI.
Grounded in Real Operations: The entire domain is modeled after an actual restaurant operation. This includes:
References:
Overview
This domain simulates a Chinese hot pot restaurant (Berkeley Hot Pot), testing agents on complex interactions that require balancing customer satisfaction with strict operational policies.
Key characteristics:
Domain Features
High-Stakes Decision Making
Agents must prioritize safety and liability over customer satisfaction. For example, they must strictly enforce allergy protocols (such as verifying hidden allergens) even when customers insist otherwise.
Hierarchical Reasoning & Escalation
Simulates real-world organizational constraints in which agents must judge the limits of their own authority. Agents decide whether to resolve an issue independently (e.g., a small discount) or escalate it to management, mirroring role-based access control.
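The resolve-or-escalate decision can be sketched as a simple authority check. This is a minimal illustration, not the domain's actual implementation: the role names come from the PR description, while the discount limits and the `decide` function are hypothetical.

```python
# Hypothetical authority limits: the maximum discount (in percent) each role
# may grant on its own. These numbers are illustrative assumptions only.
MAX_DISCOUNT_PCT = {"host": 0, "server": 10, "manager": 100}


def decide(role: str, requested_discount_pct: float) -> str:
    """Return 'resolve' if the request is within the role's authority, else 'escalate'."""
    limit = MAX_DISCOUNT_PCT[role]
    return "resolve" if requested_discount_pct <= limit else "escalate"


small = decide("server", 5)     # within the assumed server limit
large = decide("server", 25)    # exceeds the assumed server limit
manager = decide("manager", 25)
```

Under these assumed limits, a server resolves the 5% request directly and escalates the 25% one, which a manager can then approve.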
Investigation-First SOPs
Guards against hallucination by requiring agents to verify facts via tools before taking action. For instance, an agent must check the kitchen status or bill details before offering an apology or compensation.
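One way to check an investigation-first rule is to inspect the order of tool calls in a trajectory. The sketch below assumes a trajectory is a plain list of tool names; the tool names `get_kitchen_status` and `offer_compensation` are hypothetical placeholders, not identifiers confirmed by this domain.

```python
def investigated_first(calls: list[str], verify: str, act: str) -> bool:
    """True if every `act` call is preceded by at least one `verify` call."""
    seen_verify = False
    for name in calls:
        if name == verify:
            seen_verify = True
        elif name == act and not seen_verify:
            # The agent acted before verifying any facts.
            return False
    return True


# Hypothetical tool names, used for illustration only.
good = investigated_first(
    ["get_kitchen_status", "offer_compensation"],
    verify="get_kitchen_status",
    act="offer_compensation",
)
bad = investigated_first(
    ["offer_compensation", "get_kitchen_status"],
    verify="get_kitchen_status",
    act="offer_compensation",
)
```

Here `good` is `True` and `bad` is `False`, because in the second trajectory compensation is offered before the kitchen status is checked.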
Deterministic Evaluation
All 116 tasks are scored via verifiable actions (tool calls) and database state assertions, ensuring reproducible results without relying on unstable LLM-as-judge metrics.
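A rough sketch of what this kind of deterministic scoring can look like follows. It assumes the trajectory is a list of tool names and the database is a plain dictionary; `score_task`, the `incident` schema, and the body of `assert_escalated_to_manager` are illustrative assumptions rather than the domain's actual code.

```python
from typing import Callable

# Hypothetical in-memory "database"; the real schema is not shown in this README.
Db = dict


def assert_escalated_to_manager(db: Db) -> bool:
    """Assumed state check: the incident record names the manager as escalation target."""
    return db.get("incident", {}).get("escalated_to") == "manager"


def score_task(
    actual_calls: list[str],
    required_calls: list[str],
    db: Db,
    assertions: list[Callable[[Db], bool]],
) -> bool:
    """Pass only if every required tool was called and every state assertion holds."""
    called = set(actual_calls)
    actions_ok = all(name in called for name in required_calls)
    state_ok = all(check(db) for check in assertions)
    return actions_ok and state_ok


db = {"incident": {"escalated_to": "manager"}}
passed = score_task(
    ["check_allergy_safety", "escalate_incident"],  # second name is hypothetical
    ["check_allergy_safety"],
    db,
    [assert_escalated_to_manager],
)
```

Because the result depends only on the recorded calls and the final state, rerunning the scorer on the same trajectory always gives the same verdict.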
Task Categories
116 tasks organized by staff role:
- host_phone
- host_seating
- host_walkin
- server_food_safety
- server_promotion
- server_food_issue
- server_billing
- server_celebration
- server_incident
- server_special_policy
- server_misc
Kitchen Coordination Variants
Additional task variants simulate internal operational challenges:
- _overload: Kitchen overwhelmed
- _understaffed: Short-staffed scenarios
- _equipment: Equipment malfunctions
- _attitude: Difficult colleague interactions
Agents must handle customer-facing issues without exposing internal problems.
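If the variant suffixes are appended to a task's category name, a task identifier can be split into its base category and variant as sketched below. The identifier scheme (for example `server_food_issue_overload`) is an assumption made for illustration; only the four suffixes themselves come from the list above.

```python
from typing import Optional

# Suffixes taken from the variant list; the ID layout itself is assumed.
VARIANT_SUFFIXES = ("_overload", "_understaffed", "_equipment", "_attitude")


def split_variant(task_id: str) -> tuple[str, Optional[str]]:
    """Split a task ID into (base category, variant name or None)."""
    for suffix in VARIANT_SUFFIXES:
        if task_id.endswith(suffix):
            # Strip the suffix from the ID and drop its leading underscore.
            return task_id[: -len(suffix)], suffix[1:]
    return task_id, None


with_variant = split_variant("server_food_issue_overload")
without_variant = split_variant("host_phone")
```

Matching on the suffix rather than splitting on underscores keeps multi-word categories such as `server_food_issue` intact.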
Usage
Evaluation
Tasks are evaluated using:
- Tool-call actions (check_allergy_safety, create_reservation)
- State assertions (assert_escalated_to_manager)
No LLM-as-judge; all evaluations are deterministic.
Model Performance
Baseline results on the 11-task base split:
Evaluated with --max-concurrency 1 on 2026-01-31.
Files