Description
Problem/Goal
We need a domain that enables testing of more nuanced customer service interactions where agents must exercise judgment within policy constraints.
The vacation rental domain (simulating platforms like Airbnb/VRBO) provides an ideal context for this because:
- Multi-layered decision context: Agents must consider domain policy, host-specific requirements, listing details, and user profile information. They must make judgment calls that account for host interests while still enforcing policy and escalating to a human in the loop when necessary.
- Rich scenario space: With three separate parties (guest, host, domain policy) plus listing context, the vacation rental domain supports many diverse interactions and scenarios. Potential interactions include booking modifications, property disputes, damage claims, communication mediation, safety concerns, and policy interpretation.
Proposed Solution
Phase 1: Research & Policy Foundation
- Research industry-standard policies for vacation rental platforms regarding cancellations, refunds, host obligations, and guest behavior
- Establish a comprehensive domain policy document grounded in real-world practices
- Define cancellation policy tiers (flexible, moderate, firm, strict) with clear refund rules
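As a rough illustration of how tiered refund rules could be encoded, here is a minimal Python sketch. The tier names come from the list above. The day thresholds, refund fractions, and the names `RefundRule`, `REFUND_RULES`, and `refund_fraction` are placeholder assumptions, not researched policy values, and would be replaced by the output of Phase 1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class CancellationTier(Enum):
    FLEXIBLE = "flexible"
    MODERATE = "moderate"
    FIRM = "firm"
    STRICT = "strict"


@dataclass(frozen=True)
class RefundRule:
    full_refund_days: int     # cancel at least this many days before check-in: 100% refund
    partial_refund_days: int  # cancel at least this many days before check-in: partial refund
    partial_fraction: float   # fraction refunded inside the partial window


# Hypothetical thresholds for illustration only (assumption, not real policy).
REFUND_RULES = {
    CancellationTier.FLEXIBLE: RefundRule(1, 1, 0.0),
    CancellationTier.MODERATE: RefundRule(5, 0, 0.5),
    CancellationTier.FIRM: RefundRule(30, 7, 0.5),
    CancellationTier.STRICT: RefundRule(60, 30, 0.5),
}


def refund_fraction(tier: CancellationTier, cancelled_at: datetime, check_in: datetime) -> float:
    """Return the fraction of the payment to refund under the given tier."""
    days_before = (check_in - cancelled_at) / timedelta(days=1)
    rule = REFUND_RULES[tier]
    if days_before >= rule.full_refund_days:
        return 1.0
    if days_before >= rule.partial_refund_days:
        return rule.partial_fraction
    return 0.0
```

Keeping the rules in a lookup table rather than in branching code would make it straightforward to write one task per refund outcome later on.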
Phase 2: Baseline Implementation
- Create minimal data schema: users, listings (with host-selected cancellation policies), reservations
- Implement core agent tools: get_user_details, get_reservation_details, get_listing_details, cancel_reservation, process_refund, transfer_to_human_agents
- Implement user tools for tau2 dual tool-use: get_user_id, get_reservation_id
- Design base tasks: test simple cancellation and refund flows to ensure data, tools, and policy are functional
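The minimal schema and lookup tools might look something like the sketch below. It uses plain dataclasses and an in-memory store instead of the tau2 toolkit classes, and every field name (`reservation_ids`, `total_paid`, `status`, and so on) is an assumption made for illustration. Only the three entity types and the tool names are taken from the plan above.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str
    name: str
    reservation_ids: list[str] = field(default_factory=list)


@dataclass
class Listing:
    listing_id: str
    host_id: str
    cancellation_policy: str  # host-selected tier: flexible, moderate, firm, or strict


@dataclass
class Reservation:
    reservation_id: str
    guest_id: str
    listing_id: str
    total_paid: float
    status: str = "confirmed"


@dataclass
class VacationRentalDB:
    users: dict[str, User]
    listings: dict[str, Listing]
    reservations: dict[str, Reservation]


def get_user_details(db: VacationRentalDB, user_id: str) -> User:
    if user_id not in db.users:
        raise ValueError(f"User {user_id} not found")
    return db.users[user_id]


def get_reservation_details(db: VacationRentalDB, reservation_id: str) -> Reservation:
    if reservation_id not in db.reservations:
        raise ValueError(f"Reservation {reservation_id} not found")
    return db.reservations[reservation_id]


def get_listing_details(db: VacationRentalDB, listing_id: str) -> Listing:
    if listing_id not in db.listings:
        raise ValueError(f"Listing {listing_id} not found")
    return db.listings[listing_id]


# Tiny example database used to exercise the lookup chain.
db = VacationRentalDB(
    users={"u1": User("u1", "Ada", ["r1"])},
    listings={"l1": Listing("l1", "host1", "moderate")},
    reservations={"r1": Reservation("r1", "u1", "l1", 400.0)},
)

# Lookup chain: user -> reservation -> listing -> host-selected policy.
reservation = get_reservation_details(db, get_user_details(db, "u1").reservation_ids[0])
policy = get_listing_details(db, reservation.listing_id).cancellation_policy
```

Storing the cancellation policy on the listing rather than the reservation means the agent has to follow the full lookup chain before it can decide on a refund.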
Phase 3: Objective Policy Adherence Tasks (~13 tasks)
Implement baseline tasks:
- Lookup Chain Verification (3 tasks): Tests listing lookup
- Refund Outcome Coverage (4 tasks): Tests each refund type in the policy definition
- Rejection Guards (2 tasks): Validates error handling for invalid operations
- Information Retrieval (2 tasks): Tests disambiguation and user profile lookup
- Exception Flows (2 tasks): Tests policy overrides (free cancellation period, host-initiated cancellations)
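The rejection guards and exception flows above could be sketched as follows. This is a simplified illustration under stated assumptions: the 48-hour free cancellation window, the order in which overrides are checked, and the names `resolve_refund_fraction` and `FREE_CANCELLATION_WINDOW` are all hypothetical, and reservations are represented as plain dictionaries rather than the domain's real data model.

```python
from datetime import datetime, timedelta

# Hypothetical window after booking during which cancellation is free (assumption).
FREE_CANCELLATION_WINDOW = timedelta(hours=48)


def resolve_refund_fraction(
    tier_fraction: float,
    booked_at: datetime,
    cancelled_at: datetime,
    cancelled_by: str,
) -> float:
    """Apply policy overrides first, then fall back to the tier-based fraction."""
    if cancelled_by not in ("guest", "host"):
        raise ValueError(f"Unknown cancelling party: {cancelled_by}")
    if cancelled_by == "host":
        return 1.0  # host-initiated cancellation: guest is refunded in full
    if cancelled_at - booked_at <= FREE_CANCELLATION_WINDOW:
        return 1.0  # still inside the free cancellation period
    return tier_fraction


def cancel_reservation(reservations: dict, reservation_id: str) -> dict:
    """Cancel a reservation, rejecting invalid operations with an error."""
    reservation = reservations.get(reservation_id)
    if reservation is None:
        raise ValueError(f"Reservation {reservation_id} not found")
    if reservation["status"] == "cancelled":
        raise ValueError(f"Reservation {reservation_id} is already cancelled")
    reservation["status"] = "cancelled"
    return reservation
```

A rejection-guard task would then check that the agent surfaces the error instead of retrying or inventing a refund, while an exception-flow task would check that the override takes precedence over the listing's tier.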
Phase 4: Complex Judgment-Based Evaluations (Iteratively introduce tasks)
Once the objective baseline is established, iteratively define and implement more complex scenarios:
- Ambiguous documentation claims requiring interpretation
- Partial evidence for major events
- Host vs. guest dispute scenarios (leveraging tau2 dual tool-use)
- User claims that conflict with recorded data
- Mid-trip cancellation scenarios
Impact
New Components:
- data/tau2/domains/vacation_rental/ - Domain data (db.json, policy.md, tasks.json)
- src/tau2/domains/vacation_rental/ - Domain implementation (tools, wiki)
Framework Components:
- Domain registry (register new domain)
- No changes to core evaluation framework expected
Timeline
Estimated completion: Jan 10th, 2026
Note: Phase 4 is ongoing; its tasks will continue to be improved and expanded.
- Policy research and foundation - 6hrs
- Minimal data and tool implementation - 2hrs
- Baseline tasks with validation - 6hrs
- Expanded task coverage based on findings - 4 weeks
Dependencies
- No external library dependencies beyond existing framework