-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Context
BenchmarkCase currently supports single-turn only (system + user → one response). For ReAct loop evaluation (ADR-012 Phase 1), we need to support pre-defined multi-turn conversation histories so the model responds to the final message with full accumulated context visible.
This enables:
- Archive replay from LAS production traces (serialize first N iterations as context, test model's iteration N+1)
- Context degradation testing without a full loop executor
- Foundation for dynamic ReAct evaluation in later phases
Changes
- Add optional
messages: list[dict]field toBenchmarkCase - Update
to_messages()to returnmessagesif present, else fall back tosystem+user - No changes needed to runners, MCP tools, or adapters — they already accept
list[dict]messages
Acceptance
- Tests for
to_messages()with and withoutmessagesfield - Backward compatibility: all existing 331 tests pass
- JSON battery file with
messagesarray loads and runs
Part of ADR-012: ReAct Loop Evaluation (Phase 1 of 6)
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels