Skip to content

Support pre-defined multi-turn conversation history in BenchmarkCase #135

@shanevcantwell

Description

@shanevcantwell

Context

BenchmarkCase currently supports single-turn only (system + user → one response). For ReAct loop evaluation (ADR-012 Phase 1), we need to support pre-defined multi-turn conversation histories so the model responds to the final message with full accumulated context visible.

This enables:

  • Archive replay from LAS production traces (serialize first N iterations as context, test model's iteration N+1)
  • Context degradation testing without a full loop executor
  • Foundation for dynamic ReAct evaluation in later phases

Changes

  • Add optional messages: list[dict] field to BenchmarkCase
  • Update to_messages() to return messages if present, else fall back to system + user
  • No changes needed to runners, MCP tools, or adapters — they already accept list[dict] messages

Acceptance

  • Tests for to_messages() with and without messages field
  • Backward compatibility: all existing 331 tests pass
  • JSON battery file with messages array loads and runs

Part of ADR-012: ReAct Loop Evaluation (Phase 1 of 6)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions