Support pre-defined multi-turn conversation history in BenchmarkCase

## Context

BenchmarkCase currently supports single-turn only (`system` + `user` → one response). For ReAct loop evaluation (ADR-012 Phase 1), we need to support pre-defined multi-turn conversation histories so the model responds to the final message with full accumulated context visible.

This enables:
- Archive replay from LAS production traces (serialize first N iterations as context, test model's iteration N+1)
- Context degradation testing without a full loop executor
- Foundation for dynamic ReAct evaluation in later phases

## Changes

- Add optional `messages: list[dict]` field to `BenchmarkCase`
- Update `to_messages()` to return `messages` if present, else fall back to `system` + `user`
- No changes needed to runners, MCP tools, or adapters — they already accept `list[dict]` messages

## Acceptance

- [ ] Tests for `to_messages()` with and without `messages` field
- [ ] Backward compatibility: all existing 331 tests pass
- [ ] JSON battery file with `messages` array loads and runs

Part of ADR-012: ReAct Loop Evaluation (Phase 1 of 6)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support pre-defined multi-turn conversation history in BenchmarkCase #135

Context

Changes

Acceptance

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Support pre-defined multi-turn conversation history in BenchmarkCase #135

Description

Context

Changes

Acceptance

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions