Description
Problem
/gsd:verify-work extracts tests from SUMMARY.md accomplishments, which describe what was built - not what works after restart. This creates a systematic blind spot for cold-start bugs.
Real-world example
Phase 32 of our project passed 9/9 UAT tests, but shipped a P0 bug:
- createTableMs: 50 in dynalite caused a race condition where tables weren't ACTIVE when seed ran
- Silent try/catch in server.ts masked the error
- Server started green with zero data
- Every UI interaction was broken
Why verify-work missed it: All 9 tests ran against a warm server (already seeded from a prior session). The race only manifests on cold start. 8 of 9 tests were code-review checks (file exists, TypeScript compiles, logic branches correctly). The 1 runtime test reused the already-running server.
Root Cause (3 compounding failures)
- Test extraction only reads SUMMARYs - SUMMARYs describe build claims, not runtime behavior
- No runtime vs code-review distinction - Sub-agents default to reading code, not executing it
- No destructive reset test - No test killed the server, wiped state, and restarted from scratch
Proposed Enhancement
1. Cold-start test auto-injection (highest impact, lowest effort)
In the extract_tests step, pattern-match on modified files:
IF phase modifies any of: [server.ts, database/*, seed/*, index.ts, startup*, config.*]
THEN auto-add test:
  name: "Cold Start Smoke Test"
  expected: "Kill server, clear state, restart from scratch.
             Server boots without errors, seed completes,
             primary query returns data."
  type: runtime
  destructive: true
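The rule above could be sketched in TypeScript roughly as follows. This is an illustration, not GSD's actual implementation: the UatTest shape, the function names, and the matching semantics (bare names match the file name only, directory patterns match at any depth) are all assumptions, since the issue does not specify them.

```typescript
// Hypothetical test shape; GSD's real UAT format is not shown in this issue.
interface UatTest {
  name: string;
  expected: string;
  type: "runtime" | "code-review";
  destructive: boolean;
}

// Patterns copied from the proposal above.
const STARTUP_PATTERNS: string[] = [
  "server.ts", "database/*", "seed/*", "index.ts", "startup*", "config.*",
];

const COLD_START_TEST: UatTest = {
  name: "Cold Start Smoke Test",
  expected:
    "Kill server, clear state, restart from scratch. " +
    "Server boots without errors, seed completes, primary query returns data.",
  type: "runtime",
  destructive: true,
};

// Convert a simple glob (only `*` is supported) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, "[^/]*") + "$");
}

// Assumption: a bare name such as "server.ts" is compared with the file name
// only, while a directory pattern such as "database/*" may match at any depth.
function matchesPattern(filePath: string, glob: string): boolean {
  const segments = filePath.split("/");
  const width = glob.split("/").length;
  const re = globToRegExp(glob);
  const start = width === 1 ? segments.length - 1 : 0;
  for (let i = start; i + width <= segments.length; i++) {
    if (re.test(segments.slice(i, i + width).join("/"))) return true;
  }
  return false;
}

// Append the cold-start test when the phase touched startup-related files,
// unless a test with the same name is already present.
function injectColdStartTest(tests: UatTest[], modifiedFiles: string[]): UatTest[] {
  const touchesStartup = modifiedFiles.some((file) =>
    STARTUP_PATTERNS.some((pattern) => matchesPattern(file, pattern)));
  const alreadyPresent = tests.some((t) => t.name === COLD_START_TEST.name);
  return touchesStartup && !alreadyPresent ? tests.concat([COLD_START_TEST]) : tests;
}
```

Under these assumptions, a phase that modified src/server.ts would gain the extra test, while a phase that only touched README.md would leave the extracted list unchanged.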
2. Runtime vs code-review test tagging (medium effort)
Tag each test as type: runtime or type: code-review in the UAT file. Enforce a minimum ratio (e.g., 30% runtime). Sub-agents doing code review should be explicitly told "run the code, don't just read it."
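A minimal sketch of the ratio check, assuming tests already carry a type tag. The TaggedTest and RatioReport shapes and the checkRuntimeRatio name are hypothetical; the 0.3 default simply mirrors the 30% example above.

```typescript
type TestType = "runtime" | "code-review";

// Hypothetical minimal shape: only the fields the check needs.
interface TaggedTest {
  name: string;
  type: TestType;
}

interface RatioReport {
  runtimeCount: number;
  total: number;
  ratio: number;
  ok: boolean;
}

// Report whether enough of the suite actually executes code.
// An empty suite is treated as failing, since it verifies nothing.
function checkRuntimeRatio(tests: TaggedTest[], minRatio: number = 0.3): RatioReport {
  const total = tests.length;
  const runtimeCount = tests.filter((t) => t.type === "runtime").length;
  const ratio = total === 0 ? 0 : runtimeCount / total;
  return { runtimeCount, total, ratio, ok: total > 0 && ratio >= minRatio };
}
```

Applied to the Phase 32 run described earlier (8 code-review tests, 1 runtime test), the ratio is 1/9, well under 0.3, so a check like this would have flagged the suite before it was reported as passing.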
3. Destructive reset before E2E tests (medium effort)
For tests tagged destructive: true, the workflow should kill running services and clear ephemeral state before executing.
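One way this could look, sketched with injected hooks so the runner stays independent of how a given project stops services or wipes state. The ResetHooks interface, the RunnableTest shape, and runUat are hypothetical names, and the hooks are synchronous here only to keep the sketch short.

```typescript
// Hypothetical shapes; a real workflow would supply project-specific hooks.
interface RunnableTest {
  name: string;
  destructive: boolean;
  run: () => boolean; // returns true when the test passes
}

interface ResetHooks {
  killServices: () => void; // e.g. stop the dev server and local database
  clearState: () => void;   // e.g. delete ephemeral tables or temp files
}

interface TestResult {
  name: string;
  passed: boolean;
  error?: string;
}

// Run tests in order. Before each destructive test, kill services and clear
// state so the test starts cold instead of reusing a warm, pre-seeded server.
function runUat(tests: RunnableTest[], hooks: ResetHooks): TestResult[] {
  return tests.map((test): TestResult => {
    try {
      if (test.destructive) {
        hooks.killServices();
        hooks.clearState();
      }
      return { name: test.name, passed: test.run() };
    } catch (err) {
      // Record the failure rather than swallowing it silently.
      return { name: test.name, passed: false, error: String(err) };
    }
  });
}
```

In this sketch the destructive test's own run function is expected to restart the server and query it, which matches the "restart from scratch" wording of the cold-start test. Errors thrown by a hook or a test are surfaced in the result rather than hidden, which avoids repeating the silent try/catch problem from the Phase 32 example.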
Summary
The core insight: verify-work trusts SUMMARYs (what was built) instead of testing reality (what works after restart). This is the AI-agent equivalent of "works on my machine" - the agent's "machine" was a warm server with pre-seeded data.
GSD v1.22.0, reported from a real Phase 32 UAT session