BDD test suite has ~12% failure rate in CI due to flaky lifecycle tests #309

@teemow

Summary

The BDD test suite running with --parallel 50 in CI has become flaky, with approximately 12% of CI runs failing (4 failures out of 33 recent runs). Tests pass reliably in isolation locally but fail intermittently under high parallelism.

Affected Scenarios

Scenario | Failures | Failure Pattern
mcpserver-tool-call-lifecycle | 2 | Timeout at ~31-36s
mcpserver-streamable-http-tool-call-lifecycle | 1 | Timeout at ~31s
oauth-sso-state-sync-after-login | 1 | Timeout at ~10.8s

Common Characteristics

  1. All failing tests use wait_for_state: 30s for polling state transitions
  2. All involve service lifecycle transitions (start/stop/reconnect)
  3. Tests pass locally in isolation but fail under high parallelism (50 workers)
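  4. Timeout pattern: the two mcpserver lifecycle scenarios fail at just over 30s, which matches the wait_for_state timeout; the oauth-sso-state-sync-after-login failure at ~10.8s does not fit this pattern and may have a different cause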

Root Cause Hypotheses

  1. Resource contention under high parallelism: 50 parallel muster instances competing for CPU/ports
  2. State transition timing: Service state changes may be slower under CI load
  3. Mock server startup delays: HTTP/SSE mock servers may take longer to initialize under contention
  4. Polling interval too slow: 1s poll interval with 30s timeout = only 30 attempts

Evidence

Recent CI failures:

Proposed Investigation

Phase 1: Reliable Reproduction

  • Create stress test script to run N iterations locally with 50 parallel workers
  • Establish baseline failure rate locally vs CI
  • Identify if failures are random or specific to certain scenarios
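
The reproduction step above could start from a small harness like the following. This is a minimal sketch, not the project's tooling: `Stress`, `FailureRate`, `SuiteCommand`, and `RunResult` are hypothetical names, and the command used to launch the suite is left to the caller because the issue does not state it.

```go
package main

import (
	"context"
	"os/exec"
	"time"
)

// RunResult records the outcome of one full run of the suite.
type RunResult struct {
	Iteration int
	Duration  time.Duration
	Err       error
}

// Stress calls run n times, one after another, and returns one result per iteration.
func Stress(n int, run func() error) []RunResult {
	results := make([]RunResult, 0, n)
	for i := 1; i <= n; i++ {
		start := time.Now()
		err := run()
		results = append(results, RunResult{Iteration: i, Duration: time.Since(start), Err: err})
	}
	return results
}

// FailureRate returns the fraction of results that ended with an error.
func FailureRate(results []RunResult) float64 {
	if len(results) == 0 {
		return 0
	}
	failed := 0
	for _, r := range results {
		if r.Err != nil {
			failed++
		}
	}
	return float64(failed) / float64(len(results))
}

// SuiteCommand returns a run function that executes an external command
// and treats a non-zero exit, a missing binary, or a timeout as a failure.
// The command name and arguments are supplied by the caller (assumption:
// the suite can be launched as a single command).
func SuiteCommand(timeout time.Duration, name string, args ...string) func() error {
	return func() error {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		return exec.CommandContext(ctx, name, args...).Run()
	}
}
```

Calling `Stress(20, SuiteCommand(15*time.Minute, ...))` with whatever command starts the suite with 50 workers, then comparing `FailureRate` against the ~12% seen in CI, would give the local baseline. Keeping per-iteration results also makes it possible to see whether the same scenarios fail each time.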

Phase 2: Instrumentation

  • Add timing diagnostics to muster_manager.go (port allocation, process startup, mock server init)
  • Add timing diagnostics to test_runner.go (wait_for_state polling, state transitions)
  • Collect detailed logs from failed CI runs
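
One way to add the timing diagnostics listed above is a small recorder wrapped around each step. The sketch below is an assumption about shape only: `StepTimings` and the step names are hypothetical and do not reflect the current contents of muster_manager.go or test_runner.go.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// StepTimings collects durations per named step. It is safe for
// concurrent use, since many workers would record at the same time.
type StepTimings struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

// NewStepTimings returns an empty recorder.
func NewStepTimings() *StepTimings {
	return &StepTimings{samples: make(map[string][]time.Duration)}
}

// Time runs fn, records how long it took under step, logs the
// measurement, and returns fn's error unchanged.
func (s *StepTimings) Time(step string, fn func() error) error {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)

	s.mu.Lock()
	s.samples[step] = append(s.samples[step], elapsed)
	s.mu.Unlock()

	log.Printf("timing step=%q elapsed=%s err=%v", step, elapsed, err)
	return err
}

// Count returns how many samples were recorded for step.
func (s *StepTimings) Count(step string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.samples[step])
}

// Max returns the slowest recorded duration for step, or 0 if none.
func (s *StepTimings) Max(step string) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	var max time.Duration
	for _, d := range s.samples[step] {
		if d > max {
			max = d
		}
	}
	return max
}
```

Wrapping port allocation, process startup, and mock server init in `Time` calls with distinct step names would show which step's maximum grows under 50 workers, which helps separate hypotheses 1 to 3.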

Phase 3: Fixes to Consider

  • Increase wait_for_state timeout from 30s to 60s for lifecycle tests
  • Implement exponential backoff in state polling (start at 500ms, double each attempt, cap at 5s)
  • Add startup jitter to prevent thundering herd in parallel execution
  • Consider reducing CI parallelism from 50 to 25
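
The backoff and jitter ideas above can be sketched as follows. This is an illustration under stated assumptions, not the existing wait_for_state implementation: `WaitForState`, `BackoffSchedule`, and the `check` callback are hypothetical, and the jitter here is applied to each polling delay (between half and the full delay), which is one of several reasonable choices.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// BackoffSchedule returns the un-jittered delays that fit inside timeout:
// start, then doubling, capped at max.
func BackoffSchedule(start, max, timeout time.Duration) []time.Duration {
	if start <= 0 {
		return nil
	}
	if max < start {
		max = start
	}
	var delays []time.Duration
	var total time.Duration
	for d := start; total+d <= timeout; {
		delays = append(delays, d)
		total += d
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

// WaitForState polls check until it reports want or timeout elapses.
// Delays double from startDelay up to maxDelay; each sleep is drawn
// uniformly from [delay/2, delay] so parallel workers do not poll in step.
func WaitForState(want string, timeout, startDelay, maxDelay time.Duration,
	check func() (string, error)) error {

	deadline := time.Now().Add(timeout)
	delay := startDelay
	last := "<none>"
	for {
		state, err := check()
		if err == nil && state == want {
			return nil
		}
		if err == nil {
			last = state
		}
		remaining := time.Until(deadline)
		if remaining <= 0 {
			return fmt.Errorf("state %q not reached within %s (last seen %q)", want, timeout, last)
		}
		sleep := delay/2 + time.Duration(rand.Int63n(int64(delay/2)+1))
		if sleep > remaining {
			sleep = remaining // still allow one final check at the deadline
		}
		time.Sleep(sleep)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

With the values proposed above (500ms start, 5s cap, 30s timeout), `BackoffSchedule` yields 8 waits, so roughly 9 state checks instead of the 30 made by a fixed 1s interval. Backoff therefore reduces polling load under contention rather than adding attempts, and it detects late transitions more slowly; if hypothesis 4 is the real cause, a shorter fixed interval may be the better experiment.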

Acceptance Criteria

  • Test suite failure rate in CI < 1% over 20+ consecutive runs
  • No changes to test behavior or coverage
  • Root cause documented for future reference
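
When checking the first criterion, it may help to estimate how much a clean streak actually proves. The sketch below assumes CI runs fail independently with a fixed probability, which is a simplification; `ProbAllPass` and `RunsNeeded` are hypothetical helper names.

```go
package main

import "math"

// ProbAllPass returns the probability that n independent runs all pass
// when each run fails with probability p.
func ProbAllPass(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

// RunsNeeded returns the smallest number of consecutive passing runs that
// would occur with probability at most alpha if the true failure rate were p.
// It returns 0 when p or alpha is outside the open interval (0, 1).
func RunsNeeded(p, alpha float64) int {
	if p <= 0 || p >= 1 || alpha <= 0 || alpha >= 1 {
		return 0
	}
	return int(math.Ceil(math.Log(alpha) / math.Log(1-p)))
}
```

Under these assumptions, `ProbAllPass(0.12, 20)` is roughly 0.08, so 20 clean runs would still happen about one time in twelve with the current failure rate, and `RunsNeeded(0.01, 0.05)` comes out at 299. A 20-run window is a reasonable smoke check, but tracking the rate over a longer period would be needed to support a claim of under 1%.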

Related

  • Test framework: internal/testing/test_runner.go
  • Instance manager: internal/testing/muster_manager.go
  • CI workflow: .github/workflows/ci.yaml

Metadata

Labels: bug, testing
