-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Labels
Description
Summary
The BDD test suite running with --parallel 50 in CI has become flaky, with approximately 12% of CI runs failing (4 failures out of 33 recent runs). Tests pass reliably in isolation locally but fail intermittently under high parallelism.
Affected Scenarios
| Scenario | Failures | Failure Pattern |
|---|---|---|
mcpserver-tool-call-lifecycle |
2 | Timeout at ~31-36s |
mcpserver-streamable-http-tool-call-lifecycle |
1 | Timeout at ~31s |
oauth-sso-state-sync-after-login |
1 | Timeout at ~10.8s |
Common Characteristics
- All failing tests use
wait_for_state: 30sfor polling state transitions - All involve service lifecycle transitions (start/stop/reconnect)
- Tests pass locally in isolation but fail under high parallelism (50 workers)
- Timeout pattern: Tests hit exactly 30s+ which matches the
wait_for_statetimeout
Root Cause Hypotheses
- Resource contention under high parallelism: 50 parallel muster instances competing for CPU/ports
- State transition timing: Service state changes may be slower under CI load
- Mock server startup delays: HTTP/SSE mock servers may take longer to initialize under contention
- Polling interval too slow: 1s poll interval with 30s timeout = only 30 attempts
Evidence
Recent CI failures:
- https://github.com/giantswarm/muster/actions/runs/21341791094 -
mcpserver-streamable-http-tool-call-lifecycle - https://github.com/giantswarm/muster/actions/runs/21341603315 -
oauth-sso-state-sync-after-login - https://github.com/giantswarm/muster/actions/runs/21341509493 -
mcpserver-tool-call-lifecycle - https://github.com/giantswarm/muster/actions/runs/21340267184 -
mcpserver-tool-call-lifecycle
Proposed Investigation
Phase 1: Reliable Reproduction
- Create stress test script to run N iterations locally with 50 parallel workers
- Establish baseline failure rate locally vs CI
- Identify if failures are random or specific to certain scenarios
Phase 2: Instrumentation
- Add timing diagnostics to
muster_manager.go(port allocation, process startup, mock server init) - Add timing diagnostics to
test_runner.go(wait_for_statepolling, state transitions) - Collect detailed logs from failed CI runs
Phase 3: Fixes to Consider
- Increase
wait_for_statetimeout from 30s to 60s for lifecycle tests - Implement exponential backoff in state polling (start at 500ms, backoff to 5s)
- Add startup jitter to prevent thundering herd in parallel execution
- Consider reducing CI parallelism from 50 to 25
Acceptance Criteria
- Test suite failure rate in CI < 1% over 20+ consecutive runs
- No changes to test behavior or coverage
- Root cause documented for future reference
Related
- Test framework:
internal/testing/test_runner.go - Instance manager:
internal/testing/muster_manager.go - CI workflow:
.github/workflows/ci.yaml
Reactions are currently unavailable