Fix TestDelayedEvents/delayed_state_events_are_kept_on_server_restart to account for slow servers (de-flake)#830
Merged
MadLittleMods merged 5 commits intomainfrom Dec 22, 2025
Conversation
1 task
MadLittleMods
commented
Dec 18, 2025
|
|
||
| t.Run("delayed events are empty on startup", func(t *testing.T) { | ||
| matchDelayedEvents(t, user, 0) | ||
| matchDelayedEvents(t, user, delayedEventsNumberEqual(0)) |
Collaborator
Author
There was a problem hiding this comment.
Because I needed some less than/greater than comparisons in the refactored test logic, I decided to make matchDelayedEvents like the MustSyncUntil where we can pass in a bevy of check opts.
MadLittleMods
commented
Dec 18, 2025
Comment on lines
+592
to
+597
| _, err := should.MatchResponse(res, match.HTTPResponse{ | ||
| StatusCode: 200, | ||
| JSON: []match.JSON{ | ||
| match.JSONKeyArrayOfSize("delayed_events", wantNumber), | ||
| }, | ||
| }) |
Collaborator
Author
There was a problem hiding this comment.
Ideally, the other delayedEventsNumberGreaterThan/delayedEventsNumberLessThan checks would also use the JSON matching like we do here so we get maximum nice debug output when things go wrong. But there isn't a matcher for that yet. We could add one but I've let it be for now. Things fit nice enough as-is.
TestDelayedEvents/delayed_state_events_are_kept_on_server_restart to account for slow serversTestDelayedEvents/delayed_state_events_are_kept_on_server_restart to account for slow servers (de-flake)
Conflicts: tests/msc4140/delayed_event_test.go
devonh
approved these changes
Dec 22, 2025
Contributor
devonh
left a comment
There was a problem hiding this comment.
This looks well thought through.
Thanks for taking this on to reduce the overall flakiness of tests.
Co-authored-by: Devon Hudson <devon.dmytro@gmail.com>
Collaborator
Author
|
Thanks for the review @devonh 🐳 |
24 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #829
Part of element-hq/synapse#18537
What does this PR do?
Fix
TestDelayedEvents/delayed_state_events_are_kept_on_server_restartto account for slow servers (de-flake).Previously, this was naively using a single delayed event with a 10 second delay. But because we're stopping and starting servers here, it could take up
deployment.GetConfig().SpawnHSTimeout(defaults to 30 seconds) for the server to start up again so by the time the server is back up, the delayed event may have already been sent, invalidating our assertions below (which expect some delayed events to still be pending and then see one of them be sent after the server is back up).We could account for this by setting the delayed event delay to be longer than
deployment.GetConfig().SpawnHSTimeoutbut that would make the test suite take longer to run in all cases even for homeservers that are quick to restart because we have to wait for that large delay.We instead account for this by scheduling many delayed events at short intervals (we chose 10 seconds because that's what the test naively chose before). Then whenever the servers comes back, we can just check until it decrements by 1.
Experiencing the flaky failure
As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, https://github.com/element-hq/synapse-rust-apps/pull/344#discussion_r2620714306. We probably experience this heavily in the private project because GitHub runners are less than half as powerful as those for public projects and that single container with a share of the 2 CPU cores available is just not powerful enough to run all 16 workers effectively.
For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.
And for the same slow reasons, why we're also experiencing this as an occasional flake with
(workers, postgres)in the public Synapse CI as well.Reproduction instructions
I can easily reproduce this problem if I use #827 to limit the number of CPU's available for the homeserver containers to use:
COMPLEMENT_CONTAINER_CPUS=0.5Pull Request Checklist