-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Fix gRPC subscription retry count to reset on reconnection (frontport of #5380) #5402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix gRPC subscription retry count to reset on reconnection (frontport of #5380) #5402
Conversation
## Motivation Long-lived gRPC subscription streams would silently terminate after accumulating `max_retries` errors over time, even when each individual reconnection was successful. This caused workers to stop receiving notifications and go silent. ## Proposal Change `retry_count` from a local variable to a shared `Arc<AtomicU32>` that can be reset from within the `unfold` closure when `client.subscribe()` succeeds. Previously, `retry_count` only reset when a successful notification was received. Now it also resets when a reconnection succeeds, so `max_retries` applies to consecutive failed reconnection attempts rather than total errors over the stream's lifetime. ## Test Plan - CI - Manual testing with PM benchmark that was experiencing the issue
This stack of pull requests is managed by Graphite. Learn more about stacking. |
| Ok(response) => { | ||
| // Reset retry count on successful reconnection. | ||
| retry_count.store(0, Ordering::Relaxed); | ||
| trace!("Successfully reconnected subscription stream"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it say "reconnected" now also on the first attempt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, AFAIU the first attempt is when we initially create the stream on L299. This is just for reconnections
| Ok(response) => future::Either::Right(response.into_inner()), | ||
| Ok(response) => { | ||
| // Reset retry count on successful reconnection. | ||
| retry_count.store(0, Ordering::Relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the store below already take care of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not if it's an idle stream! And I think this was the bug. If we have a busy stream with notifications coming in all the time, then what we were doing below was probably enough. But if the stream is idle, we can reconnect even if there's no new notifications flowing through. We also need to make sure we zero the retry count in that case.

Motivation
Long-lived gRPC subscription streams would silently terminate after
accumulating
max_retrieserrors over time, evenwhen each individual reconnection was successful. This caused workers to
stop receiving notifications and go silent.
Proposal
Change
retry_countfrom a local variable to a sharedArc<AtomicU32>that can be reset from within the
unfoldclosure when
client.subscribe()succeeds.Previously,
retry_countonly reset when a successful notification wasreceived. Now it also resets when a
reconnection succeeds, so
max_retriesapplies to consecutive failedreconnection attempts rather than total errors
over the stream's lifetime.
Test Plan