Head monitor: per-stream recovery and restart on candidate update#8838
Open
jimmygchen wants to merge 6 commits intosigp:unstablefrom
Open
Head monitor: per-stream recovery and restart on candidate update#8838jimmygchen wants to merge 6 commits intosigp:unstablefrom
jimmygchen wants to merge 6 commits intosigp:unstablefrom
Conversation
Refactor poll_head_event_from_beacon_nodes to handle individual stream failures without tearing down all streams. Failed streams are retried periodically (once per slot) while healthy streams continue operating. Add a restart signal (tokio::sync::Notify) that fires when the candidate list is updated via the API, causing the head monitor to cleanly restart with the new candidate set. Key changes: - Per-stream error handling: individual BN disconnects no longer kill all streams. A StreamEnded sentinel detects which stream failed. - Periodic retry: failed streams are reconnected every slot duration. - Restart signal: update_candidates_list() triggers Notify, head monitor returns Ok(()) for a clean restart with brief delay. - Consecutive failure limit: after 5 consecutive retries with all BNs unreachable, returns an error to trigger the next-slot retry. - Differentiated restart loop: Ok(()) gets 100ms delay, Err gets next-slot delay.
Prevent CPU spinning when SelectAll is empty by gating the stream branch with a precondition. Without this, an empty SelectAll returns Poll::Ready(None) on every select! iteration. Also clean up failed_indices entries for candidates that have been removed from the candidate list, preventing unbounded growth and false positives in the consecutive-failure detection.
7d4a4c0 to
990af26
Compare
macladson
reviewed
Feb 23, 2026
Comment on lines
556
to
558
| if let Some(cache) = &self.beacon_head_cache { | ||
| cache.purge_cache().await; | ||
| } |
Member
There was a problem hiding this comment.
We can probably remove this purge_cache since we are purging already during the notify.
Comment on lines
+224
to
+229
| Ok(event) => { | ||
| warn!(event_kind = event.topic_name(), candidate_index, "Unexpected event from BN"); | ||
| } | ||
| Err(e) => { | ||
| warn!(error = ?e, node_index = candidate_index, "Head event stream error"); | ||
| } |
Member
There was a problem hiding this comment.
Both of these cases would be pretty unusual to get, do you think it might be worth restarting the stream in these cases?
Although in a case where an SSE endpoint is sending malformed responses, it's not obvious if we should keep retrying the stream vs spamming warn logs.
The benefit of retrying is that we would only ever get 1 warn log per slot, rather than getting 1 warn per event.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Adds per-stream recovery to the head monitor. Previously, when a beacon node's SSE stream disconnected,
select_allwould silently drop it with no way to reconnect — that BN was lost for the lifetime of the monitor. Now disconnected streams are tracked and retried once per slot, giving up only after 5 consecutive cycles with all BNs unreachable.Also adds a
Notify-based restart whenupdate_candidates_listis called, so stale SSE connections are replaced with ones targeting the updated candidate list.New metrics:
vc_head_monitor_stream_disconnections_total,vc_head_monitor_stream_reconnections_total,vc_head_monitor_restarts_total.Closes #8741
Additional Info
Cached heads are removed on disconnect to keep
is_latestaccurate. TheSelectAllpoll is gated on!is_empty()to avoid busy-looping when all streams have ended.