
Head monitor: per-stream recovery and restart on candidate update#8838

Open
jimmygchen wants to merge 6 commits into sigp:unstable from jimmygchen:feat-head-monitor-resilience

Conversation

@jimmygchen
Member

@jimmygchen jimmygchen commented Feb 17, 2026

Description

Adds per-stream recovery to the head monitor. Previously, when a beacon node's SSE stream disconnected, select_all would silently drop it with no way to reconnect — that BN was lost for the lifetime of the monitor. Now disconnected streams are tracked and retried once per slot, giving up only after 5 consecutive cycles with all BNs unreachable.
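The give-up condition can be sketched with a simple counter that resets whenever any BN is reachable and errors out after 5 all-unreachable cycles. This is a minimal std-only sketch; the type and method names are illustrative, not the PR's actual identifiers:

```rust
// Illustrative sketch of the consecutive-failure limit described above.
// Names (RetryState, on_retry_cycle) are hypothetical, not from the PR.
const MAX_CONSECUTIVE_FAILURES: usize = 5;

struct RetryState {
    consecutive_failures: usize,
}

impl RetryState {
    // Called once per retry cycle (one cycle per slot).
    fn on_retry_cycle(&mut self, any_bn_reachable: bool) -> Result<(), &'static str> {
        if any_bn_reachable {
            // Any reachable BN resets the counter.
            self.consecutive_failures = 0;
            Ok(())
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
                Err("all beacon nodes unreachable for 5 consecutive cycles")
            } else {
                Ok(())
            }
        }
    }
}

fn main() {
    let mut state = RetryState { consecutive_failures: 0 };
    // Four all-unreachable cycles are still tolerated.
    for _ in 0..4 {
        assert!(state.on_retry_cycle(false).is_ok());
    }
    // A single cycle with a reachable BN resets the counter.
    assert!(state.on_retry_cycle(true).is_ok());
    assert_eq!(state.consecutive_failures, 0);
    // Only the fifth consecutive all-unreachable cycle gives up.
    for _ in 0..4 {
        assert!(state.on_retry_cycle(false).is_ok());
    }
    assert!(state.on_retry_cycle(false).is_err());
}
```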

Also adds a Notify-based restart when update_candidates_list is called, so stale SSE connections are replaced with ones targeting the updated candidate list.

Before:  BN1 ───────────────────►       After:  BN1 ───────────────────►
         BN2 ─x (lost forever)                  BN2 ─x (retry next slot)
         BN3 ───────────────────►               BN3 ───────────────────►

New metrics: vc_head_monitor_stream_disconnections_total, vc_head_monitor_stream_reconnections_total, vc_head_monitor_restarts_total.

Closes #8741

Additional Info

Cached heads are removed on disconnect to keep is_latest accurate. The SelectAll poll is gated on !is_empty() to avoid busy-looping when all streams have ended.

Refactor poll_head_event_from_beacon_nodes to handle individual stream
failures without tearing down all streams. Failed streams are retried
periodically (once per slot) while healthy streams continue operating.
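The sentinel approach can be illustrated with a std-only analogue (real streams would use the futures crate; the names and the iterator stand-in here are hypothetical): each per-BN stream is wrapped so that when it ends it yields one final marker identifying which candidate died, instead of vanishing silently from the merged set.

```rust
// Std-only sketch of the StreamEnded sentinel idea, using an iterator as a
// stand-in for an SSE stream. Names are illustrative, not from the PR.
#[derive(Debug, PartialEq)]
enum HeadItem {
    Event(u64),         // a head event (simplified to a slot number)
    StreamEnded(usize), // sentinel carrying the failed candidate's index
}

// Wrap a finite event source so it ends with an identifying sentinel.
fn with_sentinel(candidate_index: usize, events: Vec<u64>) -> impl Iterator<Item = HeadItem> {
    events
        .into_iter()
        .map(HeadItem::Event)
        .chain(std::iter::once(HeadItem::StreamEnded(candidate_index)))
}

fn main() {
    let items: Vec<_> = with_sentinel(2, vec![100, 101]).collect();
    // The consumer sees which stream failed and can schedule its retry
    // while the other streams keep operating.
    assert_eq!(items.last(), Some(&HeadItem::StreamEnded(2)));
}
```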

Add a restart signal (tokio::sync::Notify) that fires when the candidate
list is updated via the API, causing the head monitor to cleanly restart
with the new candidate set.
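The Notify handshake has permit semantics: a notification sent before the monitor waits is not lost. A std-only analogue using Condvar (the actual PR uses tokio::sync::Notify; this struct and its names are hypothetical) shows the shape:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical std-only analogue of the tokio::sync::Notify pattern:
// update_candidates_list() stores a permit and wakes the monitor loop,
// which then restarts cleanly against the new candidate set.
struct RestartSignal {
    notified: Mutex<bool>,
    cv: Condvar,
}

impl RestartSignal {
    fn new() -> Self {
        Self { notified: Mutex::new(false), cv: Condvar::new() }
    }

    // Called by the candidate-list updater.
    fn notify(&self) {
        *self.notified.lock().unwrap() = true;
        self.cv.notify_one();
    }

    // Awaited by the head monitor loop; consumes the stored permit,
    // so a notify that arrives before the wait is not lost.
    fn wait(&self) {
        let mut n = self.notified.lock().unwrap();
        while !*n {
            n = self.cv.wait(n).unwrap();
        }
        *n = false;
    }
}

fn main() {
    let sig = Arc::new(RestartSignal::new());
    let s2 = Arc::clone(&sig);
    let monitor = thread::spawn(move || {
        s2.wait();
        "restarted"
    });
    sig.notify();
    assert_eq!(monitor.join().unwrap(), "restarted");
}
```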

Key changes:
- Per-stream error handling: individual BN disconnects no longer kill all
streams. A StreamEnded sentinel detects which stream failed.
- Periodic retry: failed streams are reconnected every slot duration.
- Restart signal: update_candidates_list() triggers Notify; the head monitor
returns Ok(()) for a clean restart after a brief delay.
- Consecutive failure limit: after 5 consecutive retries with all BNs
unreachable, returns an error to trigger the next-slot retry.
- Differentiated restart loop: Ok(()) gets 100ms delay, Err gets
next-slot delay.
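The differentiated restart loop reduces to choosing a delay from the monitor's exit result. A minimal sketch (function name and the 12s slot in the example are illustrative; the 100ms and next-slot values are from the description above):

```rust
use std::time::Duration;

// Sketch of the restart-delay selection: a clean exit (Ok, e.g. after a
// candidate-list update) restarts almost immediately, while an error waits
// for the next slot. The function name is hypothetical.
fn restart_delay(result: Result<(), ()>, slot_duration: Duration) -> Duration {
    match result {
        Ok(()) => Duration::from_millis(100),
        Err(()) => slot_duration,
    }
}

fn main() {
    let slot = Duration::from_secs(12); // mainnet slot duration, for illustration
    assert_eq!(restart_delay(Ok(()), slot), Duration::from_millis(100));
    assert_eq!(restart_delay(Err(()), slot), slot);
}
```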

Prevent CPU spinning when SelectAll is empty by gating the stream branch
with a precondition. Without this, an empty SelectAll returns
Poll::Ready(None) on every select! iteration.

Also clean up failed_indices entries for candidates that have been removed
from the candidate list, preventing unbounded growth and false positives
in the consecutive-failure detection.
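The cleanup can be expressed as a single retain over the bookkeeping map. A std-only sketch (the map layout and function name are hypothetical, not the PR's actual structures):

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the failed_indices cleanup: drop entries for candidates that
// are no longer in the candidate list, so the map cannot grow without
// bound across repeated list updates. Identifiers are illustrative.
fn prune_failed_indices(
    failed_indices: &mut HashMap<usize, u32>, // candidate index -> failure count
    current_candidates: &HashSet<usize>,
) {
    failed_indices.retain(|idx, _| current_candidates.contains(idx));
}

fn main() {
    let mut failed: HashMap<usize, u32> = HashMap::from([(0, 2), (3, 5), (7, 1)]);
    // Candidate 7 was removed from the list; its stale entry is pruned.
    let current: HashSet<usize> = HashSet::from([0, 1, 2, 3]);
    prune_failed_indices(&mut failed, &current);
    assert_eq!(failed.len(), 2);
    assert!(!failed.contains_key(&7));
}
```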
@jimmygchen jimmygchen force-pushed the feat-head-monitor-resilience branch from 7d4a4c0 to 990af26 on February 20, 2026 03:46
@jimmygchen jimmygchen changed the title Head monitor: per-stream resilience and restart on candidate update Head monitor: per-stream recovery and restart on candidate update Feb 20, 2026
@jimmygchen jimmygchen added val-client Relates to the validator client binary ready-for-review The code is ready for review labels Feb 20, 2026
@jimmygchen jimmygchen marked this pull request as ready for review February 20, 2026 06:51
@jimmygchen jimmygchen removed the ready-for-review The code is ready for review label Feb 20, 2026
Comment on lines 556 to 558
if let Some(cache) = &self.beacon_head_cache {
cache.purge_cache().await;
}
Member


We can probably remove this purge_cache since we already purge during the notify.

Comment on lines +224 to +229
Ok(event) => {
warn!(event_kind = event.topic_name(), candidate_index, "Unexpected event from BN");
}
Err(e) => {
warn!(error = ?e, node_index = candidate_index, "Head event stream error");
}
Member


Both of these cases would be pretty unusual to get, do you think it might be worth restarting the stream in these cases?

Although in a case where an SSE endpoint is sending malformed responses, it's not obvious if we should keep retrying the stream vs spamming warn logs.

The benefit of retrying is that we would only ever get 1 warn log per slot, rather than getting 1 warn per event.

