
Head monitor: per-stream recovery and restart on candidate update#8838

Open
jimmygchen wants to merge 6 commits into sigp:unstable from jimmygchen:feat-head-monitor-resilience

Conversation

@jimmygchen
Member

@jimmygchen jimmygchen commented Feb 17, 2026

Description

Adds per-stream recovery to the head monitor. Previously, when a beacon node's SSE stream disconnected, select_all would silently drop it with no way to reconnect — that BN was lost for the lifetime of the monitor. Now disconnected streams are tracked and retried once per slot, giving up only after 5 consecutive cycles with all BNs unreachable.
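The give-up condition can be sketched with a simple counter that resets whenever any BN is reachable and errors out after 5 all-unreachable cycles. This is a minimal std-only sketch; the type and method names are illustrative, not the PR's actual identifiers:

```rust
// Illustrative sketch of the consecutive-failure limit described above.
// Names (RetryState, on_retry_cycle) are hypothetical, not from the PR.
const MAX_CONSECUTIVE_FAILURES: usize = 5;

struct RetryState {
    consecutive_failures: usize,
}

impl RetryState {
    // Called once per retry cycle (one cycle per slot).
    fn on_retry_cycle(&mut self, any_bn_reachable: bool) -> Result<(), &'static str> {
        if any_bn_reachable {
            // Any reachable BN resets the counter.
            self.consecutive_failures = 0;
            Ok(())
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
                Err("all beacon nodes unreachable for 5 consecutive cycles")
            } else {
                Ok(())
            }
        }
    }
}

fn main() {
    let mut state = RetryState { consecutive_failures: 0 };
    // Four all-unreachable cycles are still tolerated.
    for _ in 0..4 {
        assert!(state.on_retry_cycle(false).is_ok());
    }
    // A single cycle with a reachable BN resets the counter.
    assert!(state.on_retry_cycle(true).is_ok());
    assert_eq!(state.consecutive_failures, 0);
    // Only the fifth consecutive all-unreachable cycle gives up.
    for _ in 0..4 {
        assert!(state.on_retry_cycle(false).is_ok());
    }
    assert!(state.on_retry_cycle(false).is_err());
}
```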

Also adds a Notify-based restart when update_candidates_list is called, so stale SSE connections are replaced with ones targeting the updated candidate list.

Before:  BN1 ───────────────────►       After:  BN1 ───────────────────►
         BN2 ─x (lost forever)                  BN2 ─x (retry next slot)
         BN3 ───────────────────►               BN3 ───────────────────►

New metrics: vc_head_monitor_stream_disconnections_total, vc_head_monitor_stream_reconnections_total, vc_head_monitor_restarts_total.

Closes #8741

Additional Info

Cached heads are removed on disconnect to keep is_latest accurate. The SelectAll poll is gated on !is_empty() to avoid busy-looping when all streams have ended.

Refactor poll_head_event_from_beacon_nodes to handle individual stream
failures without tearing down all streams. Failed streams are retried
periodically (once per slot) while healthy streams continue operating.
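The sentinel approach can be illustrated with a std-only analogue (real streams would use the futures crate; the names and the iterator stand-in here are hypothetical): each per-BN stream is wrapped so that when it ends it yields one final marker identifying which candidate died, instead of vanishing silently from the merged set.

```rust
// Std-only sketch of the StreamEnded sentinel idea, using an iterator as a
// stand-in for an SSE stream. Names are illustrative, not from the PR.
#[derive(Debug, PartialEq)]
enum HeadItem {
    Event(u64),         // a head event (simplified to a slot number)
    StreamEnded(usize), // sentinel carrying the failed candidate's index
}

// Wrap a finite event source so it ends with an identifying sentinel.
fn with_sentinel(candidate_index: usize, events: Vec<u64>) -> impl Iterator<Item = HeadItem> {
    events
        .into_iter()
        .map(HeadItem::Event)
        .chain(std::iter::once(HeadItem::StreamEnded(candidate_index)))
}

fn main() {
    let items: Vec<_> = with_sentinel(2, vec![100, 101]).collect();
    // The consumer sees which stream failed and can schedule its retry
    // while the other streams keep operating.
    assert_eq!(items.last(), Some(&HeadItem::StreamEnded(2)));
}
```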

Add a restart signal (tokio::sync::Notify) that fires when the candidate
list is updated via the API, causing the head monitor to cleanly restart
with the new candidate set.
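The Notify handshake has permit semantics: a notification sent before the monitor waits is not lost. A std-only analogue using Condvar (the actual PR uses tokio::sync::Notify; this struct and its names are hypothetical) shows the shape:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical std-only analogue of the tokio::sync::Notify pattern:
// update_candidates_list() stores a permit and wakes the monitor loop,
// which then restarts cleanly against the new candidate set.
struct RestartSignal {
    notified: Mutex<bool>,
    cv: Condvar,
}

impl RestartSignal {
    fn new() -> Self {
        Self { notified: Mutex::new(false), cv: Condvar::new() }
    }

    // Called by the candidate-list updater.
    fn notify(&self) {
        *self.notified.lock().unwrap() = true;
        self.cv.notify_one();
    }

    // Awaited by the head monitor loop; consumes the stored permit,
    // so a notify that arrives before the wait is not lost.
    fn wait(&self) {
        let mut n = self.notified.lock().unwrap();
        while !*n {
            n = self.cv.wait(n).unwrap();
        }
        *n = false;
    }
}

fn main() {
    let sig = Arc::new(RestartSignal::new());
    let s2 = Arc::clone(&sig);
    let monitor = thread::spawn(move || {
        s2.wait();
        "restarted"
    });
    sig.notify();
    assert_eq!(monitor.join().unwrap(), "restarted");
}
```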

Key changes:
- Per-stream error handling: individual BN disconnects no longer kill all
streams. A StreamEnded sentinel detects which stream failed.
- Periodic retry: failed streams are reconnected every slot duration.
- Restart signal: update_candidates_list() triggers Notify; the head monitor
returns Ok(()) for a clean restart after a brief delay.
- Consecutive failure limit: after 5 consecutive retries with all BNs
unreachable, returns an error to trigger the next-slot retry.
- Differentiated restart loop: Ok(()) gets 100ms delay, Err gets
next-slot delay.
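The differentiated restart loop reduces to choosing a delay from the monitor's exit result. A minimal sketch (function name and the 12s slot in the example are illustrative; the 100ms and next-slot values are from the description above):

```rust
use std::time::Duration;

// Sketch of the restart-delay selection: a clean exit (Ok, e.g. after a
// candidate-list update) restarts almost immediately, while an error waits
// for the next slot. The function name is hypothetical.
fn restart_delay(result: Result<(), ()>, slot_duration: Duration) -> Duration {
    match result {
        Ok(()) => Duration::from_millis(100),
        Err(()) => slot_duration,
    }
}

fn main() {
    let slot = Duration::from_secs(12); // mainnet slot duration, for illustration
    assert_eq!(restart_delay(Ok(()), slot), Duration::from_millis(100));
    assert_eq!(restart_delay(Err(()), slot), slot);
}
```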

Prevent CPU spinning when SelectAll is empty by gating the stream branch
with a precondition. Without this, an empty SelectAll returns
Poll::Ready(None) on every select! iteration.

Also clean up failed_indices entries for candidates that have been removed
from the candidate list, preventing unbounded growth and false positives
in the consecutive-failure detection.
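The cleanup can be expressed as a single retain over the bookkeeping map. A std-only sketch (the map layout and function name are hypothetical, not the PR's actual structures):

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the failed_indices cleanup: drop entries for candidates that
// are no longer in the candidate list, so the map cannot grow without
// bound across repeated list updates. Identifiers are illustrative.
fn prune_failed_indices(
    failed_indices: &mut HashMap<usize, u32>, // candidate index -> failure count
    current_candidates: &HashSet<usize>,
) {
    failed_indices.retain(|idx, _| current_candidates.contains(idx));
}

fn main() {
    let mut failed: HashMap<usize, u32> = HashMap::from([(0, 2), (3, 5), (7, 1)]);
    // Candidate 7 was removed from the list; its stale entry is pruned.
    let current: HashSet<usize> = HashSet::from([0, 1, 2, 3]);
    prune_failed_indices(&mut failed, &current);
    assert_eq!(failed.len(), 2);
    assert!(!failed.contains_key(&7));
}
```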
@jimmygchen jimmygchen force-pushed the feat-head-monitor-resilience branch from 7d4a4c0 to 990af26 on February 20, 2026 03:46
@jimmygchen jimmygchen changed the title Head monitor: per-stream resilience and restart on candidate update Head monitor: per-stream recovery and restart on candidate update Feb 20, 2026
@jimmygchen jimmygchen added val-client Relates to the validator client binary ready-for-review The code is ready for review labels Feb 20, 2026
@jimmygchen jimmygchen marked this pull request as ready for review February 20, 2026 06:51
@jimmygchen jimmygchen removed the ready-for-review The code is ready for review label Feb 20, 2026
Comment on lines 556 to 558
if let Some(cache) = &self.beacon_head_cache {
cache.purge_cache().await;
}
Member


We can probably remove this purge_cache since we already purge during the notify.

Comment on lines +224 to +229
Ok(event) => {
warn!(event_kind = event.topic_name(), candidate_index, "Unexpected event from BN");
}
Err(e) => {
warn!(error = ?e, node_index = candidate_index, "Head event stream error");
}
Member


Both of these cases would be pretty unusual to get, do you think it might be worth restarting the stream in these cases?

Although in a case where an SSE endpoint is sending malformed responses, it's not obvious if we should keep retrying the stream vs spamming warn logs.

The benefit of retrying is that we would only ever get 1 warn log per slot, rather than getting 1 warn per event.

