chore(consensus): read EL state with a backoff and emit useful syncing info on failure #2494
Conversation
Summary
This PR enhances retry logic and observability for validator config reads:
Changes:
- Replaces fixed 1-second retry delay with linear backoff (1s to 30s)
- Adds network sync status and block height telemetry to help diagnose failures
- Removes escalation from `warn!` to `error!` after 10 attempts (now always `warn!`)
- Adds utility functions `display_result` and `display_option` for cleaner telemetry formatting
Additional Notes:
- The new telemetry fields (`is_syncing`, `best_block`, `target_block`, `blocks_behind`) provide valuable context for debugging validator config read failures, especially during node sync
- The `display_result` and `display_option` utilities are well implemented, with proper error chain handling (see the sketch below)
- Removing the warn-to-error escalation reduces log severity for persistent failures; consider whether this aligns with alerting requirements
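For illustration, a minimal sketch of what such helpers could look like; the names come from the summary above, but the actual implementations in `tempo_telemetry_util` may differ:

```rust
use std::fmt::Display;

/// Format an `Ok` value directly; for an `Err`, flatten the error and its
/// source chain into a single line. (Illustrative sketch only.)
pub fn display_result<T: Display, E: std::error::Error>(result: &Result<T, E>) -> String {
    match result {
        Ok(value) => value.to_string(),
        Err(err) => {
            let mut msg = err.to_string();
            let mut source = err.source();
            while let Some(cause) = source {
                msg.push_str(": ");
                msg.push_str(&cause.to_string());
                source = cause.source();
            }
            msg
        }
    }
}

/// Format `Some(value)` as the value itself and `None` as a placeholder.
pub fn display_option<T: Display>(option: &Option<T>) -> String {
    match option {
        Some(value) => value.to_string(),
        None => "<none>".to_owned(),
    }
}
```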
best_block = %tempo_telemetry_util::display_result(&best_block),
target_block,
blocks_behind = %tempo_telemetry_util::display_option(&blocks_behind),
"reading validator config from contract failed; will retry",
i feel like this might scare operators, maybe we can add to the msg that this is expected when we're behind?
even though is_syncing is in there, i feel like it might be missed
agreed, or separating the two cases into different logs
Well, right now it emits errors for hours, so this is an improvement.
WARN is appropriate because the failure is being handled, yet it still constitutes a hiccup in normal operation and should be looked into.
We already have a log for this (from reth), and it is easily missed. That's why this info is added here.
Separating this into separate events feels incorrect because this is just one event.
This patch implements a backoff, ramping from 1s up to 30s, when reading the validator config from execution layer state at the end of an epoch. Spamming this every second was excessive, especially while a node was syncing.
The event level was also lowered to WARN; after all, the node is actively trying to recover.
Finally, the event includes extra information about the execution layer, which can help explain why reading its state is currently failing.
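A rough sketch of the retry shape described here, assuming a `tracing`-based warning; the types, helper names, and exact field set below are placeholders rather than the actual code:

```rust
use std::{thread, time::Duration};
use tracing::warn;

// Placeholder for the execution layer's sync report (illustration only).
struct SyncStatus {
    is_syncing: bool,
    best_block: u64,
    target_block: u64,
}

// Retry a config read, backing off linearly from 1s up to a 30s cap, and
// attach execution-layer sync info to every warning so operators can tell
// whether the node is simply still syncing.
fn read_with_backoff<T, E: std::fmt::Display>(
    mut read: impl FnMut() -> Result<T, E>,
    sync_status: impl Fn() -> SyncStatus,
) -> T {
    let mut attempt: u64 = 0;
    loop {
        match read() {
            Ok(value) => return value,
            Err(err) => {
                attempt += 1;
                let sync = sync_status();
                warn!(
                    error = %err,
                    is_syncing = sync.is_syncing,
                    best_block = sync.best_block,
                    target_block = sync.target_block,
                    blocks_behind = sync.target_block.saturating_sub(sync.best_block),
                    "reading validator config from contract failed; will retry",
                );
                // Linear backoff: 1s, 2s, 3s, ... capped at 30s.
                thread::sleep(Duration::from_secs(attempt.min(30)));
            }
        }
    }
}
```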