Anchor Pre-PR: unstable, release-v7.0.0 and modularize-validator-store#7127
Merged
jimmygchen merged 49 commits intosigp:anchorfrom Mar 13, 2025
Merged
Anchor Pre-PR: unstable, release-v7.0.0 and modularize-validator-store#7127jimmygchen merged 49 commits intosigp:anchorfrom
unstable, release-v7.0.0 and modularize-validator-store#7127jimmygchen merged 49 commits intosigp:anchorfrom
Conversation
Currently we have very poor coverage of range sync with unit tests. With the event driven test framework we could cover much more ground and be confident when modifying the code. Add two basic cases: - Happy path, complete a finalized sync for 2 epochs - Post-PeerDAS case where we start without enough custody peers and later we find enough⚠️ If you have ideas for more test cases, please let me know! I'll write them
PR to fix cargo audit failure and bump `hickory-proto` version: https://github.com/sigp/lighthouse/actions/runs/13252660830/job/36993664960
Address misc PeerDAS TODOs that are not too big for a dedicated PR I'll justify each TODO on an inlined comment
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is: - `sync_peers_per_column_subnet`: Track health of overall subnets in your node - `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot - ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~ - ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
- PR sigp#6497 made obsolete some consistency checks inside the batch I forgot to remove the consumers of those errors Remove un-used batch sync error condition, which was a nested `Result<_, Result<_, E>>`
Addresses sigp#6854. PeerDAS requires unsubscribing a Gossip topic at a fork boundary. This is not possible with our current topic machinery. Instead of defining which topics have to be **added** at a given fork, we define the complete set of topics at a given fork. The new start of the show and key function is: ```rust pub fn core_topics_to_subscribe<E: EthSpec>( fork_name: ForkName, opts: &TopicConfig, spec: &ChainSpec, ) -> Vec<GossipKind> { // ... if fork_name.deneb_enabled() && !fork_name.fulu_enabled() { // All of deneb blob topics are core topics for i in 0..spec.blob_sidecar_subnet_count(fork_name) { topics.push(GossipKind::BlobSidecar(i)); } } // ... } ``` `core_topics_to_subscribe` only returns the blob topics if `fork < Fulu`. Then at the fork boundary, we subscribe with the new fork digest to `core_topics_to_subscribe(next_fork)`, which excludes the blob topics. I added `is_fork_non_core_topic` to carry on to the next fork the aggregator topics for attestations and sync committee messages. This approach is future-proof if those topics ever become fork-dependent. Closes sigp#6854
Our Holesky nodes running with the light client enabled were logging messages about full queues: > Feb 12 22:09:28.949 ERRO Work queue is full queue: unknown_light_client_optimistic_update, queue_len: 128, msg: the system has insufficient resources for load, service: bproc I thought this might be genuine overload, but it turns out this queue was never being read from! - [x] Rename light-client related queues in the beacon processor for clarity. - [x] Ensure all light-client related queues are being popped from.
Fix a regression introduced in this PR: - sigp#6361 We were indexing into the `MerkleTree` with raw generalized indices, which was incorrect and triggering `debug_assert` failures, as described here: - sigp#7005 - Convert `generalized_index` to the correct leaf index prior to proof generation. - Add sanity checks on indices used in `BeaconState::generate_proof`. - Remove debug asserts from `MerkleTree::generate_proof` in favour of actual errors. This would have caught the bug earlier. - Refactor the EF tests so that the merkle validity tests are actually run. They were misconfigured in a way that resulted in them running silently with 0 test cases, and the `check_all_files_accessed.py` script still had an ignore that covered the test files, so this omission wasn't detected.
We don't need to store `BehaviourAction` for `ready_requests` and therefore avoid having an `unreachable!` on sigp#6625. Therefore this PR should be merged before it
N/A 2 changes: 1. Replace Option::map_or(true, ...) with is_none_or(...) 2. Remove unnecessary `Into::into` blocks where the type conversion is apparent from the types
Resolves sigp#7000 Set the accept header on builder to the correct value when requesting ssz. This PR also adds a flag to disable ssz over the builder api altogether. In the case that builders/relays have an ssz bug, we can react quickly by asking clients to restart their nodes with the `--disable-ssz-builder` flag to force json. I'm not fully convinced if this is useful so open to removing it or opening another PR for it. Testing this currently.
Have blobs by default in deneb runs of kurtosis
I keep being notified for PR's like sigp#7009 where it doesn't touch the specified directories in the `CODEOWNERS` file. After reading the [docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners) not having a forward slash in beginning of the path means: > In this example, @octocat owns any file in an apps directory > anywhere in your repository. whereas with the slashes: > In this example, @doctocat owns any file in the `/docs` > directory in the root of your repository and any of its > subdirectories. this update makes it more rigid for the files in the jxs ownership
PeerDAS has undergone multiple refactors + the blending with the get_blobs optimization has generated technical debt. A function signature like this https://github.com/sigp/lighthouse/blob/f008b84079bbb6eb86de22bb3421dfc8263a5650/beacon_node/beacon_chain/src/beacon_chain.rs#L7171-L7178 Allows at least the following combination of states: - blobs: Some / None - data_columns: Some / None - data_column_recv: Some / None - Block has data? Yes / No - Block post-PeerDAS? Yes / No In reality, we don't have that many possible states, only: - `NoData`: pre-deneb, pre-PeerDAS with 0 blobs or post-PeerDAS with 0 blobs - `Blobs(BlobSidecarList<E>)`: post-Deneb pre-PeerDAS with > 0 blobs - `DataColumns(DataColumnSidecarList<E>)`: post-PeerDAS with > 0 blobs - `DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList<E>>)`: post-PeerDAS with > 0 blobs, but we obtained the columns via reconstruction ^ this are the variants of the new `AvailableBlockData` enum So we go from 2^5 states to 4 well-defined. Downstream code benefits nicely from this clarity and I think it makes the whole feature much more maintainable. Currently `is_available` returns a bool, and then we construct the available block in `make_available`. In a way the availability condition is duplicated in both functions. Instead, this PR constructs `AvailableBlockData` in `is_available` so the availability conditions are written once ```rust if let Some(block_data) = is_available(..) { let available_block = make_available(block_data); } ```
Partly addresses - sigp#6959 Use the `enable_light_client_server` field from the beacon chain config in the HTTP API. I think we can make this the single source of truth, as I think the network crate also has access to the beacon chain config.
…ing (sigp#7030) Related to sigp#6880, an issue that's usually observed on local devnets with small number of nodes. When testing range sync, I usually shutdown a node for some period of time and restart it again. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and if it may attempt to produce a block if requested by a validator - on a local devnet, nodes frequently produce blocks - when this happens, the node ends up producing a block that would revert finality and would get disconnected from peers immediately. ### Usage Run Lighthouse BN with this flag to override: ``` --sync-tolerance--epoch 0 ```
…ing (sigp#7030) Related to sigp#6880, an issue that's usually observed on local devnets with small number of nodes. When testing range sync, I usually shutdown a node for some period of time and restart it again. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and if it may attempt to produce a block if requested by a validator - on a local devnet, nodes frequently produce blocks - when this happens, the node ends up producing a block that would revert finality and would get disconnected from peers immediately. NOTE: This is PR#7030 cherry-picked from `unstable` to `release-v7.0.0`. Run Lighthouse BN with this flag to override: ``` --sync-tolerance--epoch 0 ```
Cleaned up and isolated version of the `--disable-attesting` flag for the VC, from the `holesky-rescue` branch: - sigp#7041 I figured we don't need the `--disable-attesting` flag on the BN for now, and it was a much more invasive impl.
Adds the `--long-timeouts-multiplier` flag. Allows granular control for VC timeouts which has proved useful in Holesky.
This change makes the `total_difficulty` field in `ExecutionBlock` an `Option<Uint256>` since newer clients are no longer including the `totalDifficulty` field. I think this will fix sigp#6937 but I was actually more focused on the builder registration case described below. In our [builder-playground](https://github.com/flashbots/builder-playground) we setup a local devnet using lighthouse, reth, and mev-boost-relay. After upgrading to reth 1.2.0 and lighthouse v7.0.0.beta.0 for Pectra, we noticed that the validator registration process was _sometimes_ failing with: ``` Feb 25 23:35:25.038 ERRO Unable to publish proposer preparation to all beacon nodes, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation Feb 25 23:35:25.099 WARN Unable to publish validator registrations to the builder network, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation ``` What was even more confusing, was that it was sometimes working, which actually led to a wild goose chase thinking it was a networking issue. However, when tracing through the LH code, I came across this comment: https://github.com/sigp/lighthouse/blob/70194dfc6a3f4d10c9059610f889ff5a4e863a6a/beacon_node/beacon_chain/src/beacon_chain.rs#L6048-L6049 This explained why it sometimes worked, in our playground we run lighthouse with `--prepare-payload-lookahead 8000` thus there was always a 4-second window where the call wasn't made. But, if the call was made, then this code would 100% fail with updated reth: https://github.com/sigp/lighthouse/blob/unstable/beacon_node/execution_layer/src/lib.rs#L1688-L1692 Which would then mapped to a `Error::ForkchoiceUpdate` in `update_execution_engine_forkchoice`. Anyways, the fix was to make `total_difficulty` Optional, and then to update any code paths where it was used. In doing so, I assume that if the EL doesn't include total difficulty then the chain is already post-merge.
N/A Derive ssz::Encode and Decode on the `SignedValidatorRegistrationData` type to use in the builder
sigp#7044) Replace the `2 + 2 == 5` hacks from `holesky-rescue` and use the existing `sync_tolerance_epochs` flag to control the proposer prep routines.
This is a backport from `holesky-rescue`. Part of: - sigp#7039 Original PR to `holesky-rescue`: - sigp#7054 Avoid doing database lookups for slots that lie in the hot database when processing status messages. This avoids a DoS vector during non-finality, as loading hot states to iterate block roots is very expensive.
NA Bumps the `ethereum_ssz` version, along with other crates that share the dep. Primarily, this give us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (sigp/ethereum_ssz#38). The validator count has increase massively since we set it at 32 bytes, so aggregation bitfields (et al) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
This updates to a version with the [compute_cells](https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/polynomial-commitments-sampling.md#compute_cells) method that allows you to extend a blob without creating a proof for it. Which issue # does this PR address? Please list or describe the changes introduced by this PR.
Cherry-picking sigp#7055 from `holesky-rescue` branch to the clean `release-v7.0.0` branch.
Delete duplicate sync tolerance epoch config in the HTTP API which is unused. We introduced the `sync-tolerance-epoch` flag in this PR: - sigp#7030 Then refined it in this PR: - sigp#7044 Somewhere in the merge of `release-v7.0.0` into `unstable`, the config from the original PR which had been deleted came back. I think I resolved these conflicts, so my bad.
We forked `gossipsub` into the lighthouse repo sometime ago so that we could iterate quicker on implementing back pressure and IDONTWANT. Meanwhile we have pushed all our changes upstream and we are now the main maintainers of `rust-libp2p` this allows us to use upstream `gossipsub` again. Nonetheless we still use our forked repo to give us freedom to experiment with features before submitting them upstream
Tracing Integration - [reference](https://github.com/eth-protocol-fellows/cohort-five/blob/5bbf1859e921065bd69f8671038ed16643465b86/projects/project-ideas.md?plain=1#L297) - [x] replace slog & log with tracing throughout the codebase - [x] implement custom crit log - [x] make relevant changes in the formatter - [x] replace sloggers - [x] re-write SSE logging components cc: @macladson @eserilev
Partially sigp#7100 Set blob pruning to default to once per day
Resolves sigp#7091 The `prometheus` crate pulls in `protobuf 2.x` which fails cargo audit. We actually dont use any `protobuf` related features in LH. By disabling default features for `prometheus`, we no longer pull in the `protobuf` crate
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Same as last time: this is a PR for the anchor branch and requires no in-depth review.
This PR resets the branch to the state of unstable and additionally merges in
release-v7.0.0andmodularize-validator-store. We're close to being the same asunstable! 🥳Please merge without squashing!