Anchor Pre-PR: `unstable`, `release-v7.0.0` and `modularize-validator-store` by dknopik · Pull Request #7127 · sigp/lighthouse

dknopik · 2025-03-13T12:20:44Z

Same as last time: this is a PR for the anchor branch and requires no in-depth review.

This PR resets the branch to the state of unstable and additionally merges in release-v7.0.0 and modularize-validator-store. We're close to being the same as unstable! 🥳

Please merge without squashing!

Currently we have very poor coverage of range sync with unit tests. With the event-driven test framework we could cover much more ground and be confident when modifying the code.

Add two basic cases:
- Happy path: complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later find enough

⚠️ If you have ideas for more test cases, please let me know! I'll write them.

PR to fix cargo audit failure and bump `hickory-proto` version: https://github.com/sigp/lighthouse/actions/runs/13252660830/job/36993664960

…able

Address misc PeerDAS TODOs that are not too big for a dedicated PR. I'll justify each TODO in an inline comment.

Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR, sigp#6922, deletes that function, so we have to move the metric anyway.

This PR moves that metric computation to the metrics spawned task, which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
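As a rough illustration of what these metrics measure, here is a minimal sketch that counts peers per column subnet and derives the zero-peer count. The `SubnetId` alias and both function names are hypothetical and are not taken from the Lighthouse codebase.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a column subnet identifier.
type SubnetId = u64;

/// Count how many peers advertise each of the `subnet_count` subnets.
/// `peer_subnets` holds one list of subnets per connected peer.
fn peers_per_column_subnet(
    peer_subnets: &[Vec<SubnetId>],
    subnet_count: u64,
) -> HashMap<SubnetId, usize> {
    // Start every subnet at zero so empty subnets are still reported.
    let mut counts: HashMap<SubnetId, usize> = (0..subnet_count).map(|s| (s, 0)).collect();
    for subnets in peer_subnets {
        for subnet in subnets {
            // Ignore subnet ids outside the configured range.
            if let Some(count) = counts.get_mut(subnet) {
                *count += 1;
            }
        }
    }
    counts
}

/// Same idea as the Grafana query `count(sync_peers_per_column_subnet == 0)`.
fn subnets_with_zero_peers(counts: &HashMap<SubnetId, usize>) -> usize {
    counts.values().filter(|&&c| c == 0).count()
}
```

Restricting `peer_subnets` to the subnets the node custodies would give the custody variant of the same count.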

- PR sigp#6497 made obsolete some consistency checks inside the batch. I forgot to remove the consumers of those errors.

Remove un-used batch sync error condition, which was a nested `Result<_, Result<_, E>>`.

Addresses sigp#6854.

PeerDAS requires unsubscribing a Gossip topic at a fork boundary. This is not possible with our current topic machinery.

Instead of defining which topics have to be **added** at a given fork, we define the complete set of topics at a given fork. The new star of the show and key function is:

```rust
pub fn core_topics_to_subscribe<E: EthSpec>(
    fork_name: ForkName,
    opts: &TopicConfig,
    spec: &ChainSpec,
) -> Vec<GossipKind> {
    // ...
    if fork_name.deneb_enabled() && !fork_name.fulu_enabled() {
        // All of deneb blob topics are core topics
        for i in 0..spec.blob_sidecar_subnet_count(fork_name) {
            topics.push(GossipKind::BlobSidecar(i));
        }
    }
    // ...
}
```

`core_topics_to_subscribe` only returns the blob topics if `fork < Fulu`. Then at the fork boundary, we subscribe with the new fork digest to `core_topics_to_subscribe(next_fork)`, which excludes the blob topics.

I added `is_fork_non_core_topic` to carry the aggregator topics for attestations and sync committee messages over to the next fork. This approach is future-proof if those topics ever become fork-dependent.

Closes sigp#6854
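To illustrate the "complete set per fork" idea, the sketch below models it with simplified, hypothetical types (`Fork`, `Topic`, `core_topics`, `topics_to_unsubscribe`); none of these are the real Lighthouse definitions. Once each fork yields its full topic set, the topics to drop at a boundary fall out as a set difference.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified fork list; ordering follows declaration order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Fork {
    Deneb,
    Electra,
    Fulu,
}

/// Hypothetical, simplified topic kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Topic {
    BeaconBlock,
    BlobSidecar(u64),
}

/// Return the complete set of core topics for `fork`, not just the additions.
fn core_topics(fork: Fork, blob_subnets: u64) -> HashSet<Topic> {
    let mut topics = HashSet::from([Topic::BeaconBlock]);
    // Blob topics are core only from Deneb up to, but excluding, Fulu.
    if fork >= Fork::Deneb && fork < Fork::Fulu {
        topics.extend((0..blob_subnets).map(Topic::BlobSidecar));
    }
    topics
}

/// Topics in the current fork's set that are absent from the next fork's set.
fn topics_to_unsubscribe(current: &HashSet<Topic>, next: &HashSet<Topic>) -> HashSet<Topic> {
    current.difference(next).copied().collect()
}
```

With this shape, removing a topic at a fork needs no special case: it simply stops appearing in the next fork's set.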

…le (sigp#6984) Closes sigp#6983 `GET v2/validator/aggregate_attestation` is not backwards compatible. It only works for post-Electra attestations. This PR adds backwards compatibility and additional test coverage. We should include this in the upcoming 7.0 beta release if possible.

Our Holesky nodes running with the light client enabled were logging messages about full queues:

> Feb 12 22:09:28.949 ERRO Work queue is full queue: unknown_light_client_optimistic_update, queue_len: 128, msg: the system has insufficient resources for load, service: bproc

I thought this might be genuine overload, but it turns out this queue was never being read from!

- [x] Rename light-client related queues in the beacon processor for clarity.
- [x] Ensure all light-client related queues are being popped from.

New release for Electra on Holesky and Sepolia.

Includes PRs:
- sigp#6808
- sigp#6914
- sigp#6949
- sigp#6950
- sigp#6958

…able

Fix a regression introduced in this PR:

- sigp#6361

We were indexing into the `MerkleTree` with raw generalized indices, which was incorrect and triggering `debug_assert` failures, as described here:

- sigp#7005

- Convert `generalized_index` to the correct leaf index prior to proof generation.
- Add sanity checks on indices used in `BeaconState::generate_proof`.
- Remove debug asserts from `MerkleTree::generate_proof` in favour of actual errors. This would have caught the bug earlier.
- Refactor the EF tests so that the merkle validity tests are actually run. They were misconfigured in a way that resulted in them running silently with 0 test cases, and the `check_all_files_accessed.py` script still had an ignore that covered the test files, so this omission wasn't detected.
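For readers unfamiliar with the distinction, here is a minimal sketch of turning a generalized index into a depth and a leaf index, with an error instead of a debug assert when the depth is unexpected. Both function names are hypothetical and this is not the code from the PR; it only relies on the standard layout where the nodes at depth `d` have generalized indices `2^d` through `2^(d+1) - 1`.

```rust
/// Convert a generalized index into `(depth, leaf_index)`.
/// Returns `None` for 0, which is not a valid generalized index.
fn gindex_to_leaf_index(generalized_index: usize) -> Option<(u32, usize)> {
    if generalized_index == 0 {
        return None;
    }
    // Depth of the node: floor(log2(generalized_index)).
    let depth = usize::BITS - 1 - generalized_index.leading_zeros();
    // The first node at this depth has generalized index 2^depth.
    let leaf_index = generalized_index - (1usize << depth);
    Some((depth, leaf_index))
}

/// Return the leaf index only if the node sits at `expected_depth`,
/// reporting a real error rather than relying on a debug assertion.
fn checked_leaf_index(generalized_index: usize, expected_depth: u32) -> Result<usize, String> {
    let (depth, leaf_index) = gindex_to_leaf_index(generalized_index)
        .ok_or_else(|| "generalized index 0 is invalid".to_string())?;
    if depth != expected_depth {
        return Err(format!(
            "index {generalized_index} is at depth {depth}, expected {expected_depth}"
        ));
    }
    Ok(leaf_index)
}
```

For example, generalized index 105 lies at depth 6, so its leaf index is 105 - 64 = 41. Indexing a depth-6 tree with the raw value 105 would point past its 64 leaves.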

We don't need to store `BehaviourAction` for `ready_requests` and therefore avoid having an `unreachable!` on sigp#6625. Therefore this PR should be merged before it

N/A

2 changes:
1. Replace `Option::map_or(true, ...)` with `is_none_or(...)`
2. Remove unnecessary `Into::into` blocks where the type conversion is apparent from the types
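A small sketch of the first change, using a hypothetical `within_limit` helper rather than code from the PR. It assumes a toolchain where `Option::is_none_or` is stable (Rust 1.82 or later).

```rust
/// Old style: true when no limit is set, or when the value is within the limit.
fn within_limit_old(limit: Option<u64>, value: u64) -> bool {
    limit.map_or(true, |max| value <= max)
}

/// New style: the same predicate expressed with `is_none_or`.
fn within_limit_new(limit: Option<u64>, value: u64) -> bool {
    limit.is_none_or(|max| value <= max)
}
```

Both functions return the same result for every input; the second states the "none, or matches" intent directly.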

Resolves sigp#7000 Set the accept header on builder to the correct value when requesting ssz. This PR also adds a flag to disable ssz over the builder api altogether. In the case that builders/relays have an ssz bug, we can react quickly by asking clients to restart their nodes with the `--disable-ssz-builder` flag to force json. I'm not fully convinced if this is useful so open to removing it or opening another PR for it. Testing this currently.

Have blobs by default in deneb runs of kurtosis

@octocat

I keep being notified for PRs like sigp#7009 that don't touch the directories specified in the `CODEOWNERS` file. According to the [docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners), not having a forward slash at the beginning of the path means:

> In this example, @octocat owns any file in an apps directory
> anywhere in your repository.

whereas with the slashes:

> In this example, @doctocat owns any file in the `/docs`
> directory in the root of your repository and any of its
> subdirectories.

This update makes the paths more rigid for the files under jxs's ownership.

PeerDAS has undergone multiple refactors, and the blending with the get_blobs optimization has generated technical debt.

A function signature like this

https://github.com/sigp/lighthouse/blob/f008b84079bbb6eb86de22bb3421dfc8263a5650/beacon_node/beacon_chain/src/beacon_chain.rs#L7171-L7178

allows at least the following combination of states:
- blobs: Some / None
- data_columns: Some / None
- data_column_recv: Some / None
- Block has data? Yes / No
- Block post-PeerDAS? Yes / No

In reality, we don't have that many possible states, only:
- `NoData`: pre-deneb, pre-PeerDAS with 0 blobs or post-PeerDAS with 0 blobs
- `Blobs(BlobSidecarList<E>)`: post-Deneb pre-PeerDAS with > 0 blobs
- `DataColumns(DataColumnSidecarList<E>)`: post-PeerDAS with > 0 blobs
- `DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList<E>>)`: post-PeerDAS with > 0 blobs, but we obtained the columns via reconstruction

^ these are the variants of the new `AvailableBlockData` enum

So we go from 2^5 states to 4 well-defined ones. Downstream code benefits nicely from this clarity and I think it makes the whole feature much more maintainable.

Currently `is_available` returns a bool, and then we construct the available block in `make_available`. In a way the availability condition is duplicated in both functions. Instead, this PR constructs `AvailableBlockData` in `is_available` so the availability conditions are written once:

```rust
if let Some(block_data) = is_available(..) {
    let available_block = make_available(block_data);
}
```
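A minimal, self-contained sketch of the enum-based approach follows. The variant names come from the description above, but everything else is an assumption made for illustration: the list types are plain `Vec` aliases, `std::sync::mpsc::Receiver` stands in for `oneshot::Receiver`, and the parameters and availability rules of `is_available` are hypothetical.

```rust
use std::sync::mpsc::Receiver;

// Hypothetical stand-ins for the real sidecar list types.
type BlobSidecarList = Vec<Vec<u8>>;
type DataColumnSidecarList = Vec<Vec<u8>>;

/// The four well-defined states described above.
enum AvailableBlockData {
    NoData,
    Blobs(BlobSidecarList),
    DataColumns(DataColumnSidecarList),
    /// Columns still being reconstructed; std's `Receiver` replaces `oneshot::Receiver`.
    DataColumnsRecv(Receiver<DataColumnSidecarList>),
}

/// Hypothetical availability check that builds the data instead of returning a bool,
/// so the availability condition is written in one place.
fn is_available(
    expected_blobs: usize,
    post_peerdas: bool,
    blobs: Option<BlobSidecarList>,
    columns: Option<DataColumnSidecarList>,
) -> Option<AvailableBlockData> {
    if expected_blobs == 0 {
        return Some(AvailableBlockData::NoData);
    }
    if post_peerdas {
        columns.map(AvailableBlockData::DataColumns)
    } else {
        blobs
            .filter(|b| b.len() == expected_blobs)
            .map(AvailableBlockData::Blobs)
    }
}

/// Downstream code matches exhaustively on four variants rather than on 2^5 flag combinations.
fn describe(data: &AvailableBlockData) -> &'static str {
    match data {
        AvailableBlockData::NoData => "no data",
        AvailableBlockData::Blobs(_) => "blobs",
        AvailableBlockData::DataColumns(_) => "data columns",
        AvailableBlockData::DataColumnsRecv(_) => "data columns (pending receiver)",
    }
}
```

Because the match is exhaustive, adding a fifth state later would be flagged by the compiler at every consumer.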

…rDAS devnets (sigp#6947) Closes sigp#6919

Partly addresses - sigp#6959 Use the `enable_light_client_server` field from the beacon chain config in the HTTP API. I think we can make this the single source of truth, as I think the network crate also has access to the beacon chain config.

…ing (sigp#7030)

Related to sigp#6880, an issue that's usually observed on local devnets with a small number of nodes.

When testing range sync, I usually shut down a node for some period of time and restart it. However, if the downtime is within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and it may attempt to produce a block if requested by a validator. On a local devnet, nodes frequently produce blocks, and when this happens the node ends up producing a block that would revert finality and gets disconnected from peers immediately.

### Usage

Run Lighthouse BN with this flag to override:

```
--sync-tolerance-epochs 0
```
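The effect of the tolerance can be sketched as follows. This is a hypothetical check written for illustration, not the Lighthouse implementation; it assumes 32 slots per epoch and that "synced" means the head is at most the tolerance behind the current slot.

```rust
/// Assumed mainnet-style value; other presets may differ.
const SLOTS_PER_EPOCH: u64 = 32;

/// Hypothetical check: the node counts as synced when its head is at most
/// `sync_tolerance_epochs` epochs behind the current slot.
fn is_considered_synced(head_slot: u64, current_slot: u64, sync_tolerance_epochs: u64) -> bool {
    let tolerance_slots = sync_tolerance_epochs.saturating_mul(SLOTS_PER_EPOCH);
    current_slot.saturating_sub(head_slot) <= tolerance_slots
}
```

With the default of 8 epochs, a node that is 100 slots behind still counts as synced under this sketch. With a tolerance of 0, only a node whose head is at the current slot does, which is the behaviour wanted when restarting nodes on a small devnet.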

…ing (sigp#7030)

Related to sigp#6880, an issue that's usually observed on local devnets with a small number of nodes.

When testing range sync, I usually shut down a node for some period of time and restart it. However, if the downtime is within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and it may attempt to produce a block if requested by a validator. On a local devnet, nodes frequently produce blocks, and when this happens the node ends up producing a block that would revert finality and gets disconnected from peers immediately.

NOTE: This is PR#7030 cherry-picked from `unstable` to `release-v7.0.0`.

Run Lighthouse BN with this flag to override:

```
--sync-tolerance-epochs 0
```

Cleaned up and isolated version of the `--disable-attesting` flag for the VC, from the `holesky-rescue` branch:

- sigp#7041

I figured we don't need the `--disable-attesting` flag on the BN for now, and it was a much more invasive impl.

Adds the `--long-timeouts-multiplier` flag. Allows granular control for VC timeouts which has proved useful in Holesky.

This change makes the `total_difficulty` field in `ExecutionBlock` an `Option<Uint256>`, since newer clients no longer include the `totalDifficulty` field. I think this will fix sigp#6937, but I was actually more focused on the builder registration case described below.

In our [builder-playground](https://github.com/flashbots/builder-playground) we set up a local devnet using lighthouse, reth, and mev-boost-relay. After upgrading to reth 1.2.0 and lighthouse v7.0.0.beta.0 for Pectra, we noticed that the validator registration process was _sometimes_ failing with:

```
Feb 25 23:35:25.038 ERRO Unable to publish proposer preparation to all beacon nodes, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation
Feb 25 23:35:25.099 WARN Unable to publish validator registrations to the builder network, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation
```

What was even more confusing was that it sometimes worked, which led to a wild goose chase thinking it was a networking issue. However, when tracing through the LH code, I came across this comment:

https://github.com/sigp/lighthouse/blob/70194dfc6a3f4d10c9059610f889ff5a4e863a6a/beacon_node/beacon_chain/src/beacon_chain.rs#L6048-L6049

This explained why it sometimes worked: in our playground we run lighthouse with `--prepare-payload-lookahead 8000`, thus there was always a 4-second window where the call wasn't made.

But if the call was made, then this code would 100% fail with updated reth:

https://github.com/sigp/lighthouse/blob/unstable/beacon_node/execution_layer/src/lib.rs#L1688-L1692

This would then be mapped to an `Error::ForkchoiceUpdate` in `update_execution_engine_forkchoice`.

Anyway, the fix was to make `total_difficulty` optional, and then to update any code paths where it was used. In doing so, I assume that if the EL doesn't include total difficulty then the chain is already post-merge.
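The stated assumption can be sketched like this. The struct is a hypothetical, simplified stand-in for `ExecutionBlock` (with `u128` in place of `Uint256`), and `is_post_merge` is an illustrative helper rather than a function from the PR.

```rust
/// Hypothetical, simplified stand-in for `ExecutionBlock`.
struct ExecutionBlock {
    /// `None` when the execution client omits `totalDifficulty`.
    total_difficulty: Option<u128>,
}

/// Treat a missing total difficulty as "already post-merge";
/// otherwise compare against the terminal total difficulty.
fn is_post_merge(block: &ExecutionBlock, terminal_total_difficulty: u128) -> bool {
    match block.total_difficulty {
        Some(total_difficulty) => total_difficulty >= terminal_total_difficulty,
        None => true,
    }
}
```

Making the field optional moves the decision about a missing value into one explicit branch, instead of failing at JSON deserialization.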

Removes the WIP chain indexer known as beacon watch.

N/A Derive ssz::Encode and Decode on the `SignedValidatorRegistrationData` type to use in the builder

sigp#7044) Replace the `2 + 2 == 5` hacks from `holesky-rescue` and use the existing `sync_tolerance_epochs` flag to control the proposer prep routines.

Unblock CI by ignoring cargo audit failures. IMO we need to merge some stuff to get our PR backlog under control and can't wait for audit fixes.

I've opened issues to address the actual audit failures:
- sigp#7090
- sigp#7091

This is a backport from `holesky-rescue`. Part of:
- sigp#7039

Original PR to `holesky-rescue`:
- sigp#7054

Avoid doing database lookups for slots that lie in the hot database when processing status messages. This avoids a DoS vector during non-finality, as loading hot states to iterate block roots is very expensive.

NA

Bumps the `ethereum_ssz` version, along with other crates that share the dep.

Primarily, this gives us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (sigp/ethereum_ssz#38). The validator count has increased massively since we set it at 32 bytes, so aggregation bitfields (et al) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
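A back-of-the-envelope sketch of where a figure of roughly 2m can come from. It assumes one bit per committee member, 32 slots per epoch, 64 committees per slot, and validators spread evenly across committees; it ignores any extra length bit in the bitlist encoding. The function names are hypothetical.

```rust
const SLOTS_PER_EPOCH: u64 = 32;
const MAX_COMMITTEES_PER_SLOT: u64 = 64;

/// Largest committee whose aggregation bits fit in `inline_bytes`
/// (one bit per committee member).
fn max_committee_size(inline_bytes: u64) -> u64 {
    inline_bytes * 8
}

/// Rough validator count at which committees reach that size, assuming
/// validators are spread evenly over every committee in an epoch.
fn validators_supported(inline_bytes: u64) -> u64 {
    max_committee_size(inline_bytes) * SLOTS_PER_EPOCH * MAX_COMMITTEES_PER_SLOT
}
```

Under these assumptions, 32 bytes covers committees of up to 256 members (about 524k validators), while 128 bytes covers committees of up to 1024 members (2,097,152 validators).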

This updates to a version with the [compute_cells](https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/polynomial-commitments-sampling.md#compute_cells) method that allows you to extend a blob without creating a proof for it.

Cherry-picking sigp#7055 from `holesky-rescue` branch to the clean `release-v7.0.0` branch.

Delete the duplicate sync tolerance epoch config in the HTTP API, which is unused.

We introduced the `sync-tolerance-epochs` flag in this PR:
- sigp#7030

Then refined it in this PR:
- sigp#7044

Somewhere in the merge of `release-v7.0.0` into `unstable`, the config from the original PR, which had been deleted, came back. I think I resolved these conflicts, so my bad.

We forked `gossipsub` into the lighthouse repo some time ago so that we could iterate quicker on implementing back pressure and IDONTWANT. Meanwhile we have pushed all our changes upstream and we are now the main maintainers of `rust-libp2p`, which allows us to use upstream `gossipsub` again. Nonetheless, we still use our forked repo to give us freedom to experiment with features before submitting them upstream.

@macladson

Tracing Integration
- [reference](https://github.com/eth-protocol-fellows/cohort-five/blob/5bbf1859e921065bd69f8671038ed16643465b86/projects/project-ideas.md?plain=1#L297)

- [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components

cc: @macladson @eserilev

Partially addresses sigp#7100. Set blob pruning to default to once per day.
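As a quick check of what "once per day" means in epochs, the sketch below converts an `epochs-per-blob-prune` value into wall-clock time. It assumes 32 slots per epoch and 12-second slots; the helper name is hypothetical. The value 256 is the default named in the commit title "Set epochs-per-blob-prune default to 256".

```rust
const SECONDS_PER_SLOT: u64 = 12;
const SLOTS_PER_EPOCH: u64 = 32;

/// Wall-clock seconds between blob prunes for a given epoch interval.
fn prune_interval_seconds(epochs_per_blob_prune: u64) -> u64 {
    epochs_per_blob_prune * SLOTS_PER_EPOCH * SECONDS_PER_SLOT
}
```

Under these assumptions 256 epochs is 98,304 seconds, or about 27.3 hours, so slightly longer than a day. Exactly one day would be 225 epochs.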

Resolves sigp#7091

The `prometheus` crate pulls in `protobuf 2.x`, which fails cargo audit. We don't actually use any `protobuf`-related features in LH. By disabling default features for `prometheus`, we no longer pull in the `protobuf` crate.

dapplion and others added 30 commits February 10, 2025 07:55

Address cargo audit failure RUSTSEC-2025-0006 (sigp#6972)

0728140

PR to fix cargo audit failure and bump `hickory-proto` version: https://github.com/sigp/lighthouse/actions/runs/13252660830/job/36993664960

Merge remote-tracking branch 'origin/release-v7.0.0-beta.0' into unstable

ec2fe38

Fix misc PeerDAS todos (sigp#6862)

3992d6b

Address misc PeerDAS TODOs that are not too big for a dedicated PR. I'll justify each TODO in an inline comment.

Remove un-used batch sync error condition (sigp#6917)

431dd7c

- PR sigp#6497 made obsolete some consistency checks inside the batch. I forgot to remove the consumers of those errors.

Remove un-used batch sync error condition, which was a nested `Result<_, Result<_, E>>`.

Release v7.0.0-beta.0 (sigp#6962)

1888be5

New release for Electra on Holesky and Sepolia.

Includes PRs:
- sigp#6808
- sigp#6914
- sigp#6949
- sigp#6950
- sigp#6958

Merge remote-tracking branch 'origin/release-v7.0.0-beta.0' into unstable

6ab6eae

Use RpcSend on RPC::self_limiter::ready_requests (sigp#6634)

193061f

We don't need to store `BehaviourAction` for `ready_requests` and therefore avoid having an `unreachable!` on sigp#6625. Therefore this PR should be merged before it

Rust 1.85 lints (sigp#7019)

b3b6aea

N/A

2 changes:
1. Replace `Option::map_or(true, ...)` with `is_none_or(...)`
2. Remove unnecessary `Into::into` blocks where the type conversion is apparent from the types

Merge remote-tracking branch 'origin/release-v7.0.0' into unstable

e5e43ec

Add spamoor_blob in network_params.yaml (sigp#7012)

b4be514

Have blobs by default in deneb runs of kurtosis

Expose blst internals (sigp#6829)

60964fc

feat: adds CLI flags to delay publishing for edge case testing on PeerDAS devnets (sigp#6947)

6e11bdd

Closes sigp#6919

Merge remote-tracking branch 'origin/release-v7.0.0' into unstable

cf4104a

Cache validator registration only after successful publish (sigp#7034)

8a77252

Add --long-timeouts-multiplier CLI flag (sigp#7047)

29a295a

Adds the `--long-timeouts-multiplier` flag. Allows granular control for VC timeouts which has proved useful in Holesky.

Remove watch (sigp#7048)

1235d44

Removes the WIP chain indexer known as beacon watch.

eserilev and others added 19 commits March 5, 2025 06:21

Schedule Chiado testnet Electra hard fork (sigp#7074)

066f287

Validator Registration ssz support (sigp#7081)

3bc5f1f

N/A Derive ssz::Encode and Decode on the `SignedValidatorRegistrationData` type to use in the builder

Use sync_tolerance_epochs flag to control the proposer prep routines (sigp#7044)

09849e8

Replace the `2 + 2 == 5` hacks from `holesky-rescue` and use the existing `sync_tolerance_epochs` flag to control the proposer prep routines.

Temporarily ignore cargo audit failures (sigp#7092)

166f6df

Unblock CI by ignoring cargo audit failures. IMO we need to merge some stuff to get our PR backlog under control and can't wait for audit fixes.

I've opened issues to address the actual audit failures:
- sigp#7090
- sigp#7091

Merge remote-tracking branch 'origin/release-v7.0.0' into unstable

b4e79ed

Address cargo audit failure RUSTSEC-2025-0009 (sigp#7086)

0f5e680

Change state cache size default to 32 (sigp#7101)

9c4fc6e

Cherry-picking sigp#7055 from `holesky-rescue` branch to the clean `release-v7.0.0` branch.

Set epochs-per-blob-prune default to 256 (sigp#7113)

2c40f0b

Partially addresses sigp#7100. Set blob pruning to default to once per day.

Squashed reset to unstable

f61f0b6

Merge branch 'unstable' into into-anchor

4f62e98

Merge remote-tracking branch 'origin/release-v7.0.0' into into-anchor

ac53bee

Squashed changes from modularize-validator-store

3316681

dknopik requested a review from jxs as a code owner March 13, 2025 12:20

jimmygchen merged commit 03dfea3 into sigp:anchor Mar 13, 2025
28 checks passed

jimmygchen merged 49 commits into sigp:anchor from dknopik:into-anchor


14 participants
