Skip to content

Anchor Pre-PR: unstable, release-v7.0.0 and modularize-validator-store#7127

Merged
jimmygchen merged 49 commits intosigp:anchorfrom
dknopik:into-anchor
Mar 13, 2025
Merged

Anchor Pre-PR: unstable, release-v7.0.0 and modularize-validator-store#7127
jimmygchen merged 49 commits intosigp:anchorfrom
dknopik:into-anchor

Conversation

@dknopik
Copy link
Member

@dknopik dknopik commented Mar 13, 2025

Same as last time: this is a PR for the anchor branch and requires no in-depth review.

This PR resets the branch to the state of unstable and additionally merges in release-v7.0.0 and modularize-validator-store. We're close to being the same as unstable! 🥳

Please merge without squashing!

dapplion and others added 30 commits February 10, 2025 07:55
Currently we have very poor coverage of range sync with unit tests. With the event driven test framework we could cover much more ground and be confident when modifying the code.


  Add two basic cases:
- Happy path, complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later we find enough

⚠️  If you have ideas for more test cases, please let me know! I'll write them
Address misc PeerDAS TODOs that are not too big for a dedicated PR


  I'll justify each TODO on an inlined comment
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
- PR sigp#6497 made obsolete some consistency checks inside the batch

I forgot to remove the consumers of those errors


  Remove un-used batch sync error condition, which was a nested `Result<_, Result<_, E>>`
Addresses sigp#6854.

PeerDAS requires unsubscribing a Gossip topic at a fork boundary. This is not possible with our current topic machinery.


  Instead of defining which topics have to be **added** at a given fork, we define the complete set of topics at a given fork. The new start of the show and key function is:

```rust
pub fn core_topics_to_subscribe<E: EthSpec>(
fork_name: ForkName,
opts: &TopicConfig,
spec: &ChainSpec,
) -> Vec<GossipKind> {
// ...
if fork_name.deneb_enabled() && !fork_name.fulu_enabled() {
// All of deneb blob topics are core topics
for i in 0..spec.blob_sidecar_subnet_count(fork_name) {
topics.push(GossipKind::BlobSidecar(i));
}
}
// ...
}
```

`core_topics_to_subscribe` only returns the blob topics if `fork < Fulu`. Then at the fork boundary, we subscribe with the new fork digest to `core_topics_to_subscribe(next_fork)`, which excludes the blob topics.

I added `is_fork_non_core_topic` to carry on to the next fork the aggregator topics for attestations and sync committee messages. This approach is future-proof if those topics ever become fork-dependent.

Closes sigp#6854
…le (sigp#6984)

Closes sigp#6983


  `GET v2/validator/aggregate_attestation` is not backwards compatible. It only works for post electra attestations. This PR adds backwards compatibility and additional test coverage. We should include this in the upcoming 7.0 beta release if possible
Our Holesky nodes running with the light client enabled were logging messages about full queues:

> Feb 12 22:09:28.949 ERRO Work queue is full                      queue: unknown_light_client_optimistic_update, queue_len: 128, msg: the system has insufficient resources for load, service: bproc

I thought this might be genuine overload, but it turns out this queue was never being read from!


  - [x] Rename light-client related queues in the beacon processor for clarity.
- [x] Ensure all light-client related queues are being popped from.
New release for Electra on Holesky and Sepolia.

Includes PRs:

- sigp#6808
- sigp#6914
- sigp#6949
- sigp#6950
- sigp#6958
Fix a regression introduced in this PR:

- sigp#6361

We were indexing into the `MerkleTree` with raw generalized indices, which was incorrect and triggering `debug_assert` failures, as described here:

- sigp#7005


  - Convert `generalized_index` to the correct leaf index prior to proof generation.
- Add sanity checks on indices used in `BeaconState::generate_proof`.
- Remove debug asserts from `MerkleTree::generate_proof` in favour of actual errors. This would have caught the bug earlier.
- Refactor the EF tests so that the merkle validity tests are actually run. They were misconfigured in a way that resulted in them running silently with 0 test cases, and the `check_all_files_accessed.py` script still had an ignore that covered the test files, so this omission wasn't detected.
We don't need to store `BehaviourAction` for `ready_requests` and therefore avoid having an `unreachable!` on sigp#6625.
Therefore this PR should be merged before it
N/A


  2 changes:
1. Replace Option::map_or(true, ...) with is_none_or(...)
2. Remove unnecessary `Into::into` blocks where the type conversion is apparent from the types
Resolves sigp#7000


  Set the accept header on builder to the correct value when requesting ssz.

This PR also adds a flag to disable ssz over the builder api altogether. In the case that builders/relays have an ssz bug, we can react quickly by asking clients to restart their nodes with the `--disable-ssz-builder` flag to force json. I'm not fully convinced if this is useful so open to removing it or opening another PR for it.

Testing this currently.
Have blobs by default in deneb runs of kurtosis
I keep being notified for PR's like sigp#7009 where it doesn't touch the specified directories in the `CODEOWNERS` file.
After reading the [docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners) not having a forward slash in beginning of the path means:
>  In this example, @octocat owns any file in an apps directory
> anywhere in your repository.

whereas with the slashes:
> In this example, @doctocat owns any file in the `/docs`
> directory in the root of your repository and any of its
> subdirectories.

this update makes it more rigid for the files in the jxs ownership
PeerDAS has undergone multiple refactors + the blending with the get_blobs optimization has generated technical debt.

A function signature like this

https://github.com/sigp/lighthouse/blob/f008b84079bbb6eb86de22bb3421dfc8263a5650/beacon_node/beacon_chain/src/beacon_chain.rs#L7171-L7178

Allows at least the following combination of states:
- blobs: Some / None
- data_columns: Some / None
- data_column_recv: Some / None
- Block has data? Yes / No
- Block post-PeerDAS? Yes / No

In reality, we don't have that many possible states, only:

- `NoData`: pre-deneb, pre-PeerDAS with 0 blobs or post-PeerDAS with 0 blobs
- `Blobs(BlobSidecarList<E>)`: post-Deneb pre-PeerDAS with > 0 blobs
- `DataColumns(DataColumnSidecarList<E>)`: post-PeerDAS with > 0 blobs
- `DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList<E>>)`: post-PeerDAS with > 0 blobs, but we obtained the columns via reconstruction

^ this are the variants of the new `AvailableBlockData` enum

So we go from 2^5 states to 4 well-defined. Downstream code benefits nicely from this clarity and I think it makes the whole feature much more maintainable.

Currently `is_available` returns a bool, and then we construct the available block in `make_available`. In a way the availability condition is duplicated in both functions. Instead, this PR constructs `AvailableBlockData` in `is_available` so the availability conditions are written once

```rust
if let Some(block_data) = is_available(..) {
let available_block = make_available(block_data);
}
```
Partly addresses

- sigp#6959


  Use the `enable_light_client_server` field from the beacon chain config in the HTTP API. I think we can make this the single source of truth, as I think the network crate also has access to the beacon chain config.
…ing (sigp#7030)

Related to sigp#6880, an issue that's usually observed on local devnets with small number of nodes.

When testing range sync, I usually shutdown a node for some period of time and restart it again. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and if it may attempt to produce a block if requested by a validator - on a local devnet, nodes frequently produce blocks - when this happens, the node ends up producing a block that would revert finality and would get disconnected from peers immediately.

### Usage

Run Lighthouse BN with this flag to override:

```
--sync-tolerance--epoch 0
```
…ing (sigp#7030)

Related to sigp#6880, an issue that's usually observed on local devnets with small number of nodes.

When testing range sync, I usually shutdown a node for some period of time and restart it again. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node as synced, and if it may attempt to produce a block if requested by a validator - on a local devnet, nodes frequently produce blocks - when this happens, the node ends up producing a block that would revert finality and would get disconnected from peers immediately.

NOTE: This is PR#7030 cherry-picked from `unstable` to `release-v7.0.0`.

Run Lighthouse BN with this flag to override:

```
--sync-tolerance--epoch 0
```
Cleaned up and isolated version of the `--disable-attesting` flag for the VC, from the `holesky-rescue` branch:

- sigp#7041

I figured we don't need the `--disable-attesting` flag on the BN for now, and it was a much more invasive impl.
Adds the `--long-timeouts-multiplier` flag.
Allows granular control for VC timeouts which has proved useful in Holesky.
This change makes the `total_difficulty` field in `ExecutionBlock` an `Option<Uint256>` since newer clients are no longer including the `totalDifficulty` field.

I think this will fix sigp#6937 but I was actually more focused on the builder registration case described below.

In our [builder-playground](https://github.com/flashbots/builder-playground) we setup a local devnet using lighthouse, reth, and mev-boost-relay.  After upgrading to reth 1.2.0 and lighthouse v7.0.0.beta.0 for Pectra, we noticed that the validator registration process was _sometimes_ failing with:

```
Feb 25 23:35:25.038 ERRO Unable to publish proposer preparation to all beacon nodes, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation
Feb 25 23:35:25.099 WARN Unable to publish validator registrations to the builder network, error: Some endpoints failed, num_failed: 1 http://localhost:3500/ => RequestFailed(ServerMessage(ErrorMessage { code: 400, message: "BAD_REQUEST: error updating proposer preparations: ForkchoiceUpdate(EngineError(Api { error: Json(Error(\"missing field `totalDifficulty`\", line: 0, column: 0)) }))", stacktraces: [] })), service: preparation
```

What was even more confusing, was that it was sometimes working, which actually led to a wild goose chase thinking it was a networking issue.  However, when tracing through the LH code, I came across this comment:

https://github.com/sigp/lighthouse/blob/70194dfc6a3f4d10c9059610f889ff5a4e863a6a/beacon_node/beacon_chain/src/beacon_chain.rs#L6048-L6049

This explained why it sometimes worked, in our playground we run lighthouse with `--prepare-payload-lookahead 8000` thus there was always a 4-second window where the call wasn't made.

But, if the call was made, then this code would 100% fail with updated reth:

https://github.com/sigp/lighthouse/blob/unstable/beacon_node/execution_layer/src/lib.rs#L1688-L1692

Which would then mapped to a `Error::ForkchoiceUpdate` in `update_execution_engine_forkchoice`.


  Anyways, the fix was to make `total_difficulty` Optional, and then to update any code paths where it was used.  In doing so, I assume that if the EL doesn't include total difficulty then the chain is already post-merge.
Removes the WIP chain indexer known as beacon watch.
eserilev and others added 19 commits March 5, 2025 06:21
N/A


  Derive ssz::Encode and Decode on the `SignedValidatorRegistrationData` type to use in the builder
sigp#7044)

Replace the `2 + 2 == 5` hacks from `holesky-rescue` and use the existing `sync_tolerance_epochs` flag to control the proposer prep routines.
Unblock CI by ignoring cargo audit failures. IMO we need to merge some stuff to get our PR backlog under control and can't wait for audit fixes. I've opened issues to address the actual audit failures:

- sigp#7090
- sigp#7091
This is a backport from `holesky-rescue`.

Part of:

- sigp#7039

Original PR to `holesky-rescue`:

- sigp#7054


  Avoid doing database lookups for slots that lie in the hot database when processing status messages. This avoids a DoS vector during non-finality, as loading hot states to iterate block roots is very expensive.
NA


  Bumps the `ethereum_ssz` version, along with other crates that share the dep.

Primarily, this give us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (sigp/ethereum_ssz#38). The validator count has increase massively since we set it at 32 bytes, so aggregation bitfields (et al) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
This updates to a version with the [compute_cells](https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/polynomial-commitments-sampling.md#compute_cells) method that allows you to extend a blob without creating a proof for it.

Which issue # does this PR address?


  Please list or describe the changes introduced by this PR.
Cherry-picking sigp#7055 from `holesky-rescue` branch to the clean `release-v7.0.0` branch.
Delete duplicate sync tolerance epoch config in the HTTP API which is unused.

We introduced the `sync-tolerance-epoch` flag in this PR:

- sigp#7030

Then refined it in this PR:

- sigp#7044

Somewhere in the merge of `release-v7.0.0` into `unstable`, the config from the original PR which had been deleted came back. I think I resolved these conflicts, so my bad.
We forked `gossipsub` into the lighthouse repo sometime ago so that we could iterate quicker on implementing back pressure and IDONTWANT.
Meanwhile we have pushed all our changes upstream and we are now the main maintainers of `rust-libp2p` this allows us to use upstream `gossipsub` again.
Nonetheless we still use our forked repo to give us freedom to experiment with features before submitting them upstream
Tracing Integration
- [reference](https://github.com/eth-protocol-fellows/cohort-five/blob/5bbf1859e921065bd69f8671038ed16643465b86/projects/project-ideas.md?plain=1#L297)


  - [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components

cc: @macladson @eserilev
Partially sigp#7100


  Set blob pruning to default to once per day
Resolves sigp#7091


  The `prometheus` crate pulls in `protobuf 2.x` which fails cargo audit. We actually dont use any `protobuf` related features in LH. By disabling default features for `prometheus`, we no longer pull in the `protobuf` crate
@dknopik dknopik requested a review from jxs as a code owner March 13, 2025 12:20
@jimmygchen jimmygchen merged commit 03dfea3 into sigp:anchor Mar 13, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.