Add CPU bound and IO bound worker management to the beacon processor #23

Closed
eserilev wants to merge 79 commits into work-queue-refactor from cpu-bound-and-io-bound-workers

@eserilev eserilev commented Nov 7, 2025

Issue Addressed

Which issue # does this PR address?

Proposed Changes

Please list or describe the changes introduced by this PR.

Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.

eserilev and others added 21 commits November 7, 2025 01:41
State advances were observed as especially slow on pre-Fulu networks (mainnet).

The cause: we were doing an extra epoch of state advance because of code that should only run after Fulu, when proposer shufflings are determined with lookahead.


  Only attempt to cache the _next epoch_ shuffling if the state's slot determines it (this will only be true post-Fulu). Reusing the logic for `proposer_shuffling_decision_slot` avoids having to repeat the fiddly logic about the Fulu fork epoch itself.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Co-Authored-By: antondlr <anton@sigmaprime.io>
Fix an issue detected by @jimmygchen that occurs when checkpoint sync is aborted midway and then later restarted.

The characteristic error is something like:

> Nov 13 00:51:35.832 ERROR Database write failed                         error: Hdiff(LessThanStart(Slot(1728288), Slot(1728320))), action: "reverting blob DB changes"
Nov 13 00:51:35.833 WARN  Hot DB pruning failed                         error: DBError(HotColdDBError(Rollback))

This issue has existed since v7.1.0.


  Delete snapshot/diff in the case where `hot_storage_strategy` fails.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Fix the span on execution payload verification (newPayload), by creating a new span rather than using the parent span. Using the parent span was incorrectly associating the time spent verifying the payload with `from_signature_verified_components`.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
…8413)

Addressed this comment here: sigp#6837 (comment)

Lighthouse can only checkpoint sync from a server that can serve blob sidecars, which means the server needs to custody at least 50% of columns (a semi-supernode).

This PR lifts this constraint, as the blob sidecar endpoint is being deprecated in Fulu, and we plan to fetch the checkpoint data columns from peers (sigp#6837).


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
…igp#8391)

Take 2 of sigp#8390.

Fixes the race condition properly instead of propagating the error. I think this is a better alternative, and doesn't seem to look that bad.


  * Lift node id loading or generation from `NetworkService` startup to the `ClientBuilder`, so that it can be used to compute custody columns for the beacon chain without waiting for Network bootstrap.

I've considered and implemented a few alternatives:
1. Passing `node_id` to the beacon chain builder and computing columns when creating `CustodyContext`. This approach is poor for separation of concerns and for testability.
2. Passing `ordered_custody_groups` to the beacon chain. `CustodyContext` only uses this to compute ordered custody columns, so we might as well lift this logic out and avoid error handling in `CustodyContext` construction. Fewer tests to update.


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
N/A


  The difference is computed by taking the difference of expected with received. We were doing the inverse.

Thanks to Yassine for finding the issue.
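
The commit message does not say which collections are being compared, so the following is only a minimal sketch of why operand order matters in a set difference. The function name `missing_items` and the `u64` element type are assumptions made for illustration, not names from the Lighthouse code.

```rust
use std::collections::BTreeSet;

/// Returns the items that were expected but never received.
/// `expected` and `received` are hypothetical names for illustration.
fn missing_items(expected: &BTreeSet<u64>, received: &BTreeSet<u64>) -> Vec<u64> {
    // `a.difference(b)` yields the elements of `a` that are not in `b`,
    // so swapping the operands answers a different question
    // ("what did we receive that we did not expect?").
    expected.difference(received).copied().collect()
}
```

With the operands inverted, the result lists unexpected extras rather than missing items, which is the kind of mix-up the fix describes.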


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
Once sigp#8271 is merged, CI will only cover tests for `RECENT_FORKS` (prev, current, next)

To make sure functionalities aren't broken for prior forks, we run tests for these forks nightly. They can also be manually triggered.

Tested via manual trigger here: https://github.com/jimmygchen/lighthouse/actions/runs/18896690117

<img width="826" height="696" alt="image" src="https://github.com/user-attachments/assets/afdfb03b-a037-4094-9f1b-7466c0800f6b" />


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
The merge queue is failing due to md lint changes:

https://github.com/sigp/lighthouse/actions/runs/19491272535/job/55783746002

This PR fixes the lint. I'm targeting the release branch so we can get things merged for release tomorrow, and we'll merge back down to `unstable`.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Checkpoint sync starts should not be required to include the custody data columns; we want to fetch them from p2p instead.


Closes sigp#6837


  The checkpoint sync slot can:
1. Be the first slot in the epoch, such that the epoch of the block == the start checkpoint epoch
2. Be in an epoch prior to the start checkpoint epoch

In both cases, backfill sync already fetches that epoch's worth of blocks with the current code. This PR modifies the backfill import filter function to allow re-importing the oldest block slot in the DB.

I feel this solution is sufficient unless I'm missing something. ~~I have not tested this yet!~~ Michael has tested this and it works.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
This hot fix release includes the following fixes:
* sigp#8388
* sigp#8406
* sigp#8391
* sigp#8413


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: Age Manning <Age@AgeManning.com>
This is a `tracing`-driven optimisation. While investigating why Lighthouse is slow to send `newPayload`, I found a suspicious 13ms of computation on the hot path in `gossip_block_into_execution_pending_block_slashable`:

<img width="1998" height="1022" alt="headercalc" src="https://github.com/user-attachments/assets/e4f88c1a-da23-47b4-b533-cf5479a1c55c" />

Looking at the current implementation we can see that the _only_ thing that happens prior to calling into `from_gossip_verified_block` is the calculation of a `header`. We first call `SignatureVerifiedBlock::from_gossip_verified_block_check_slashable`:

https://github.com/sigp/lighthouse/blob/261322c3e3ee467c9454fa160a00866439cbc62f/beacon_node/beacon_chain/src/block_verification.rs#L1075-L1076

Which is where the `header` is calculated prior to calling `from_gossip_verified_block`:

https://github.com/sigp/lighthouse/blob/261322c3e3ee467c9454fa160a00866439cbc62f/beacon_node/beacon_chain/src/block_verification.rs#L1224-L1226

Notice that the `header` is _only_ used in the case of an error, yet we spend time computing it every time!


  This PR moves the calculation of the header (which involves hashing the whole beacon block, including the execution payload), into the error case. We take a cheap clone of the `Arc`'d beacon block on the hot path, and use this for calculating the header _only_ in the case an error actually occurs. This shaves 10-20ms off our pre-newPayload delays, and 10-20ms off every block processing 🎉
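
The pattern described above can be sketched roughly as follows. This is not the Lighthouse code: `Block`, `Header`, `VerifyError`, `verify`, and `verify_check_slashable` are hypothetical stand-ins, the "hash" is a toy fold, and the `header_calls` counter exists only to make the laziness observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a signed beacon block.
struct Block {
    slot: u64,
    body: Vec<u8>,
    /// Counts header computations, purely for demonstration.
    header_calls: AtomicUsize,
}

/// Hypothetical stand-in for a block header.
#[derive(Debug, PartialEq)]
struct Header {
    slot: u64,
    body_root: u64,
}

impl Block {
    /// "Expensive" step: walks the whole body (toy hash, not a real tree hash).
    fn header(&self) -> Header {
        self.header_calls.fetch_add(1, Ordering::SeqCst);
        let body_root = self.body.iter().fold(0xcbf29ce484222325u64, |acc, b| {
            (acc ^ u64::from(*b)).wrapping_mul(0x100000001b3)
        });
        Header { slot: self.slot, body_root }
    }
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    /// Carries the header so the caller can act on the failure.
    Slashable(Header),
}

/// Hypothetical inner verification: fails on an empty body.
fn verify(block: Arc<Block>) -> Result<u64, &'static str> {
    if block.body.is_empty() {
        Err("empty body")
    } else {
        Ok(block.slot)
    }
}

fn verify_check_slashable(block: Arc<Block>) -> Result<u64, VerifyError> {
    // Cheap on the hot path: clones the pointer, not the block.
    let block_for_error = Arc::clone(&block);
    // The closure only runs on `Err`, so the header is computed lazily.
    verify(block).map_err(|_| VerifyError::Slashable(block_for_error.header()))
}
```

The design choice is to pay one reference-count increment per block instead of one full hash per block, with the hash deferred to the rare error path.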


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Currently, whenever we build the `Dockerfile` for local development using kurtosis, it recompiles everything, even if no changes were made. This takes about 120 seconds on my laptop (it might be faster on others).


  Conservatively, I created a new Dockerfile.dev, so that the original file is kept the same, even though it's pretty similar.

This uses `--mount=type=cache`, saving the target and registry folders across builds.

**Usage**

```sh
docker build -f Dockerfile.dev -t lighthouse:dev .
```


Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>
Update `reqwest` to 0.12 so we only depend on a single version. This should slightly improve compile times and reduce binary bloat.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
When developing locally with kurtosis a typical dev workflow is:

loop:
- Build local lighthouse docker image
- Run kurtosis
- Observe bug
- Fix code

The docker build step would download and build all crates. The Docker docs suggest an optimization to cache build artifacts, see https://docs.docker.com/build/cache/optimize/#use-cache-mounts

I have tested and it's like building Lighthouse outside of a docker environment 🤤  The docker build time after changing one line in the top beacon_node crate is 50 seconds on my local machine ❤️

The release path is unaffected. Do you have worries this can affect the output of the release binaries? This is too good of an improvement to keep it in a separate Dockerfile.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
michaelsproul and others added 6 commits November 26, 2025 23:00
Since merging this PR, we don't need `--checkpoint-blobs`, even prior to Fulu:

- sigp#8417

This PR removes the mandatory check for blobs prior to Fulu, enabling simpler manual checkpoint sync.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>
Consolidate our property-testing around `proptest`. This PR was written with Copilot and manually tweaked.


Co-Authored-By: Michael Sproul <michael@sproul.xyz>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Fixes sigp#7785


  - [x] Update all integration tests with >1 files to follow the `main` pattern.
- [x] `crypto/eth2_key_derivation/tests`
- [x] `crypto/eth2_keystore/tests`
- [x] `crypto/eth2_wallet/tests`
- [x] `slasher/tests`
- [x] `common/eth2_interop_keypairs/tests`
- [x] `beacon_node/lighthouse_network/tests`
- [x] Set `debug_assertions` to false on `.vscode/settings.json`.
- [x] Document how to make rust analyzer work on integration tests files. In `book/src/contributing_setup.md`

---

Tracking a `rust-analyzer.toml` with settings like the one provided in `.vscode/settings.json` would be nicer. But this is not possible yet. For now, that config should be a good enough indicator for devs using editors different to VSCode.


Co-Authored-By: Daniel Ramirez-Chiquillo <hi@danielrachi.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Use the recently published `context_deserialize` and remove it from Lighthouse


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
…ckerHub (sigp#7614)

This pull request introduces workflows and updates to ensure reproducible builds for the Lighthouse project. It adds two GitHub Actions workflows for building and testing reproducible Docker images and binaries, updates the `Makefile` to streamline reproducible build configurations, and modifies the `Dockerfile.reproducible` to align with the new build process. Additionally, it removes the `reproducible` profile from `Cargo.toml`.


  ### New GitHub Actions Workflows:

* [`.github/workflows/docker-reproducible.yml`](diffhunk://#diff-222af23bee616920b04f5b92a83eb5106fce08abd885cd3a3b15b8beb5e789c3R1-R145): Adds a workflow to build and push reproducible multi-architecture Docker images for releases, including support for dry runs without pushing an image.

### Build Configuration Updates:

* [`Makefile`](diffhunk://#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L85-R143): Refactors reproducible build targets, centralizes environment variables for reproducibility, and updates Docker build arguments for `x86_64` and `aarch64` architectures.
* [`Dockerfile.reproducible`](diffhunk://#diff-587298ff141278ce3be7c54a559f9f31472cc5b384e285e2105b3dee319ba31dL1-R24): Updates the base Rust image to version 1.86, removes hardcoded reproducibility settings, and delegates build logic to the `Makefile`.
* Switch to using jemalloc-sys from Debian repos instead of building it from source. A Debian version is [reproducible](https://tests.reproducible-builds.org/debian/rb-pkg/trixie/amd64/jemalloc.html) which is [hard to achieve](NixOS/nixpkgs#380852) if you build it from source.

### Profile Removal:

* [`Cargo.toml`](diffhunk://#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542L289-L295): Removes the `reproducible` profile, simplifying build configurations and relying on external tooling for reproducibility.


Co-Authored-By: Moe Mahhouk <mohammed-mahhouk@hotmail.com>

Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
macladson and others added 29 commits December 15, 2025 03:20
sigp#8547


  Update our `strum` dependency to `0.27`. This unifies our strum dependencies and removes our duplication of `strum` (and by extension, `strum_macros`).


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
sigp#8547


  We are currently using an older version of `syn` in `test_random_derive`. Updating this removes one of the sources of `syn` `1.0.109` in our dependency tree.


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
Fixes the error `fatal: No names found, cannot describe anything.` that occurs when running `make`
commands in CI (GitHub Actions).

https://github.com/sigp/lighthouse/actions/runs/19839541042/job/56844781126#step:5:13
> fatal: No names found, cannot describe anything.


  Changed the `GIT_TAG` variable assignment in the Makefile from immediate evaluation to lazy evaluation:

```diff
- GIT_TAG := $(shell git describe --tags --candidates 1)
+ GIT_TAG = $(shell git describe --tags --candidates 1)
```

This change ensures that git describe is only executed when `GIT_TAG` is actually used (in the `build-release-tarballs` target), rather than on every Makefile invocation.


Co-Authored-By: ackintosh <sora.akatsuki@gmail.com>
…sigp#8499)

Which issue # does this PR address?
None


  The `visualize_batch_state` function uses the loop `for mut batch_index in 0..BATCH_BUFFER_SIZE`, so `batch_index` ranges from `0` to `BATCH_BUFFER_SIZE - 1`.

Hence the following condition is always true, and a comma is pushed after every entry, including the last:

```rust
if batch_index != BATCH_BUFFER_SIZE {
    visualization_string.push(',');
}
```

Replacing `!=` with `<` and comparing against `BATCH_BUFFER_SIZE - 1` changes the output as follows:

`[A,B,C,D,E,]` becomes `[A,B,C,D,E]`
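
A minimal, self-contained sketch of the corrected separator logic. The function name `visualize`, the `char` states, and the value `5` for `BATCH_BUFFER_SIZE` are assumptions chosen to match the five-entry example above; the real function works on sync batch state.

```rust
/// Assumed value, chosen to match the five-entry example; the real constant may differ.
const BATCH_BUFFER_SIZE: usize = 5;

/// Hypothetical simplification of the visualization: one character per batch.
fn visualize(states: &[char; BATCH_BUFFER_SIZE]) -> String {
    let mut visualization_string = String::from("[");
    for (batch_index, state) in states.iter().enumerate() {
        visualization_string.push(*state);
        // `batch_index` never equals BATCH_BUFFER_SIZE inside the loop,
        // so compare against the last valid index instead.
        if batch_index < BATCH_BUFFER_SIZE - 1 {
            visualization_string.push(',');
        }
    }
    visualization_string.push(']');
    visualization_string
}
```

Calling `visualize(&['A', 'B', 'C', 'D', 'E'])` produces `[A,B,C,D,E]` with no trailing comma.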


Co-Authored-By: Antoine James <antoine@ethereum.org>
* sigp#7201


  


Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>

Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Tan Chee Keong <tanck2005@gmail.com>
* sigp#7850

This is the first round of the conga line! 🎉

Just spec constants and container changes so far.


  


Co-Authored-By: shane-moore <skm1790@gmail.com>

Co-Authored-By: Mark Mackey <mark@sigmaprime.io>

Co-Authored-By: Shane K Moore <41407272+shane-moore@users.noreply.github.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: ethDreamer <37123614+ethDreamer@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
While reviewing Gloas I noticed we were updating `PartialBeaconState`. This code isn't used since v7.1.0 introduced hdiffs, so we can delete it and stop maintaining it 🎉

Similarly the `chunked_vector`/`chunked_iter` code can also go!


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
Closes:

- sigp#8408


  Add `cargo deny` on CI with deprecated crates (`ethers` and `ethereum-types`) banned and duplicates banned for `reqwest`.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
A few `cargo-deny` tweaks with @macladson

Co-authored-by: Mac L <mjladson@pm.me>


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: Mac L <mjladson@pm.me>
sigp#8547


  This is a low hanging fruit dependency update to remove one of the duplicate versions of `rustix`


Co-Authored-By: Mac L <mjladson@pm.me>
I was resolving CI issues for my Gloas block production [PR](sigp#8313), and noticed the `make audit-CI` [check](https://github.com/sigp/lighthouse/actions/runs/20588442102/job/59129268003) was failing due to:
```
Crate:     ruint
Version:   1.17.0
Title:     Unsoundness of safe `reciprocal_mg10`
Date:      2025-12-22
ID:        RUSTSEC-2025-0137
URL:       https://rustsec.org/advisories/RUSTSEC-2025-0137
Solution:  Upgrade to >=1.17.1
```


  Using the latest stable rust, `1.92.0`, I ran `cargo update ruint` -> `cargo check` -> `make audit-CI`, which passed


Co-Authored-By: shane-moore <skm1790@gmail.com>
Which issue # does this PR address?
sigp#8586


  Please list or describe the changes introduced by this PR.
Remove `service_name` from `TaskExecutor`


Co-Authored-By: Abhivansh <31abhivanshj@gmail.com>
…igp#8614)

This PR does two small things:

- Removes the allocations that were happening on each loop
- Makes it more explicit that the bit in the index is only being used to specify the order of the inputs for the hash function
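
The commit title is truncated, so the function being changed is not named here. As a rough illustration of both bullets, the sketch below assumes a Merkle-branch style fold. `fold_branch` and `toy_hash` are hypothetical names, and `toy_hash` is a non-cryptographic stand-in used only to keep the example dependency-free.

```rust
/// Toy 64-byte to 32-byte mixing function. NOT cryptographic; a stand-in
/// for a real hash so the example has no external dependencies.
fn toy_hash(input: &[u8; 64]) -> [u8; 32] {
    let mut out = [0u8; 32];
    let mut acc: u64 = 0xcbf29ce484222325;
    for (i, byte) in input.iter().enumerate() {
        acc = (acc ^ u64::from(*byte)).wrapping_mul(0x100000001b3);
        out[i % 32] ^= (acc >> 24) as u8;
    }
    out
}

/// Folds `leaf` up through `branch`, one sibling per level.
fn fold_branch(leaf: [u8; 32], branch: &[[u8; 32]], index: usize) -> [u8; 32] {
    let mut node = leaf;
    // One stack buffer reused at every level: no per-iteration heap allocation.
    let mut buf = [0u8; 64];
    for (depth, sibling) in branch.iter().enumerate() {
        // The index bit is used for exactly one thing: deciding which of the
        // two inputs goes first into the hash function.
        let node_is_right = (index >> depth) & 1 == 1;
        let (left, right) = if node_is_right {
            (sibling, &node)
        } else {
            (&node, sibling)
        };
        buf[..32].copy_from_slice(left);
        buf[32..].copy_from_slice(right);
        node = toy_hash(&buf);
    }
    node
}
```

Naming the boolean `node_is_right` and selecting a `(left, right)` pair up front keeps the ordering decision separate from the hashing step.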


  


Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>
Closes sigp#8569


  Updates the HTTP API error when the node cannot reconstruct blobs due to "Insufficient data columns".

Changes the response from 500 Internal Server Error to 400 Bad Request and adds a hint to run with --supernode or --semi-supernode.


Co-Authored-By: Andrurachi <andruvrch@gmail.com>
Fixes attester cache write lock contention. Alternative to sigp#8463.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: shane-moore <skm1790@gmail.com>
sigp#8547


  This unifies the following `crypto` dependencies to a single version each:

- `sha2`
- `hmac`
- `pbkdf2`
- `aes`
- `cipher`
- `ctr`
- `scrypt`
- `digest`


Co-Authored-By: Mac L <mjladson@pm.me>
```bash
$ lcli mock-el ....
...
...
Dec 15 11:52:06.002 INFO  Metrics HTTP server started                   listen_address: "127.0.0.1:8551"
...
```

The log message "Metrics HTTP server" was misleading, as the server is actually a Mock Execution Client that provides a JSON-RPC API for testing purposes, not a metrics server.


  


Co-Authored-By: ackintosh <sora.akatsuki@gmail.com>
Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>
…gp#8498)

Which issue # does this PR address?
None


  As discussed privately with @jimmygchen: Lighthouse's `earliest_available_slot` is guaranteed to always align with epoch boundaries, but as a safety measure we should use `start_slot` in case other clients differ in their implementations.

We agreed this is safer at least for `synced_peers_for_epoch`. I also made the change in `has_good_custody_range_sync_peer`, but that one needs review, please.
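
To illustrate why the comparison granularity matters, here is a minimal sketch using plain `u64` slots and epochs. The two function names and the "before" behaviour are assumptions made for illustration; only `earliest_available_slot` and `start_slot` come from the description above. `SLOTS_PER_EPOCH = 32` is the mainnet value.

```rust
/// Mainnet value; other presets differ.
const SLOTS_PER_EPOCH: u64 = 32;

/// First slot of `epoch`.
fn start_slot(epoch: u64) -> u64 {
    epoch * SLOTS_PER_EPOCH
}

/// Assumed "before" shape: rounds the peer's slot down to its epoch, which is
/// only correct if the slot is always aligned to an epoch boundary.
fn can_serve_epoch_by_epoch(earliest_available_slot: u64, epoch: u64) -> bool {
    earliest_available_slot / SLOTS_PER_EPOCH <= epoch
}

/// Safer shape: the peer must have data from the first slot of the epoch.
fn can_serve_epoch_by_start_slot(earliest_available_slot: u64, epoch: u64) -> bool {
    earliest_available_slot <= start_slot(epoch)
}
```

For an aligned slot such as 64, both checks agree for epoch 2. For a mid-epoch slot such as 70, the epoch-granular check still accepts epoch 2 even though slots 64 to 69 are unavailable, while the `start_slot` comparison rejects it.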


Co-Authored-By: Antoine James <antoine@ethereum.org>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
Just visual clean-up, making logging statements look uniform. There's no reason to use `tracing::debug` instead of `debug`. If we ever need to migrate our logging lib in the future it would make things easier too.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
…#8141)

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
N/A


  The `beacon_data_column_sidecar_computation_seconds` metric used to record the full KZG proof generation time, before we changed getBlobsV2 to just return the full proofs + cells. The measured time should now be far less than 100ms, which was previously the smallest bucket.

Update the metric to use the default buckets for better granularity.


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
N/A


  Add standardized metrics for getBlobsV2 from ethereum/beacon-metrics#14.


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
…t from eth2_network_config (sigp#8638)

sigp#5019


  If there is a known eth2_network_config, we read the genesis time and validators root from the config.


Co-Authored-By: Jimmy Chu <898091+jimmychu0807@users.noreply.github.com>
@eserilev eserilev closed this Jan 14, 2026