Force consumers to decide what finality checkpoints they want to read#52

Closed
dapplion wants to merge 19 commits into unstable from fork-choice-optimistic-mode

Conversation

@dapplion
Owner

@dapplion dapplion commented Oct 17, 2025

Fork-choice internal checkpoints may not be the ones the network agrees on. If the node starts with checkpoint sync, we set the checkpoints to something that does not match the head state at that moment. If checkpoint sync is used with a checkpoint that is not finalized, this can lead to confusion in the code.

This PR makes this oddity explicit. Consumers must choose between the local checkpoints (potentially ahead of the network-wide on-chain checkpoint) and the actual finality checkpoint of the network.
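To illustrate the distinction, here is a hypothetical sketch of the choice this PR forces on consumers. The type and method names (`FinalityView`, `network_finalized`, `local_finalized`) are illustrative only, not Lighthouse's actual API.

```rust
// Hypothetical sketch: fork choice exposes both finality views and the
// caller must pick one explicitly. Names are illustrative, not Lighthouse's.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Checkpoint {
    epoch: u64, // root omitted for brevity
}

struct FinalityView {
    /// Local checkpoint, potentially ahead of the network (e.g. after
    /// checkpoint sync from a recent state, or manual finalization).
    local: Checkpoint,
    /// The finalized checkpoint the whole network agrees on (on-chain).
    on_chain: Checkpoint,
}

impl FinalityView {
    /// What e.g. the Beacon API "finalized" tag should serve.
    fn network_finalized(&self) -> Checkpoint {
        self.on_chain
    }
    /// What internal consumers such as pruning may want instead.
    fn local_finalized(&self) -> Checkpoint {
        self.local
    }
}

fn main() {
    // Mirrors the logs later in this PR: on-chain finality at epoch 15,
    // local checkpoint at epoch 20 after checkpoint sync.
    let view = FinalityView {
        local: Checkpoint { epoch: 20 },
        on_chain: Checkpoint { epoch: 15 },
    };
    assert_eq!(view.network_finalized().epoch, 15);
    assert_eq!(view.local_finalized().epoch, 20);
    println!("ok");
}
```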

TODO:

Clarify in the Beacon API spec that the "finality" tag refers to network-wide finality: https://github.com/ethereum/beacon-APIs/blob/8abac03526770b10ab49be4d186d468629127413/params/index.yaml#L10

Also clarify whether the finalized hash sent to the EL refers to network-wide or local finality: https://github.com/ethereum/beacon-APIs/blob/8abac03526770b10ab49be4d186d468629127413/params/index.yaml#L10

@dapplion dapplion force-pushed the fork-choice-optimistic-mode branch 3 times, most recently from bc235b2 to 51f0968 Compare October 19, 2025 22:27
@dapplion dapplion force-pushed the fork-choice-optimistic-mode branch from 51f0968 to 1b90859 Compare October 19, 2025 22:29
@dapplion
Owner Author

Test of ff431ac

Started a Kurtosis network with 6 participants on the latest unstable:

participants:
  - el_type: geth
    el_image: ethereum/client-go:latest
    cl_type: lighthouse
    cl_image: sigp/lighthouse:latest-unstable
    cl_extra_params:
      - --target-peers=7
    vc_extra_params:
      - --use-long-timeouts
      - --long-timeouts-multiplier=3
    count: 6
    validator_count: 16
network_params:
  electra_fork_epoch: 0
  seconds_per_slot: 3
  genesis_delay: 400
global_log_level: debug
snooper_enabled: false
additional_services:
  - dora
  - spamoor
  - prometheus_grafana
  - tempo

I let the network run for ~15 epochs, then stopped 50% of the validators. I let the network run without finality for many epochs, then started a Docker build of this branch:

version: "3.9"

services:
  cl-lighthouse-syncer:
    image: "sigp/lighthouse:non-fin"
    command: >
      lighthouse beacon_node
      --debug-level=debug
      --datadir=/data/lighthouse/beacon-data
      --listen-address=0.0.0.0
      --port=9000
      --http
      --http-address=0.0.0.0
      --http-port=4000
      --disable-packet-filter
      --execution-endpoints=http://172.16.0.89:8551
      --jwt-secrets=/jwt/jwtsecret
      --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776
      --disable-enr-auto-update
      --enr-address=172.16.0.21
      --enr-tcp-port=9000
      --enr-udp-port=9000
      --enr-quic-port=9001
      --quic-port=9001
      --metrics
      --metrics-address=0.0.0.0
      --metrics-allow-origin=*
      --metrics-port=5054
      --enable-private-discovery
      --testnet-dir=/network-configs
      --boot-nodes=enr:-OK4QHWvCmwiaEj8437Z6Wlk32gLVM5Hbw9n6PesII42toDOPmdquevxog8OS8SMMru3VjRvo3qOk80qCXzOUEa8ecoDh2F0dG5ldHOIAAAAAAAAAMCGY2xpZW501opMaWdodGhvdXNlijguMC4wLXJjLjKEZXRoMpASBoYoYAAAOP__________gmlkgnY0gmlwhKwQABKEcXVpY4IjKYlzZWNwMjU2azGhAmc8-hvS_9yO5fBwlBhgTYVDSdOtFJW7uVpTmYkVcZBWiHN5bmNuZXRzAIN0Y3CCIyiDdWRwgiMo
      --target-peers=3
      --execution-timeout-multiplier=3
      --checkpoint-block=/blocks/block_640.ssz
      --checkpoint-state=/blocks/state_640.ssz
      --checkpoint-blobs=/blocks/blobs_640.ssz
    environment:
      - RUST_BACKTRACE=full
    extra_hosts:
      # Allow container to reach host service (Linux-compatible)
      - "host.docker.internal:host-gateway"
    volumes:
      - configs:/network-configs
      - jwt:/jwt
      - /root/kurtosis-non-fin/blocks:/blocks
    ports:
      - "33400:4000/tcp"   # HTTP API
      - "33554:5054/tcp"   # Metrics
      - "33900:9000/tcp"   # Libp2p TCP
      - "33900:9000/udp"   # Libp2p UDP
      - "33901:9001/udp"   # QUIC
    networks:
      kt:
        ipv4_address: 172.16.0.88
    shm_size: "64m"

  lcli-mock-el:
    image: sigp/lcli
    command: >
      lcli mock-el
      --listen-address 0.0.0.0
      --listen-port 8551
      --jwt-output-path=/jwt/jwtsecret
    volumes:
      - jwt:/jwt
    ports:
      - "33851:8551"
    networks:
      kt:
        ipv4_address: 172.16.0.89

networks:
  kt:
    external: true
    name: kt-quiet-crater

volumes:
  configs:
    name: files-artifact-expansion--e56f64e9c6aa4409b27b11e37d1ab4d3--bc0964a8b6c54745ba6473aaa684a81e
    external: true
  jwt:
    name: files-artifact-expansion--870bc5edd3eb44598f50a70ada54cd31--bc0964a8b6c54745ba6473aaa684a81e
    external: true

All logs below are from the cl-lighthouse-syncer container

The node started with checkpoint sync from a checkpoint more recent than the latest finalized one, specifically epoch 20, and range-synced to head without issues. See the Synced log with finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2/20

Then I triggered manual finalization into a more recent non-finalized checkpoint. It now logs finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022/727

Oct 25 13:42:24.501 INFO  Synced                                        peers: "3", exec_hash: "0x1ec25768c49daf9edf41a93a23ce3c2419c0412d016a0a32b135922947dd91a0 (unverified)", finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2/20, epoch: 732, block: "0x2e69210f7caba54c127b79b1959ab5fdb9fe04948365ce4a46d44e1f103fd7e1", slot: 23439
Oct 25 13:42:27.127 DEBUG Processed HTTP API request                    elapsed_ms: 63.17743300000001, status: 200 OK, path: /lighthouse/finalize, method: POST
Oct 25 13:42:27.501 INFO  Synced                                        peers: "3", exec_hash: "0x1ec25768c49daf9edf41a93a23ce3c2419c0412d016a0a32b135922947dd91a0 (unverified)", finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022/727, epoch: 732, block: "   …  empty", slot: 23440
Oct 25 13:42:27.668 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(640), state_root: 0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2 }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(727), root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6
Oct 25 13:42:27.878 DEBUG Extra pruning information                     new_finalized_checkpoint: Checkpoint { epoch: Epoch(727), root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6, split_prior_to_migration: Split { slot: Slot(640), state_root: 0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2 }, newly_finalized_blocks: 22625, newly_finalized_state_roots: 22625, newly_finalized_states_min_slot: 640, required_finalized_diff_state_slots: [Slot(23264), Slot(23040), Slot(22528), Slot(16384), Slot(640)], kept_summaries_for_hdiff: [(0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, Slot(640)), (0x77de49b559e30ac0ffd9f1fdff87c19a57803fc8036b62f387dcc55885a83f47, Slot(16384)), (0xa99ffc63658d2cd51b8d0f9e59caa755f376cd46e1d08ecc0acc9ff77eab4192, Slot(22528)), (0x5d8a9d02d3f88d1c8a51f8ed533933e593bee96f335756336bbc654c4a898542, Slot(23040))], state_summaries_count: 22989, state_summaries_dag_roots: [(0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, DAGStateSummary { slot: Slot(640), latest_block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2, latest_block_slot: Slot(640), previous_state_root: 0xc210447f3b63eca6d76073d7f647a22a069b6c69c0dd275a723f2ed8acf566fd })], finalized_and_descendant_state_roots_of_finalized_checkpoint: 277, blocks_to_prune: 0, states_to_prune: 22708
Oct 25 13:42:28.162 DEBUG Database pruning complete                     new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6
Oct 25 13:42:28.164 INFO  Starting database compaction                  old_finalized_epoch: 20, new_finalized_epoch: 727

Then I restarted the validators and the network finalized. See that the node pruned from the latest manual finalization. It now logs finalized_checkpoint: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a/736

Oct 25 13:51:14.585 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(23264), state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6, block_root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(736), root: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a }, new_finalized_state_root: 0x8de7833fb5b0a73211982b32bfa5c1e7b79154386150a04d6be60afd62b92988
Oct 25 13:53:51.500 INFO  Synced                                        peers: "3", exec_hash: "0x875c7aa7b85bd792a7842858a34c507e70f5eb05af80cfe170c891cacbe15c19 (unverified)", finalized_checkpoint: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a/736, epoch: 739, block: "0x70782ea183f9c5edfbf9709e9728b2a8c775b2f72e70b9e715f1d815cc236471", slot: 23668

Then I stopped 50% of the validators again and triggered manual finalization. See that it transitions from finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740 to finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740/local/0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729/746

Oct 25 14:09:24.500 INFO  Synced                                        peers: "3", exec_hash: "0xa106339a1ccf80a10e1c5cdd2b50e8d5c58cb87f66c2574527d054d710886a5c (unverified)", finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740, epoch: 749, block: "   …  empty", slot: 23979

Oct 25 14:09:26.157 DEBUG Processed HTTP API request                    elapsed_ms: 0.6780459999999999, status: 200 OK, path: /lighthouse/finalize, method: POST
Oct 25 14:09:27.111 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(23680), state_root: 0xce62a2b2dd6f00d9aacb3469050dcd1d1035e84457adba3d85f6142be3a2013c, block_root: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(746), root: 0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729 }, new_finalized_state_root: 0x77c39c0504220d7a563a9403e25a05ccf46d7294c4282787f564559a191a6f2d

Oct 25 14:09:27.504 INFO  Synced                                        peers: "3", exec_hash: "0xa106339a1ccf80a10e1c5cdd2b50e8d5c58cb87f66c2574527d054d710886a5c (unverified)", finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740/local/0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729/746, epoch: 749, block: "   …  empty", slot: 23980

Then I restarted the validators, and the network finalized again:

Oct 25 14:21:39.500 INFO  Synced                                        peers: "3", exec_hash: "0x2ed485a642ad02a2f11208c8af1aafbc7f2e961344fb47424c61d64fee71f43e (unverified)", finalized_checkpoint: 0x67160358278de47d9d7a4c88bd07391c7470edc4a431047145d4edd183de7915/755, epoch: 757, block: "0xca984edf928634d9f7460b1cfe64ed73b80f99d8ed8cebda2a752e9a6cb3c995", slot: 24224

@dapplion
Owner Author

Notes: I tested manual finalization and checkpoint syncing only into blocks that are first in their epoch. In prior tests, using non-aligned blocks broke things, and I still don't know why.

unrealized_justified_state_root: justified_state_root,
unrealized_finalized_checkpoint: finalized_checkpoint,
justified_state_root: anchor_state_root,
finalized_checkpoint: finalized_checkpoint_on_chain,

Very relevant change: fork choice is now initialized to the network's finalized and justified checkpoints. Previously we always initialized to a "dummy" checkpoint derived from the anchor state. That "dummy" checkpoint was correct because we expected the anchor state to be exactly the finalized state.

With this change the initial checkpoints have a root for which we don't have a ProtoNode available. This is fine; see the fork-choice diff.
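The change described above can be sketched roughly as follows. Types and names (`AnchorState`, `ForkChoiceStore::from_anchor`) are assumptions for illustration, not Lighthouse's actual code.

```rust
// Illustrative sketch of the initialization change: fork choice starts from
// the checkpoints recorded in the anchor state (the network's on-chain view)
// rather than from a "dummy" checkpoint derived from the anchor block itself.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Checkpoint {
    epoch: u64,
    root: u64, // placeholder for a 32-byte block root
}

/// The state the node starts from (e.g. via checkpoint sync).
struct AnchorState {
    finalized_checkpoint: Checkpoint,
    current_justified_checkpoint: Checkpoint,
}

struct ForkChoiceStore {
    finalized_checkpoint: Checkpoint,
    justified_checkpoint: Checkpoint,
}

impl ForkChoiceStore {
    /// After this PR: use the state's own on-chain checkpoints. Their roots
    /// may have no ProtoNode in proto-array yet; the fork-choice diff makes
    /// that case legal.
    fn from_anchor(anchor: &AnchorState) -> Self {
        Self {
            finalized_checkpoint: anchor.finalized_checkpoint,
            justified_checkpoint: anchor.current_justified_checkpoint,
        }
    }
}

fn main() {
    let anchor = AnchorState {
        finalized_checkpoint: Checkpoint { epoch: 15, root: 0xc1ed },
        current_justified_checkpoint: Checkpoint { epoch: 16, root: 0xaaaa },
    };
    let store = ForkChoiceStore::from_anchor(&anchor);
    // The store reflects on-chain finality, not a dummy anchor checkpoint.
    assert_eq!(store.finalized_checkpoint.epoch, 15);
    assert_eq!(store.justified_checkpoint.epoch, 16);
    println!("ok");
}
```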

michaelsproul and others added 12 commits November 5, 2025 02:08
This is an optimisation targeted at Fulu networks in non-finality.

While debugging on Holesky, we found that `state_root_at_slot` was being called from `prepare_beacon_proposer` a lot, for the finalized state:

https://github.com/sigp/lighthouse/blob/2c9b670f5d313450252c6cb40a5ee34802d54fef/beacon_node/http_api/src/lib.rs#L3860-L3861

This was causing `prepare_beacon_proposer` calls to take upwards of 5 seconds, sometimes 10 seconds, because it would trigger _multiple_ beacon state loads in order to iterate back to the finalized slot. Ideally, loading the finalized state should be quick because we keep it cached in the state cache (technically we keep the split state, but they usually coincide). Instead we are computing the finalized state root separately (slow), and then loading the state from the cache (fast).

Although it would be possible to make the API faster by removing the `state_root_at_slot` call, I believe it's simpler to change `state_root_at_slot` itself and remove the footgun. Devs rightly expect operations involving the finalized state to be fast.
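The fast path described above can be sketched as follows. Names (`Split`, `Store`, `state_root_at_slot`) are assumed for illustration and simplified from the real store.

```rust
// Minimal sketch of the fix: `state_root_at_slot` short-circuits for the
// split (usually finalized) slot instead of iterating states back from head.

/// The store's split point: the boundary between the hot and cold databases.
#[derive(Clone, Copy)]
struct Split {
    slot: u64,
    state_root: u64, // placeholder for a 32-byte state root
}

struct Store {
    split: Split,
}

impl Store {
    fn state_root_at_slot(&self, slot: u64) -> u64 {
        if slot == self.split.slot {
            // Fast path: the split state root is already known, and the
            // split state is kept in the state cache, so no beacon states
            // need to be loaded at all.
            return self.split.state_root;
        }
        // Slow path (elided): iterate state roots backwards from head,
        // potentially loading multiple beacon states along the way.
        unimplemented!("slow path elided in this sketch")
    }
}

fn main() {
    let store = Store { split: Split { slot: 640, state_root: 0x890a } };
    // Requesting the split slot avoids the expensive iteration entirely.
    assert_eq!(store.state_root_at_slot(640), 0x890a);
    println!("ok");
}
```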


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Remove all Windows-related CI jobs


  


Co-Authored-By: antondlr <anton@sigmaprime.io>
Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
While working on sigp#7892, @michaelsproul pointed out that it might be better to measure the delay from the start of the slot instead of the current `slot_duration / 3`, since attestation duties now start before the 1/3 mark with the change in the linked PR.


Co-Authored-By: hopinheimer <knmanas6@gmail.com>

Co-Authored-By: hopinheimer <48147533+hopinheimer@users.noreply.github.com>
### Downgrade a non-error to `Debug`

I noticed this error on one of our hoodi nodes:

```
Nov 04 05:13:38.892 ERROR Error during data column reconstruction       block_root: 0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190, error: DuplicateFullyImported(0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190)
```

This shouldn't be logged as an error: it's due to a normal race condition and doesn't impact the node negatively.

### Remove spammy logs

This log is filling up the log files quite quickly, and it is also something we'd expect during normal operation: receiving columns via the EL before gossip. We haven't found this debug log useful, so I propose removing it to avoid spamming the debug logs.

```
Received already available column sidecar. Ignoring the column sidecar
```

In the process of removing this, I noticed we weren't propagating the validation result, which I think we should, so I've added that. The impact should be quite minimal: the message will stay in the gossip memcache a bit longer but should be evicted in the next heartbeat.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Another good candidate for publishing separately from Lighthouse is `sensitive_url`, as it's a general utility crate not related to Ethereum. This PR prepares it to be spun out into its own crate.


I've made the `full` field on `SensitiveUrl` private and instead provided an explicit getter called `.expose_full()`. It's a bit ugly in the diff, but I prefer the explicit nature of the getter.
I've also added some extra tests and doc strings, along with feature-gating the `Serialize` and `Deserialize` implementations behind the `serde` feature.


Co-Authored-By: Mac L <mjladson@pm.me>
This compiles; is there any reason to keep `ecdsa`? CC @jxs


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Self-hosted GitHub Runners review and improvements

The local testnet workflow now uses the WarpBuild CI runner.


Co-Authored-By: lemon <snyxmk@gmail.com>

Co-Authored-By: antondlr <anton@sigmaprime.io>
Use the recently published `sensitive_url` and remove it from Lighthouse


Co-Authored-By: Mac L <mjladson@pm.me>
Fixes sigp#7001.


  Mostly mechanical replacement of `derivative` attributes with `educe` ones.

### **Attribute Syntax Changes**

```rust
// Bounds: = "..." → (...)
#[derivative(Hash(bound = "E: EthSpec"))]
#[educe(Hash(bound(E: EthSpec)))]

// Ignore: = "ignore" → (ignore)
#[derivative(PartialEq = "ignore")]
#[educe(PartialEq(ignore))]

// Default values: value = "..." → expression = ...
#[derivative(Default(value = "ForkName::Base"))]
#[educe(Default(expression = ForkName::Base))]

// Methods: format_with/compare_with = "..." → method(...)
#[derivative(Debug(format_with = "fmt_peer_set_as_len"))]
#[educe(Debug(method(fmt_peer_set_as_len)))]

// Empty bounds: removed entirely, educe can infer appropriate bounds
#[derivative(Default(bound = ""))]
#[educe(Default)]

// Transparent debug: manual implementation (educe doesn't support it)
#[derivative(Debug = "transparent")]
// Replaced with manual Debug impl that delegates to inner field
```

**Note**: Some bounds use strings (`bound("E: EthSpec")`) for superstruct compatibility (otherwise it fails with `expected ','` errors).


Co-Authored-By: Javier Chávarri <javier.chavarri@gmail.com>

Co-Authored-By: Mac L <mjladson@pm.me>
@dapplion dapplion closed this Nov 7, 2025