Force consumers to decide what finality checkpoints they want to read#52
Force consumers to decide what finality checkpoints they want to read#52
Conversation
bc235b2 to
51f0968
Compare
51f0968 to
1b90859
Compare
Test of ff431acStarted a kurtosis network with 6 participants on latest unstable participants:
- el_type: geth
el_image: ethereum/client-go:latest
cl_type: lighthouse
cl_image: sigp/lighthouse:latest-unstable
cl_extra_params:
- --target-peers=7
vc_extra_params:
- --use-long-timeouts
- --long-timeouts-multiplier=3
count: 6
validator_count: 16
network_params:
electra_fork_epoch: 0
seconds_per_slot: 3
genesis_delay: 400
global_log_level: debug
snooper_enabled: false
additional_services:
- dora
- spamoor
- prometheus_grafana
- tempoLet the network run for ~15 epochs and stopped 50% of the validators. Let the network run in non finality for many epochs, and started a docker build of this branch version: "3.9"
services:
cl-lighthouse-syncer:
image: "sigp/lighthouse:non-fin"
command: >
lighthouse beacon_node
--debug-level=debug
--datadir=/data/lighthouse/beacon-data
--listen-address=0.0.0.0
--port=9000
--http
--http-address=0.0.0.0
--http-port=4000
--disable-packet-filter
--execution-endpoints=http://172.16.0.89:8551
--jwt-secrets=/jwt/jwtsecret
--suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776
--disable-enr-auto-update
--enr-address=172.16.0.21
--enr-tcp-port=9000
--enr-udp-port=9000
--enr-quic-port=9001
--quic-port=9001
--metrics
--metrics-address=0.0.0.0
--metrics-allow-origin=*
--metrics-port=5054
--enable-private-discovery
--testnet-dir=/network-configs
--boot-nodes=enr:-OK4QHWvCmwiaEj8437Z6Wlk32gLVM5Hbw9n6PesII42toDOPmdquevxog8OS8SMMru3VjRvo3qOk80qCXzOUEa8ecoDh2F0dG5ldHOIAAAAAAAAAMCGY2xpZW501opMaWdodGhvdXNlijguMC4wLXJjLjKEZXRoMpASBoYoYAAAOP__________gmlkgnY0gmlwhKwQABKEcXVpY4IjKYlzZWNwMjU2azGhAmc8-hvS_9yO5fBwlBhgTYVDSdOtFJW7uVpTmYkVcZBWiHN5bmNuZXRzAIN0Y3CCIyiDdWRwgiMo
--target-peers=3
--execution-timeout-multiplier=3
--checkpoint-block=/blocks/block_640.ssz
--checkpoint-state=/blocks/state_640.ssz
--checkpoint-blobs=/blocks/blobs_640.ssz
environment:
- RUST_BACKTRACE=full
extra_hosts:
# Allow container to reach host service (Linux-compatible)
- "host.docker.internal:host-gateway"
volumes:
- configs:/network-configs
- jwt:/jwt
- /root/kurtosis-non-fin/blocks:/blocks
ports:
- "33400:4000/tcp" # HTTP API
- "33554:5054/tcp" # Metrics
- "33900:9000/tcp" # Libp2p TCP
- "33900:9000/udp" # Libp2p UDP
- "33901:9001/udp" # QUIC
networks:
kt:
ipv4_address: 172.16.0.88
shm_size: "64m"
lcli-mock-el:
image: sigp/lcli
command: >
lcli mock-el
--listen-address 0.0.0.0
--listen-port 8551
--jwt-output-path=/jwt/jwtsecret
volumes:
- jwt:/jwt
ports:
- "33851:8551"
networks:
kt:
ipv4_address: 172.16.0.89
networks:
kt:
external: true
name: kt-quiet-crater
volumes:
configs:
name: files-artifact-expansion--e56f64e9c6aa4409b27b11e37d1ab4d3--bc0964a8b6c54745ba6473aaa684a81e
external: true
jwt:
name: files-artifact-expansion--870bc5edd3eb44598f50a70ada54cd31--bc0964a8b6c54745ba6473aaa684a81e
external: trueAll logs below are from the The node starts with checkpoint sync at a more recent checkpoint than latest finalized, specifically epoch 20. The node range synced to head without issues. See log Then I triggered manual finalization into a more recent non-finalized checkpoint. It now logs Then I restarted the validators and the network finalized. See that the node pruned from the latest manual finalization. It now logs Then I stopped 50% of the validators again and triggered manual finalization. See that it transitions from Restart the validators, and the network finalizes again |
|
Notes: I tested manual finalization and checkpoint syncing into only blocks that are first in epoch. In prior tests using non-aligned blocks broke, and I still don't know the reason |
| unrealized_justified_state_root: justified_state_root, | ||
| unrealized_finalized_checkpoint: finalized_checkpoint, | ||
| justified_state_root: anchor_state_root, | ||
| finalized_checkpoint: finalized_checkpoint_on_chain, |
There was a problem hiding this comment.
Very relevant change: now fork-choice is initialized to the network's finalized and justified checkpoints. Previously we initialized always to a "dummy" checkpoint derived from the anchor state. That "dummy" checkpoint was correct as we expected the anchor state to be exactly the finalized state.
With this change the initial checkpoints have a root for which we don't have a ProtoNode available. This is fine, see fork-choice diff
This is an optimisation targeted at Fulu networks in non-finality. While debugging on Holesky, we found that `state_root_at_slot` was being called from `prepare_beacon_proposer` a lot, for the finalized state: https://github.com/sigp/lighthouse/blob/2c9b670f5d313450252c6cb40a5ee34802d54fef/beacon_node/http_api/src/lib.rs#L3860-L3861 This was causing `prepare_beacon_proposer` calls to take upwards of 5 seconds, sometimes 10 seconds, because it would trigger _multiple_ beacon state loads in order to iterate back to the finalized slot. Ideally, loading the finalized state should be quick because we keep it cached in the state cache (technically we keep the split state, but they usually coincide). Instead we are computing the finalized state root separately (slow), and then loading the state from the cache (fast). Although it would be possible to make the API faster by removing the `state_root_at_slot` call, I believe it's simpler to change `state_root_at_slot` itself and remove the footgun. Devs rightly expect operations involving the finalized state to be fast. Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Remove all Windows-related CI jobs Co-Authored-By: antondlr <anton@sigmaprime.io>
Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io> Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
while working on this sigp#7892 @michaelsproul pointed it might be a good metric to measure the delay from start of the slot instead of the current `slot_duration / 3`, since the attestations duties start before the `1/3rd` mark now with the change in the link PR. Co-Authored-By: hopinheimer <knmanas6@gmail.com> Co-Authored-By: hopinheimer <48147533+hopinheimer@users.noreply.github.com>
### Downgrade a non error to `Debug` I noticed this error on one of our hoodi nodes: ``` Nov 04 05:13:38.892 ERROR Error during data column reconstruction block_root: 0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190, error: DuplicateFullyImported(0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190) ``` This shouldn't be logged as an error and it's due to a normal race condition, and it doesn't impact the node negatively. ### Remove spammy logs This logs is filling up the log files quite quickly and it is also something we'd expect during normal operation - getting columns via EL before gossip. We haven't found this debug log to be useful, so I propose we remove it to avoid spamming debug logs. ``` Received already available column sidecar. Ignoring the column sidecar ``` In the process of removing this, I noticed we aren't propagating the validation result, which I think we should so I've added this. The impact should be quite minimal - the message will stay in the gossip memcache for a bit longer but should be evicted in the next heartbeat. Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Another good candidate for publishing separately from Lighthouse is `sensitive_url` as it's a general utility crate and not related to Ethereum. This PR prepares it to be spun out into its own crate. I've made the `full` field on `SensitiveUrl` private and instead provided an explicit getter called `.expose_full()`. It's a bit ugly for the diff but I prefer the explicit nature of the getter. I've also added some extra tests and doc strings along with feature gating `Serialize` and `Deserialize` implementations behind the `serde` feature. Co-Authored-By: Mac L <mjladson@pm.me>
This compiles, is there any reason to keep `ecdsa`? CC @jxs Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Self hosted GitHub Runners review and improvements local testnet workflow now uses warpbuild ci runner Co-Authored-By: lemon <snyxmk@gmail.com> Co-Authored-By: antondlr <anton@sigmaprime.io>
Use the recently published `sensitive_url` and remove it from Lighthouse Co-Authored-By: Mac L <mjladson@pm.me>
Fixes sigp#7001. Mostly mechanical replacement of `derivative` attributes with `educe` ones. ### **Attribute Syntax Changes** ```rust // Bounds: = "..." → (...) #[derivative(Hash(bound = "E: EthSpec"))] #[educe(Hash(bound(E: EthSpec)))] // Ignore: = "ignore" → (ignore) #[derivative(PartialEq = "ignore")] #[educe(PartialEq(ignore))] // Default values: value = "..." → expression = ... #[derivative(Default(value = "ForkName::Base"))] #[educe(Default(expression = ForkName::Base))] // Methods: format_with/compare_with = "..." → method(...) #[derivative(Debug(format_with = "fmt_peer_set_as_len"))] #[educe(Debug(method(fmt_peer_set_as_len)))] // Empty bounds: removed entirely, educe can infer appropriate bounds #[derivative(Default(bound = ""))] #[educe(Default)] // Transparent debug: manual implementation (educe doesn't support it) #[derivative(Debug = "transparent")] // Replaced with manual Debug impl that delegates to inner field ``` **Note**: Some bounds use strings (`bound("E: EthSpec")`) for superstruct compatibility (`expected ','` errors). Co-Authored-By: Javier Chávarri <javier.chavarri@gmail.com> Co-Authored-By: Mac L <mjladson@pm.me>
Fork-choice internal checkpoints may not be the ones the network agrees on. If the node starts with checkpoint sync we set the checkpoints to something that does not match the head state at the moment. If checkpoint sync is not used with a finalized checkpoint it can lead to confusion in the code.
This PR makes this oddity explicit. Consumers must choose between the local checkpoints (potentially ahead of the network wide onchain checkpoint), or the actual finality checkpoint of the network.
TODO:
Clarify in Beacon API spec that "finality" tag refers to the network wide finality: https://github.com/ethereum/beacon-APIs/blob/8abac03526770b10ab49be4d186d468629127413/params/index.yaml#L10
Also clarify what the finalized hash for the EL means, if network wide or local https://github.com/ethereum/beacon-APIs/blob/8abac03526770b10ab49be4d186d468629127413/params/index.yaml#L10