Downscore and retry custody failures#7510
dapplion wants to merge 29 commits into peerdas-devnet-7
Conversation
```rust
}
for (id, result) in self.network.continue_custody_by_range_requests() {
    self.on_custody_by_range_result(id, result);
}
```
Every interval (15 sec), we call `continue_custody_by_range` / `by_root` requests, which will cause a request to error if it has been alive for too long. This allows requests that do not yet have enough custody peers to avoid erroring immediately.
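A minimal sketch of this idle-with-grace-period behaviour. The type, field, and constant names here (`CustodyByRangeRequest`, `MAX_IDLE`, `continue_request`) are illustrative assumptions, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

/// Hypothetical grace period a custody request may stay idle while waiting
/// for custody peers before the periodic tick fails it.
const MAX_IDLE: Duration = Duration::from_secs(60);

struct CustodyByRangeRequest {
    created_at: Instant,
    has_custody_peers: bool,
}

impl CustodyByRangeRequest {
    /// Called on every sync interval tick (~15 s per the comment above).
    /// Errors only once the request has been alive too long without peers.
    fn continue_request(&self, now: Instant) -> Result<(), &'static str> {
        if self.has_custody_peers {
            return Ok(()); // progress the request normally
        }
        if now.duration_since(self.created_at) > MAX_IDLE {
            Err("no custody peers within grace period")
        } else {
            Ok(()) // stay idle; a peer may join before the next tick
        }
    }
}

fn main() {
    let start = Instant::now();
    let req = CustodyByRangeRequest { created_at: start, has_custody_peers: false };
    // Within the grace period the request is left idle rather than errored.
    assert!(req.continue_request(start + Duration::from_secs(15)).is_ok());
    // After the grace period the periodic tick surfaces the error.
    assert!(req.continue_request(start + Duration::from_secs(61)).is_err());
}
```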
```rust
}
for (id, result) in self.network.continue_custody_by_range_requests() {
    self.on_custody_by_range_result(id, result);
}
```
Every time a peer joins, attempt to progress `custody_by_root` and `custody_by_range` requests.
```rust
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
```
Logic moved to `beacon_node/network/src/sync/network_context/block_components_by_range.rs`.
```rust
    ForkContext, Hash256, SignedBeaconBlock, SignedBeaconBlockHeader, Slot,
};

pub mod custody;
```
Renamed the existing `custody` module to `custody_by_root` and added a new one, `custody_by_range`.
```rust
}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
```
Changed this signature for tests, to have access to all active RPC requests.
```rust
        Ok((requests, column_to_peer_map))
    })
    .transpose()?;
```
`network_context` no longer spawns `_by_range` requests; this logic now lives inside `BlockComponentsByRangeRequest`.
```rust
blocks_by_range_request:
    ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
},
```
Maintains the same behaviour for mainnet:
- deneb: issue blocks + blobs requests at the same time
- fulu: issue blocks request first, then columns
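The fork-dependent strategy above can be sketched as a small state enum. This is an illustrative sketch only; `ComponentRequests` and its variants are assumed names, not the PR's actual types:

```rust
/// Sketch of the fork-dependent download strategy described above.
#[derive(Debug, PartialEq)]
enum ComponentRequests {
    /// Deneb: blocks and blobs requests are issued at the same time.
    BlocksAndBlobs,
    /// Fulu (PeerDAS): issue the blocks request first; column requests
    /// follow only after the blocks response is known.
    BlocksThenColumns { blocks_received: bool },
}

impl ComponentRequests {
    fn should_request_columns(&self) -> bool {
        matches!(self, ComponentRequests::BlocksThenColumns { blocks_received: true })
    }
}

fn main() {
    // Deneb never issues a separate, deferred column request.
    let deneb = ComponentRequests::BlocksAndBlobs;
    assert!(!deneb.should_request_columns());

    // Fulu waits for blocks before requesting columns.
    let mut fulu = ComponentRequests::BlocksThenColumns { blocks_received: false };
    assert!(!fulu.should_request_columns());
    fulu = ComponentRequests::BlocksThenColumns { blocks_received: true };
    assert!(fulu.should_request_columns());
}
```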
```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
    /// No peer available matching the required criteria
    NoPeer(NoPeerError),
```
Requests no longer error on send if they don't have peers. Instead, `custody_by_root` and `custody_by_range` requests are left idle for some time, waiting for peers.
jimmygchen left a comment:
@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.
```rust
    blocks: Vec<RpcBlock<T::EthSpec>>,
) -> ProcessingResult {
    // Account for one more request to this peer
    // TODO(das): this code assumes that we do a single request per peer per RpcBlock
```
This would be true in the case below:

- peerA: block
- peerB: col_1, col_2
- peerC: col_2, col_3

where we get 1 request for each peer. BUT for this scenario:

- peerA: block, col_5
- peerB: col_1, col_2
- peerC: col_2, col_3

we still count 1 for each peer, which isn't fully correct. Or we could just use `BatchPeers` and count the block peer separately and correctly. Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
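To make the over-count concrete, here is a small sketch that counts the actual requests per peer from a block peer plus a column-to-peer assignment. The function name `requests_per_peer` and its signature are hypothetical, chosen only for illustration:

```rust
use std::collections::HashMap;

/// Count actual RPC requests per peer: one blocks_by_range request to the
/// block peer, plus one data_columns_by_range request per distinct column
/// peer (regardless of how many columns that peer serves).
fn requests_per_peer(
    block_peer: &str,
    column_to_peer: &HashMap<u64, &str>,
) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    *counts.entry(block_peer.to_string()).or_default() += 1;
    let mut column_peers: Vec<&str> = column_to_peer.values().copied().collect();
    column_peers.sort();
    column_peers.dedup();
    for peer in column_peers {
        *counts.entry(peer.to_string()).or_default() += 1;
    }
    counts
}

fn main() {
    // Second scenario above: peerA serves the block AND col_5.
    let columns = HashMap::from([(5u64, "peerA"), (1, "peerB"), (2, "peerB"), (3, "peerC")]);
    let counts = requests_per_peer("peerA", &columns);
    // peerA actually receives two requests (block + columns), so the
    // "one request per peer" assumption under-counts it.
    assert_eq!(counts["peerA"], 2);
    assert_eq!(counts["peerB"], 1);
    assert_eq!(counts["peerC"], 1);
}
```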
```rust
batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
```
We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.
After discussing with @pawanjay176 we decided not to track download failures. We rely on randomness to select a different peer, as we expect the syncing chains to have a decent number of peers. For the custody by range requests there's an internal `failed_peers` set that de-prioritizes peers with network errors. The `failed_peers` set at the batch level is still used to track peers that sent blocks or columns that failed processing, ensuring that we retry from a different peer.
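A sketch of that de-prioritisation: prefer peers outside the failed set, and only fall back to a failed peer when nobody else is available. The function name is hypothetical, and the real code selects randomly among the healthy candidates rather than taking the first:

```rust
use std::collections::HashSet;

/// Prefer peers without prior network errors; fall back to a failed peer
/// only if no other peer is available, so the request can still retry.
fn select_peer<'a>(peers: &'a [&'a str], failed: &HashSet<&str>) -> Option<&'a str> {
    peers
        .iter()
        .find(|p| !failed.contains(*p))
        .or_else(|| peers.first())
        .copied()
}

fn main() {
    let failed: HashSet<&str> = HashSet::from(["peerA"]);
    // A healthy peer is preferred over one with network errors.
    assert_eq!(select_peer(&["peerA", "peerB"], &failed), Some("peerB"));
    // If every candidate has failed, we still retry rather than give up.
    assert_eq!(select_peer(&["peerA"], &failed), Some("peerA"));
    // No peers at all: nothing to select.
    assert_eq!(select_peer(&[], &failed), None);
}
```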
I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling with reviewing this in its current state, with potential missing pieces and a bunch of outstanding TODOs. I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once it's complete. What do you think?
Force-pushed 692ad81 to 42ef88b
Force-pushed af3d678 to c6b39e9
I've just merged latest. I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
Looks like it's struggling with backfill - the link below is a test run on Sepolia - both nodes transitioned to. On PeerDAS, the supernode struggles to sync to head, but the fullnode did complete backfilling 1000 slots in 506 seconds. Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
@dapplion Would you mind taking a look at the failing CI tests in the PR?
…p potentially faulty peers
Not sure if this is ready for review? It's crashing on mainnet with
Hi @dapplion, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time. |
Issue Addressed
Partially addresses
Proposed Changes
TBD
1000 lines are for new unit tests :)
Questions / TODO
Tests to add
Future TODOs
- `subscribe_all_data_column_subnets` to determine `is_supernode`
- `on_xxx_response`, suggestion -> Downscore and retry custody failures #7510 (comment)
- `add_peer_*` functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
- `requests_per_peer` to ensure that eventually we fetch data from all peers in the chain