fix: optimize the signing of attestations #834
petarjuki7 wants to merge 8 commits into sigp:unstable from
Conversation
Claude finished @petarjuki7's task in 2m 30s

**Code Review Complete**

This PR implements a solid optimization by batching attestation signing at the SSV committee level. The refactoring correctly identifies that validators in the same committee can share a single QBFT consensus decision, reducing redundant consensus rounds. The code is well-structured and the parallel processing approach is appropriate. However, there are several issues that need to be addressed before merging.

**Critical Issues**

1. Panic Risk: Unsafe Array Indexing
anchor/validator_store/src/lib.rs (Outdated)

```diff
-let (validator, cluster) = self.get_validator_and_cluster(validator_pubkey)?;
-let voting_context_tx = self.get_voting_context(attestation.data().slot).await?;
+// Look up cluster for QBFT consensus (all validators share the same cluster)
+let (_, cluster) = self.get_validator_and_cluster(attestations[0].1)?;
```
Assumption to Verify: This assumes all validators grouped by committee_id belong to the same SSV cluster. Is this guaranteed by the grouping logic in sign_attestations? If a committee could contain validators from different clusters, using attestations[0] would be incorrect.
Consider adding a debug assertion or comment documenting this invariant.
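A sketch of such a debug assertion (the cluster ids below are invented stand-ins for whatever `get_validator_and_cluster` returns; note that a later comment in this thread points out the invariant does not actually hold):

```rust
fn main() {
    // Stand-in cluster ids, one looked up per validator in the batch.
    let clusters = [7u64, 7, 7];

    // Document the assumed invariant: every validator grouped under one
    // committee_id maps to the same cluster. Only checked in debug builds.
    debug_assert!(
        clusters.iter().all(|c| *c == clusters[0]),
        "all validators in a committee batch must share one cluster"
    );
    println!("ok");
}
```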
The assumption that all validators in the same committee share the same cluster is in fact incorrect.
But the cluster is actually not needed if you modify decide_instance. decide_instance actually only needs the cluster members, which are in fact equal to the committee members. :)
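The point can be sketched like this (a hypothetical `OperatorId` type and `quorum_size` helper, not the real `decide_instance` signature): everything the QBFT instance structurally needs from the cluster is who participates, and, derived from that, the quorum size, which for n = 3f + 1 members is the standard 2f + 1.

```rust
// Hypothetical sketch: pass only the member set to the consensus machinery,
// since the committee members equal the cluster members anyway.
struct OperatorId(u64);

fn quorum_size(members: &[OperatorId]) -> usize {
    // Standard QBFT fault tolerance: n = 3f + 1 members tolerate f faults,
    // and a quorum is 2f + 1.
    let n = members.len();
    let f = (n - 1) / 3;
    2 * f + 1
}

fn main() {
    let members: Vec<OperatorId> = (0u64..4).map(OperatorId).collect();
    assert_eq!(quorum_size(&members), 3);
    println!("quorum = {}", quorum_size(&members));
}
```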
anchor/validator_store/src/lib.rs (Outdated)

```rust
        signing_root,
        slot,
    )
    .await?;
```
Attestation validation is very time sensitive. In this loop over attestations, we .await each signature before moving on to the next, i.e. sequential processing, whereas it was being done in parallel via join_all before. We could turn this back into parallel futures with join_all as a fix.

I think at that point, though, we'd just be recreating the pre-PR code, but with some extra steps and complexity.

Other thoughts:

The gains from committee grouping are fewer calls to get_validator_and_cluster, get_voting_context, and similar, but these are in-memory lookups. And decide_instance is already engineered to handle multiple calls to the same instance gracefully.

Given that, I'm not sure the added complexity (sign_committee_attestations, the committee-grouping HashMap, the AttestationData type alias) justifies itself. What do you think about keeping the per-validator approach? And definitely let me know if I'm overlooking some solid gains we'd get from this PR.
Thanks for the comment! The original idea was that since operations like deciding on a vote and collecting the signatures are performed on a per-committee basis, it would make sense to group them by committee. Also, you wouldn't need a QBFT instance per validator, as it's the same for everyone in the committee. But I agree decide_instance is engineered to handle multiple calls to the same instance.

Some overhead is avoided (fewer DashMap lookups, fewer oneshot channels, etc.), but these are probably cheap operations.

Would like to hear @dknopik's input; if we agree the issue wasn't that significant, I am okay with closing the PR.
Actually this made me realize that the Pre-PR code is also flawed: if even one involved SSV committee is not able to sign (e.g. due to offline operators), the other signing futures will complete but the results will not be processed until those other signings time out. :/
```diff
 let completed = self
     .qbft_manager
     .decide_instance(
         CommitteeInstanceId {
-            committee: cluster.committee_id(),
-            instance_height: attestation.data().slot.as_usize().into(),
+            committee: committee_id,
+            instance_height: slot.as_usize().into(),
         },
         BeaconVote {
-            block_root: attestation.data().beacon_block_root,
-            source: attestation.data().source,
-            target: attestation.data().target,
+            block_root: first_att_data.beacon_block_root,
+            source: first_att_data.source,
+            target: first_att_data.target,
         },
-        self.create_beacon_vote_validator(
-            attestation.data().slot,
-            validator_attestation_committees,
-        ),
+        self.create_beacon_vote_validator(slot, validator_attestation_committees),
         timeout_mode,
         &cluster,
     )
     .await
     .map_err(SpecificError::from)?;
 drop(timer);

 let data = match completed {
     Completed::TimedOut => return Err(Error::SpecificError(SpecificError::Timeout)),
     Completed::Success(data) => data,
 };
```
We can move this outside the loop
anchor/validator_store/src/lib.rs (Outdated)

```rust
let signature = match self
    .collect_signature(
        PartialSignatureKind::PostConsensus,
        Role::Committee,
        CollectionMode::Committee {
            num_signatures_to_collect,
            base_hash: data_hash,
        },
        &validator,
        &cluster,
        signing_root,
        slot,
    )
    .await
```
This will block, as it waits for the collect_signature calls for the other validators, which would only happen in the next loop iterations, so this deadlocks.
Therefore, we need a variant that can register for multiple validators.
Likely, this is why the testnet fails.
Issue Addressed
Addresses issue #819
Proposed Changes
Additional Info
Follow-up to #813. The sign_attestations batch signing was delegating to single sign_attestation calls, which is inefficient since validators in the same committee share the same QBFT decision.