
fix: optimize the signing of attestations #834

Open

petarjuki7 wants to merge 8 commits into sigp:unstable from petarjuki7:sign_attestations_optimization

Conversation

@petarjuki7
Member

Issue Addressed

Addresses issue #819

Proposed Changes

  • Batch attestation signing by SSV committee - runs QBFT consensus once per committee instead of once per validator
  • Process committees in parallel with join_all
  • Added AttestationData type alias for cleaner code
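The grouping step described above can be sketched roughly as follows. This is an illustrative stand-in, not the crate's actual code: `PendingAttestation`, its fields, and `group_by_committee` are hypothetical names.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real anchor types; names are illustrative only.
type CommitteeId = u64;

#[derive(Debug, Clone)]
struct PendingAttestation {
    committee_id: CommitteeId,
    validator_index: u64,
}

// Group attestations so each SSV committee can run QBFT consensus once,
// instead of once per validator.
fn group_by_committee(
    attestations: Vec<PendingAttestation>,
) -> HashMap<CommitteeId, Vec<PendingAttestation>> {
    let mut groups: HashMap<CommitteeId, Vec<PendingAttestation>> = HashMap::new();
    for att in attestations {
        groups.entry(att.committee_id).or_default().push(att);
    }
    groups
}
```

Each group would then be handed to a single per-committee signing call, with the groups themselves processed in parallel.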

Additional Info

Follow-up to #813. The sign_attestations batch signing was delegating to individual sign_attestation calls, which is inefficient because validators in the same committee share the same QBFT decision.

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Feb 17, 2026

Claude finished @petarjuki7's task in 2m 30s


Code Review Complete

This PR implements a solid optimization by batching attestation signing at the SSV committee level. The refactoring correctly identifies that validators in the same committee can share a single QBFT consensus decision, reducing redundant consensus rounds. The code is well-structured and the parallel processing approach is appropriate.

However, there are several issues that need to be addressed before merging.


Critical Issues

1. Panic Risk: Unsafe Array Indexing ⚠️

anchor/validator_store/src/lib.rs:1301-1302 — The function assumes attestations is non-empty and directly indexes [0] without validation. If called with an empty vector, this will panic.

Fix: Add empty vector check at the start of sign_committee_attestations.
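A minimal sketch of such a guard; `SignError::EmptyBatch` and the function name are hypothetical, not the crate's actual types:

```rust
// A minimal guard replacing the unchecked `attestations[0]` indexing.
// `SignError::EmptyBatch` is a hypothetical error variant for illustration.
#[derive(Debug, PartialEq)]
enum SignError {
    EmptyBatch,
}

// Return an error instead of panicking when the batch is empty.
fn first_committee_entry<T: Clone>(attestations: &[T]) -> Result<T, SignError> {
    attestations.first().cloned().ok_or(SignError::EmptyBatch)
}
```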

2. Redundant Lookups Inside Loop

anchor/validator_store/src/lib.rs:1357 — The function looks up validator and cluster metadata inside the loop for each validator, but all validators in the same committee share the same cluster (retrieved at line 1305).

Impact: Unnecessary O(n) database lookups when O(1) is sufficient for cluster.

3. Missing Error Context for Committee Failures

anchor/validator_store/src/lib.rs:2612-2614 — When a committee fails, the error log doesn't include the committee_id, making debugging difficult when multiple committees are processed in parallel.
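A sketch of what attaching the committee context could look like. The message format and function are illustrative only; the real code presumably uses structured logging rather than plain strings:

```rust
// Illustrative only: builds a failure message that carries the committee_id,
// so parallel committee failures can be told apart in the logs.
fn committee_failure_msg(committee_id: u64, err: &str) -> String {
    format!("failed to sign committee attestations (committee_id={committee_id}): {err}")
}
```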


Design Concerns

4. Loss of Specific Error Information

The refactoring changes error handling from per-validator to per-committee. Previously, if a validator pubkey was unknown, the specific pubkey was logged. Now an entire committee batch fails with a single generic error message, losing that diagnostic information.

Suggestion: Consider collecting partial successes within a committee batch.

5. Assumption: All Validators in Same Committee Share Cluster

anchor/validator_store/src/lib.rs:1305 — The code assumes all validators grouped by committee_id belong to the same SSV cluster. This assumption should be validated or documented as an invariant.

Question: Is it guaranteed that all validators with the same committee_id have the same cluster_id?


Performance

6. Performance Improvement Validated

The optimization is sound:

  • Before: N validators → N QBFT consensus rounds
  • After: N validators in M committees → M QBFT consensus rounds (M << N typically)

The use of join_all for parallel committee processing is appropriate.


Testing

7. Missing Tests for New Code Path ⚠️

No tests were added to verify:

  • Committee grouping logic works correctly
  • Empty attestation vector handling
  • Partial failure scenarios (one committee succeeds, another fails)
  • Validators from multiple committees are processed correctly
  • The optimization actually reduces QBFT calls as intended

Recommendation: Add integration tests for the new batching behavior.


Documentation

8. Good Documentation

The doc comment on sign_committee_attestations is clear and the AttestationData<E> type alias improves readability.


Minor Issues

9. Inconsistent Return Type

sign_committee_attestations returns Vec<(u64, Attestation<E>, PublicKeyBytes)>, but the PublicKeyBytes is immediately discarded. Consider simplifying the return type.

10. Non-Deterministic Ordering

The HashMap iteration doesn't guarantee deterministic ordering of committee processing. Consider using BTreeMap if deterministic ordering aids debugging.
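A sketch of the difference, assuming committee ids are plain u64 keys: BTreeMap iterates in sorted key order, so the processing order is reproducible across runs.

```rust
use std::collections::BTreeMap;

// Deduplicate committee ids and return them in sorted (deterministic) order,
// unlike HashMap, whose iteration order varies between runs.
fn ordered_committee_ids(ids: &[u64]) -> Vec<u64> {
    let mut map: BTreeMap<u64, ()> = BTreeMap::new();
    for &id in ids {
        map.insert(id, ());
    }
    map.into_keys().collect()
}
```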


Recommendations

Must Fix Before Merge:

  1. Add empty vector check in sign_committee_attestations
  2. Add committee_id to error logs
  3. Add tests for the new code path

Should Consider:
4. Optimize redundant get_validator_and_cluster calls inside loop
5. Document or validate the cluster-sharing assumption
6. Improve partial failure handling

Nice to Have:
7. Use BTreeMap for deterministic ordering
8. Simplify return type to match expected output


I've added inline comments on the specific lines that need attention.

- let (validator, cluster) = self.get_validator_and_cluster(validator_pubkey)?;
- let voting_context_tx = self.get_voting_context(attestation.data().slot).await?;
+ // Look up cluster for QBFT consensus (all validators share the same cluster)
+ let (_, cluster) = self.get_validator_and_cluster(attestations[0].1)?;


Assumption to Verify: This assumes all validators grouped by committee_id belong to the same SSV cluster. Is this guaranteed by the grouping logic in sign_attestations? If a committee could contain validators from different clusters, using attestations[0] would be incorrect.

Consider adding a debug assertion or comment documenting this invariant.
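One way to validate rather than assume the invariant, sketched with plain u64 cluster ids (the helper names are illustrative):

```rust
// True if every id in the slice is equal to the first (vacuously true if empty).
fn same_cluster(cluster_ids: &[u64]) -> bool {
    cluster_ids.windows(2).all(|w| w[0] == w[1])
}

// In debug builds, the signing path could assert the invariant before
// trusting `attestations[0]`; compiled out in release builds.
fn check_committee_invariant(cluster_ids: &[u64]) {
    debug_assert!(
        same_cluster(cluster_ids),
        "validators grouped by committee_id must share one cluster"
    );
}
```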

Member

"All validators in the same committee share the same cluster" is in fact incorrect.

But the cluster is actually not needed if you modify decide_instance. decide_instance actually only needs the cluster members, which are in fact equal to the committee members. :)

        signing_root,
        slot,
    )
    .await?;
Member

Attestation validation is very time sensitive. In this loop over attestations, we .await each signature before moving on to the next, i.e. sequential processing, whereas it was done in parallel via join_all before. We could turn this back into parallel futures with join_all as a fix.

I think at that point, though, we'd just be recreating the pre-PR code, but with some extra steps and complexity.

Other thoughts:
The gains from committee grouping are fewer calls to get_validator_and_cluster, get_voting_context, and similar, but these are in-memory lookups. And decide_instance is already engineered to handle multiple calls to the same instance gracefully.

Given that, I'm not sure the added complexity (sign_committee_attestations, the committee grouping HashMap, the AttestationData type alias) justifies itself. What do you think about keeping the per-validator approach? And definitely let me know if I'm overlooking some solid gains we'd get from this PR.

Member Author

Thanks for the comment! The original idea was that since operations like deciding on a vote and collecting the signatures are performed on a per-committee basis, it would make sense to group them by committee. Also, you wouldn't need a QBFT instance per validator, as it is the same for everyone in the committee. But I agree that decide_instance is engineered to handle multiple calls to the same instance.
Some overhead is avoided, like fewer DashMap lookups and fewer oneshot channels, but these are probably cheap operations.

I'd like to hear @dknopik's input; if we agree the issue wasn't that significant, I am okay with closing the PR.

Member

Actually, this made me realize that the pre-PR code is also flawed: if even one involved SSV committee is not able to sign (e.g. due to offline operators), the other signing futures will complete, but their results will not be processed until the failed signings time out. :/

Comment on lines 1318 to 1341
let completed = self
    .qbft_manager
    .decide_instance(
        CommitteeInstanceId {
-           committee: cluster.committee_id(),
-           instance_height: attestation.data().slot.as_usize().into(),
+           committee: committee_id,
+           instance_height: slot.as_usize().into(),
        },
        BeaconVote {
-           block_root: attestation.data().beacon_block_root,
-           source: attestation.data().source,
-           target: attestation.data().target,
+           block_root: first_att_data.beacon_block_root,
+           source: first_att_data.source,
+           target: first_att_data.target,
        },
-       self.create_beacon_vote_validator(
-           attestation.data().slot,
-           validator_attestation_committees,
-       ),
+       self.create_beacon_vote_validator(slot, validator_attestation_committees),
        timeout_mode,
        &cluster,
    )
    .await
    .map_err(SpecificError::from)?;
drop(timer);

let data = match completed {
    Completed::TimedOut => return Err(Error::SpecificError(SpecificError::Timeout)),
    Completed::Success(data) => data,
};
Member

We can move this outside the loop.


Comment on lines 1374 to 1387
let signature = match self
    .collect_signature(
        PartialSignatureKind::PostConsensus,
        Role::Committee,
        CollectionMode::Committee {
            num_signatures_to_collect,
            base_hash: data_hash,
        },
        &validator,
        &cluster,
        signing_root,
        slot,
    )
    .await
@dknopik (Member) commented Feb 17, 2026

This will block, as it waits for collect_signature calls for the other validators, which would only happen in later iterations of the loop, so this deadlocks.

Therefore, we need a variant that can register for multiple validators.

Likely, this is why the testnet fails.

