[consensus] Add optimistic randomness share verification #18646
Conversation
Force-pushed from a3cf572 to 85cd15a.
Force-pushed from 5c9a772 to 69cf922.
Combine the happy path, corrupt share, and unhappy path scenarios into a single test to avoid redundant swarm startups. The merged test covers: baseline randomness, corrupt share with sufficient valid shares, stall when below threshold, and recovery after restarting a stopped validator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 69cf922 to e8cc7bd.
Use V1 randomness config instead of V2 (default_enabled) so the fast randomness path is disabled and the slow path is always exercised.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```rust
}
if self.total_weight < rand_config.threshold() {
    return Either::Left(self);
}
```
Blocking crypto verification holds RandStore mutex for milliseconds
Medium Severity
pre_aggregate_verify performs expensive multi-pairing cryptographic checks synchronously inside ShareAggregator::try_aggregate, which is called from RandStore::add_share and add_rand_metadata while the RandStore Mutex is held. Before this change, try_aggregate only did a cheap threshold check before spawning. Now the mutex is held for the duration of the batch verification (~1-5ms happy path, much longer on fallback), blocking concurrent store access from the reliable broadcast callback path (ShareAggregateState::add) and the main event loop. The prior code did all crypto verification outside the mutex (at message receipt time).
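To make the concern concrete, here is a minimal sketch of the two locking patterns, using hypothetical Store/Share types rather than the PR's actual RandStore code: the flagged shape runs the batch check while the mutex is held, while the alternative snapshots the shares and verifies outside the lock.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the types discussed above.
#[derive(Clone)]
struct Share;

struct Store {
    pending: Vec<Share>,
}

// Placeholder for the expensive multi-pairing batch check (milliseconds).
fn batch_verify(_shares: &[Share]) -> bool {
    true
}

// Pattern flagged above: the expensive check runs while the mutex is held,
// so every other caller of the store blocks for the whole verification.
fn verify_under_lock(store: &Arc<Mutex<Store>>) -> bool {
    let guard = store.lock().unwrap();
    batch_verify(&guard.pending)
}

// Alternative: snapshot the shares under the lock, then verify outside it,
// keeping the critical section to a cheap clone.
fn verify_outside_lock(store: &Arc<Mutex<Store>>) -> bool {
    let snapshot = {
        let guard = store.lock().unwrap();
        guard.pending.clone()
    };
    batch_verify(&snapshot)
}
```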
```rust
    THREAD_MANAGER.get_non_exe_cpu_pool(),
)
.map_err(|e| anyhow!("Share::aggregate failed with WVUF derive_eval error: {e}"))?;
debug!("WVUF derivation time: {} ms", timer.elapsed().as_millis());
```
nit: replace the debug print with a proper counter?
```rust
let batch_ok = Self::build_apks_and_proofs(&shares_vec, rand_config)
    .and_then(|apks_and_proofs| {
        let proof = WVUF::aggregate_shares(&rand_config.wconfig, &apks_and_proofs);
        let metadata_serialized = bcs::to_bytes(rand_metadata)
```
nit: this can probably be moved out and shared with the aggregate function
Add two new Prometheus histograms to measure randomness verification and aggregation performance:
- aptos_consensus_rand_pre_aggregate_verify_seconds
- aptos_consensus_rand_aggregation_seconds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
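For illustration, a rough sketch of the timer-to-histogram pattern using the plain prometheus and once_cell crates; the PR itself registers its metrics through the repo's own metrics machinery, so the helper function below is an assumption, though the metric name matches the one listed above.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};
use std::time::Instant;

// Histogram named after the metric described in the commit message above.
static PRE_AGGREGATE_VERIFY_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "aptos_consensus_rand_pre_aggregate_verify_seconds",
        "Time spent batch-verifying randomness shares before aggregation"
    )
    .unwrap()
});

// Hypothetical wrapper: time an arbitrary verification closure and record the
// elapsed seconds in the histogram instead of logging them with debug!.
fn pre_aggregate_verify_timed(verify: impl FnOnce() -> bool) -> bool {
    let start = Instant::now();
    let ok = verify();
    PRE_AGGREGATE_VERIFY_SECONDS.observe(start.elapsed().as_secs_f64());
    ok
}
```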
✅ Forge suite
✅ Forge suite
…le parallel pass

Use pairing bilinearity to swap G2 scalar muls for cheaper G1 scalar muls, and fuse the scalar mul + to_affine + miller_loop into a single rayon par_iter to avoid double thread-pool scheduling overhead. Reduces pre_aggregate_verify latency by ~12% for 150 nodes (8.9ms → 7.8ms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
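A toy sketch of the fusion idea, with placeholder stage functions standing in for scalar multiplication, affine conversion, and the Miller loop: one fused par_iter schedules the thread pool once and avoids the intermediate buffer.

```rust
use rayon::prelude::*;

// Hypothetical per-element stages standing in for scalar mul, to_affine,
// and the Miller loop in the real implementation.
fn stage_a(x: u64) -> u64 { x.wrapping_mul(3) }
fn stage_b(x: u64) -> u64 { x ^ 0xdead_beef }
fn stage_c(x: u64) -> u64 { x.rotate_left(7) }

// Two separate parallel passes: each par_iter pays its own scheduling cost
// and materializes an intermediate Vec.
fn two_passes(inputs: &[u64]) -> Vec<u64> {
    let mid: Vec<u64> = inputs.par_iter().map(|&x| stage_b(stage_a(x))).collect();
    mid.par_iter().map(|&x| stage_c(x)).collect()
}

// Fused: one par_iter runs all three stages per element, so the pool is
// scheduled once and no intermediate buffer is allocated.
fn fused_pass(inputs: &[u64]) -> Vec<u64> {
    inputs
        .par_iter()
        .map(|&x| stage_c(stage_b(stage_a(x))))
        .collect()
}
```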


Summary
Adds optimistic randomness share verification to reduce per-share cryptographic overhead. Instead of individually verifying each share on receipt (expensive pairing per share), shares now pass only cheap structural checks on receipt and are batch-verified at aggregation time using a single multi-pairing check. On batch failure, falls back to parallel individual verification to identify and discard bad shares.
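Roughly, the verification shape looks like the sketch below (hypothetical helper names, not the PR's exact code): try one batch check, and only on failure fall back to parallel per-share verification to identify the offending authors.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for an author identity and a randomness share.
type Author = u64;

#[derive(Clone)]
struct Share {
    author: Author,
}

// Placeholders for the cryptographic checks: one multi-pairing over the whole
// batch, and one pairing per share on the fallback path.
fn batch_verify(_shares: &[Share]) -> bool { true }
fn verify_single(_share: &Share) -> bool { true }

// Happy path: a single batch check. On failure, verify shares individually in
// parallel and return the authors whose shares did not verify.
fn pre_aggregate_verify(shares: &[Share]) -> Result<(), Vec<Author>> {
    if batch_verify(shares) {
        return Ok(());
    }
    let bad_authors: Vec<Author> = shares
        .par_iter()
        .filter(|share| !verify_single(share))
        .map(|share| share.author)
        .collect();
    Err(bad_authors)
}
```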
Motivation
Previously, every inbound RandShare was individually verified with a full pairing check on receipt. For N validators, this means N expensive pairing operations per round. With optimistic batch verification, the happy path (no bad shares) requires only a single multi-pairing check at aggregation time — significantly reducing CPU overhead.

Design
Optimistic verification flow
- On share receipt: only cheap structural checks via optimistic_verify() — valid author, certified APK, correct metadata. No cryptographic verification.
- At aggregation time (threshold reached): pre_aggregate_verify() performs batch verification via WVUF::verify_proof() (a single multi-pairing check).
- After bad share removal: if the remaining valid shares still meet the threshold → proceed to aggregation. If below threshold → stay in PendingDecision and wait for more shares.
- Pessimistic set: authors whose shares failed verification are tracked. Future shares from these authors are individually verified on receipt, bypassing the optimistic path (see the sketch after this list).
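A small sketch of the pessimistic-set gating on receipt, using hypothetical types rather than the actual consensus message handler:

```rust
use std::collections::HashSet;

type Author = u64;

struct Share {
    author: Author,
}

// Placeholders: cheap structural checks vs. a full per-share pairing check.
fn structurally_valid(_share: &Share) -> bool { true }
fn verify_single(_share: &Share) -> bool { true }

// Shares from authors that previously failed batch verification are verified
// individually on receipt; everyone else takes the optimistic path.
fn accept_share(share: &Share, pessimistic: &HashSet<Author>) -> bool {
    if !structurally_valid(share) {
        return false;
    }
    if pessimistic.contains(&share.author) {
        return verify_single(share);
    }
    true // optimistic: defer cryptographic verification to aggregation time
}
```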
Design tradeoff: blocking pre-verification vs. async verification with retry
Two approaches were considered for handling verification failures:
Approach A — Async verification with retry (rejected, see #18698): run verification inside the async aggregation spawn. On failure, introduce a new Aggregating pending state, send the result back through a channel, and retry aggregation after removing bad shares. This requires new state machine transitions, a recovery handler, and re-broadcast logic for items stuck in the Aggregating state.

Approach B — Blocking pre-verification (chosen): run pre_aggregate_verify() synchronously before spawning the async aggregation task. Nodes know upfront whether aggregation will succeed. If not enough valid shares remain after removing bad ones, the ShareAggregator simply stays in PendingDecision — no new states, no channels, no retry logic needed.

Approach B was chosen because it is significantly simpler (no new state machine states or recovery paths) while having minimal performance impact — pre_aggregate_verify runs on the non-exe thread pool and only adds latency when bad shares are present (the fallback path). In the happy case (no bad shares), the batch verification is fast and the overhead of running it synchronously is negligible.
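A simplified sketch of what the blocking pre-verification flow in try_aggregate could look like (hypothetical fields and unit weights; the diff snippet earlier in the thread shows the real threshold check): Either::Left keeps the aggregator in PendingDecision, Either::Right hands the verified shares to the spawned aggregation task.

```rust
use either::Either;

// Hypothetical, heavily simplified aggregator; the real ShareAggregator tracks
// per-author weights, metadata, and the rand config threshold.
struct ShareAggregator {
    total_weight: u64,
    shares: Vec<u64>, // stand-ins for shares, each carrying weight 1 here
}

struct RandConfig {
    threshold: u64,
}

// Placeholder for batch verification: returns indices of shares that failed.
fn pre_aggregate_verify(_shares: &[u64]) -> Vec<usize> {
    Vec::new()
}

impl ShareAggregator {
    // Either::Left(self): stay in PendingDecision and wait for more shares.
    // Either::Right(shares): hand the verified shares to async aggregation.
    fn try_aggregate(mut self, config: &RandConfig) -> Either<Self, Vec<u64>> {
        if self.total_weight < config.threshold {
            return Either::Left(self);
        }
        // Synchronous pre-verification: drop bad shares (highest index first so
        // earlier indices stay valid), then re-check the threshold.
        let mut bad = pre_aggregate_verify(&self.shares);
        bad.sort_unstable();
        for idx in bad.into_iter().rev() {
            self.shares.remove(idx);
            self.total_weight = self.total_weight.saturating_sub(1);
        }
        if self.total_weight < config.threshold {
            return Either::Left(self);
        }
        Either::Right(self.shares)
    }
}
```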
Parallelized WVUF verification

The Pinkas WVUF::verify_proof() implementation now accepts a rayon ThreadPool parameter and parallelizes both G2 scalar multiplications and the multi-pairing check. The non-exe CPU pool (16-32 threads) is used to avoid blocking the execution thread pool.
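A minimal sketch of the thread-pool-threading pattern with a simplified signature (the real verify_proof operates on curve points and proofs): the caller supplies the pool and the parallel work is installed on it, so it never runs on the caller's own pool.

```rust
use rayon::prelude::*;
use rayon::ThreadPool;

// Hypothetical per-element check standing in for a pairing/scalar-mul step.
fn check_one(x: &u64) -> bool {
    *x % 2 == 0
}

// The caller passes in a thread pool (e.g. the non-exe CPU pool) and the
// parallel work runs on that pool via install().
fn verify_proof(thread_pool: &ThreadPool, elements: &[u64]) -> bool {
    thread_pool.install(|| elements.par_iter().all(check_one))
}

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .unwrap();
    assert!(verify_proof(&pool, &[2, 4, 6]));
}
```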
Changes
- config/consensus_config.rs: optimistic_rand_share_verification flag (default true)
- consensus/types.rs: pre_aggregate_verify to TShare trait; implement batch verify + parallel fallback for Share; remove redundant verification from Share::aggregate
- consensus/rand_store.rs: pre_aggregate_verify before spawning; remove bad shares and re-check threshold; use non-exe thread pool for aggregation
- consensus/network_messages.rs: optimistic_verify()
- consensus/reliable_broadcast_state.rs: optimistic_verify()
- consensus/epoch_manager.rs: RandConfig
- aptos-dkg/weighted_vuf/traits.rs: thread_pool param to verify_proof
- aptos-dkg/weighted_vuf/pinkas/mod.rs: verify_proof
- aptos-dkg/weighted_vuf/bls/mod.rs: thread_pool param (unused)
- aptos-dkg/benches/weighted_vuf.rs: verify_proof across thread counts
- smoke-test/optimistic_verification.rs
Test plan
- cargo check / clippy / fmt clean

🤖 Generated with Claude Code
Note
Medium Risk
Touches consensus randomness verification and aggregation plus underlying cryptographic verification APIs; correctness/performance regressions could stall randomness-dependent progress if the new optimistic/batch path misbehaves.
Overview
Adds an optimistic randomness-share verification path: incoming RandShares now do only cheap structural validation on receipt and defer cryptographic verification to a batch check when enough shares exist to aggregate.

On aggregation, the share store runs a new pre_aggregate_verify step to batch-verify shares (and, on failure, fall back to parallel individual verification), drops invalid authors, tracks offenders in a pessimistic set for future per-share verification, and adds new metrics for verification/aggregation latency. This also threads an aggregate public key + config flag through RandConfig, updates WVUF verify_proof to accept a rayon thread pool with a more parallel Pinkas implementation, and adds unit + smoke tests (including failpoint-based corrupt-share scenarios) to cover happy-path, fallback, stall, and recovery behavior.

Written by Cursor Bugbot for commit 05906c9.
pre_aggregate_verifystep to batch-verify shares (and, on failure, fall back to parallel individual verification), drops invalid authors, tracks offenders in a pessimistic set for future per-share verification, and adds new metrics for verification/aggregation latency. This also threads an aggregate public key + config flag throughRandConfig, updates WVUFverify_proofto accept a rayon thread pool with a more parallel Pinkas implementation, and adds unit + smoke tests (including failpoint-based corrupt-share scenarios) to cover happy-path, fallback, stall, and recovery behavior.Written by Cursor Bugbot for commit 05906c9. This will update automatically on new commits. Configure here.