
Conversation

@danielxiangzl (Contributor) commented on Feb 6, 2026

Summary

Adds optimistic randomness share verification to reduce per-share cryptographic overhead. Instead of individually verifying each share on receipt (an expensive pairing per share), shares now pass only cheap structural checks on receipt and are batch-verified at aggregation time with a single multi-pairing check. If the batch check fails, the node falls back to parallel individual verification to identify and discard bad shares.

Motivation

Previously, every inbound RandShare was individually verified with a full pairing check on receipt. For N validators, this means N expensive pairing operations per round. With optimistic batch verification, the happy path (no bad shares) requires only a single multi-pairing check at aggregation time — significantly reducing CPU overhead.

Design

Optimistic verification flow

  1. On share receipt: Only cheap structural checks via optimistic_verify() — valid author, certified APK, correct metadata. No cryptographic verification.

  2. At aggregation time (threshold reached): pre_aggregate_verify() performs batch verification (see the sketch after this list):

    • Aggregate all shares and run WVUF::verify_proof() (single multi-pairing check)
    • If batch passes → proceed to aggregation (happy path)
    • If batch fails → fall back to parallel individual verification via rayon, identify bad shares, remove them, add authors to pessimistic set

  3. After bad share removal: If remaining valid shares still meet threshold → proceed to aggregation. If below threshold → stay in PendingDecision and wait for more shares.

  4. Pessimistic set: Authors whose shares failed verification are tracked. Future shares from these authors are individually verified on receipt, bypassing the optimistic path.
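
A rough Rust sketch of steps 2-4 above (the helper names batch_verify, verify_share, and pessimistic_set are illustrative stand-ins, not the actual TShare / RandStore APIs):

```rust
use rayon::prelude::*;
use std::collections::HashSet;

/// Illustrative share type; the real type is Share in consensus/types.rs.
struct Share {
    author: u64,
}

/// Try one cheap batch (multi-pairing) verification first; only on failure
/// verify shares individually in parallel to find and evict the offenders.
fn pre_aggregate_verify_sketch(
    shares: &mut Vec<Share>,
    pessimistic_set: &mut HashSet<u64>,
    batch_verify: impl Fn(&[Share]) -> bool,
    verify_share: impl Fn(&Share) -> bool + Sync,
) {
    if batch_verify(shares.as_slice()) {
        return; // happy path: a single multi-pairing check, nothing to remove
    }
    // Fallback: verify each share individually in parallel, collect bad authors.
    let bad_authors: HashSet<u64> = shares
        .par_iter()
        .filter_map(|share| {
            if verify_share(share) {
                None
            } else {
                Some(share.author)
            }
        })
        .collect();
    // Future shares from these authors bypass the optimistic path on receipt.
    pessimistic_set.extend(bad_authors.iter().copied());
    shares.retain(|share| !bad_authors.contains(&share.author));
    // The caller re-checks the weight threshold afterwards; if it is no longer
    // met, the aggregator stays in PendingDecision and waits for more shares.
}
```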

Design tradeoff: blocking pre-verification vs. async verification with retry

Two approaches were considered for handling verification failures:

Approach A — Async verification with retry (rejected, see #18698): Run verification inside the async aggregation spawn. On failure, introduce a new Aggregating pending state, send the result back through a channel, and retry aggregation after removing bad shares. This requires new state machine transitions, a recovery handler, and re-broadcast logic for items stuck in the Aggregating state.

Approach B — Blocking pre-verification (chosen): Run pre_aggregate_verify() synchronously before spawning the async aggregation task. Nodes know upfront whether aggregation will succeed. If not enough valid shares remain after removing bad ones, the ShareAggregator simply stays in PendingDecision — no new states, no channels, no retry logic needed.

Approach B was chosen because it is significantly simpler (no new state machine states or recovery paths) while having minimal performance impact — pre_aggregate_verify runs on the non-exe thread pool and only adds latency when bad shares are present (the fallback path). In the happy case (no bad shares), the batch verification is fast and the overhead of running it synchronously is negligible.
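
A minimal sketch of where the blocking call sits under Approach B (struct and method names mirror the description above but are illustrative, not the exact signatures):

```rust
/// Illustrative aggregator; the real type is ShareAggregator in rand_store.rs.
struct ShareAggregatorSketch {
    total_weight: u64,
    threshold: u64,
}

impl ShareAggregatorSketch {
    /// Batch-verify collected shares; on failure, drop bad ones (weight shrinks).
    fn pre_aggregate_verify(&mut self) { /* see the flow sketch above */ }

    /// Hand the actual WVUF aggregation off to a background thread pool.
    fn spawn_aggregation(&self) { /* async spawn in the real code */ }

    /// Verify synchronously before spawning, so the node knows up front
    /// whether aggregation can succeed; otherwise it simply keeps waiting.
    fn try_aggregate(&mut self) {
        if self.total_weight < self.threshold {
            return; // below threshold: stay in PendingDecision
        }
        self.pre_aggregate_verify();
        if self.total_weight < self.threshold {
            return; // bad shares dropped us below threshold: wait for more
        }
        self.spawn_aggregation();
    }
}
```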

Parallelized WVUF verification

The Pinkas WVUF::verify_proof() implementation now accepts a rayon ThreadPool parameter and parallelizes both G2 scalar multiplications and the multi-pairing check. The non-exe CPU pool (16-32 threads) is used to avoid blocking the execution thread pool.
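
A hedged sketch of the parallelization pattern (placeholder arithmetic stands in for the actual group operations and multi-pairing):

```rust
use rayon::prelude::*;
use rayon::ThreadPool;

// Placeholder types: the real code operates on G1/G2 points and field scalars.
type Scalar = u64;
type Point = u64;

fn scalar_mul(point: Point, scalar: Scalar) -> Point {
    point.wrapping_mul(scalar) // stand-in for an elliptic-curve scalar mul
}

/// The caller passes a rayon ThreadPool (the non-exe CPU pool in practice),
/// and the per-share scalar multiplications run inside pool.install so the
/// execution thread pool is never blocked.
fn verify_proof_sketch(pool: &ThreadPool, bases: &[Point], scalars: &[Scalar]) -> bool {
    let terms: Vec<Point> = pool.install(|| {
        bases
            .par_iter()
            .zip(scalars.par_iter())
            .map(|(base, scalar)| scalar_mul(*base, *scalar))
            .collect()
    });
    // In the real implementation these terms feed a single multi-pairing check.
    !terms.is_empty()
}
```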

Changes

| File | Change |
| --- | --- |
| config/consensus_config.rs | Add optimistic_rand_share_verification flag (default true) |
| consensus/types.rs | Add pre_aggregate_verify to TShare trait; implement batch verify + parallel fallback for Share; remove redundant verification from Share::aggregate |
| consensus/rand_store.rs | Call pre_aggregate_verify before spawning; remove bad shares and re-check threshold; use non-exe thread pool for aggregation |
| consensus/network_messages.rs | Switch share receipt to optimistic_verify() |
| consensus/reliable_broadcast_state.rs | Switch RB share receipt to optimistic_verify() |
| consensus/epoch_manager.rs | Thread aggregate PK and config flag into RandConfig |
| aptos-dkg/weighted_vuf/traits.rs | Add thread_pool param to verify_proof |
| aptos-dkg/weighted_vuf/pinkas/mod.rs | Parallelize G2 scalar muls + multi-pairing in verify_proof |
| aptos-dkg/weighted_vuf/bls/mod.rs | Accept thread_pool param (unused) |
| aptos-dkg/benches/weighted_vuf.rs | Benchmark verify_proof across thread counts |
| smoke-test/optimistic_verification.rs | Combined smoke test: happy path, corrupt share fallback, chain stall, recovery |
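
For illustration, the new flag could look roughly like this in the consensus config (field name from the table above; the surrounding struct and serde details are assumptions):

```rust
#[derive(Clone, Debug)]
pub struct RandShareVerificationConfigSketch {
    /// When true, incoming randomness shares pass only cheap structural checks
    /// on receipt and are batch-verified at aggregation time.
    pub optimistic_rand_share_verification: bool,
}

impl Default for RandShareVerificationConfigSketch {
    fn default() -> Self {
        Self {
            optimistic_rand_share_verification: true, // default true per this PR
        }
    }
}
```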

Test plan

  • 9 unit tests covering: happy path, bad share fallback, optimistic skip, pessimistic set, disabled mode, insufficient valid shares, pre_aggregate_verify (happy/bad/non-optimistic)
  • 1 combined smoke test with 4 phases: happy path → corrupt share with sufficient shares → chain stall (below threshold) → recovery after validator restart
  • All existing randomness tests pass
  • cargo check / clippy / fmt clean

🤖 Generated with Claude Code


Note

Medium Risk
Touches consensus randomness verification and aggregation plus underlying cryptographic verification APIs; correctness/performance regressions could stall randomness-dependent progress if the new optimistic/batch path misbehaves.

Overview
Adds an optimistic randomness-share verification path: incoming RandShares now do only cheap structural validation on receipt and defer cryptographic verification to a batch check when enough shares exist to aggregate.

On aggregation, the share store runs a new pre_aggregate_verify step to batch-verify shares (and, on failure, fall back to parallel individual verification), drops invalid authors, tracks offenders in a pessimistic set for future per-share verification, and adds new metrics for verification/aggregation latency. This also threads an aggregate public key + config flag through RandConfig, updates WVUF verify_proof to accept a rayon thread pool with a more parallel Pinkas implementation, and adds unit + smoke tests (including failpoint-based corrupt-share scenarios) to cover happy-path, fallback, stall, and recovery behavior.

Written by Cursor Bugbot for commit 05906c9. This will update automatically on new commits.

@danielxiangzl danielxiangzl added the CICD:run-e2e-tests label on Feb 7, 2026

@zekun000 zekun000 force-pushed the danielxiangzl/rand-optimistic-verify branch from a3cf572 to 85cd15a on February 7, 2026 01:46

@danielxiangzl danielxiangzl force-pushed the danielxiangzl/rand-optimistic-verify branch from 5c9a772 to 69cf922 on February 10, 2026 23:38
Combine the happy path, corrupt share, and unhappy path scenarios
into a single test to avoid redundant swarm startups. The merged
test covers: baseline randomness, corrupt share with sufficient
valid shares, stall when below threshold, and recovery after
restarting a stopped validator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@danielxiangzl danielxiangzl force-pushed the danielxiangzl/rand-optimistic-verify branch from 69cf922 to e8cc7bd on February 10, 2026 23:48

Use V1 randomness config instead of V2 (default_enabled) so the fast
randomness path is disabled and the slow path is always exercised.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

```rust
}
if self.total_weight < rand_config.threshold() {
    return Either::Left(self);
}
```

Blocking crypto verification holds RandStore mutex for milliseconds

Medium Severity

pre_aggregate_verify performs expensive multi-pairing cryptographic checks synchronously inside ShareAggregator::try_aggregate, which is called from RandStore::add_share and add_rand_metadata while the RandStore Mutex is held. Before this change, try_aggregate only did a cheap threshold check before spawning. Now the mutex is held for the duration of the batch verification (~1-5ms happy path, much longer on fallback), blocking concurrent store access from the reliable broadcast callback path (ShareAggregateState::add) and the main event loop. The prior code did all crypto verification outside the mutex (at message receipt time).
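
A minimal sketch of the locking pattern being described (type and method names are illustrative, not the actual RandStore types):

```rust
use std::sync::Mutex;

/// Illustrative stand-in for the share store guarded by a mutex.
struct StoreSketch {
    shares: Vec<u64>,
}

impl StoreSketch {
    /// Stand-in for the expensive multi-pairing batch verification.
    fn pre_aggregate_verify(&mut self) {}
}

fn add_share_sketch(store: &Mutex<StoreSketch>, share: u64) {
    // The lock is taken here and is still held while the batch verification
    // runs, so concurrent callers (reliable broadcast callback, main event
    // loop) block for the duration of the crypto work.
    let mut guard = store.lock().unwrap();
    guard.shares.push(share);
    guard.pre_aggregate_verify(); // crypto work under the mutex
    // The lock is released only when `guard` goes out of scope here.
}
```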

```rust
    THREAD_MANAGER.get_non_exe_cpu_pool(),
)
.map_err(|e| anyhow!("Share::aggregate failed with WVUF derive_eval error: {e}"))?;
debug!("WVUF derivation time: {} ms", timer.elapsed().as_millis());
```

Collaborator comment: replace debug print with proper counter?

```rust
let batch_ok = Self::build_apks_and_proofs(&shares_vec, rand_config)
    .and_then(|apks_and_proofs| {
        let proof = WVUF::aggregate_shares(&rand_config.wconfig, &apks_and_proofs);
        let metadata_serialized = bcs::to_bytes(rand_metadata)
```

Collaborator comment: nit: this probably can be moved out and shared with the aggregate function.

Add two new Prometheus histograms to measure randomness verification
and aggregation performance:
- aptos_consensus_rand_pre_aggregate_verify_seconds
- aptos_consensus_rand_aggregation_seconds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
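
For reference, a hedged sketch of how such histograms are typically registered with the prometheus crate (metric names from the commit message; help strings, buckets, and the aptos-metrics-core wrappers are assumptions):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

pub static RAND_PRE_AGGREGATE_VERIFY_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "aptos_consensus_rand_pre_aggregate_verify_seconds",
        "Latency of pre-aggregation batch verification of randomness shares"
    )
    .unwrap()
});

pub static RAND_AGGREGATION_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "aptos_consensus_rand_aggregation_seconds",
        "Latency of randomness share aggregation"
    )
    .unwrap()
});

fn timed_pre_aggregate_verify_example() {
    // Observes wall-clock seconds around the verification call.
    let timer = RAND_PRE_AGGREGATE_VERIFY_SECONDS.start_timer();
    // ... run pre_aggregate_verify here ...
    timer.observe_duration();
}
```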

@github-actions

✅ Forge suite compat success on f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e

Compatibility test results for f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e (PR)
1. Check liveness of validators at old version: f7584aaea312be8840db1ba2e97a961af7c54f87
compatibility::simple-validator-upgrade::liveness-check : committed: 13508.49 txn/s, latency: 2569.98 ms, (p50: 2700 ms, p70: 2800, p90: 3300 ms, p99: 3700 ms), latency samples: 442800
2. Upgrading first Validator to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5918.00 txn/s, latency: 5704.35 ms, (p50: 6300 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 204120
3. Upgrading rest of first batch to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5931.85 txn/s, latency: 5654.87 ms, (p50: 6300 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 205080
4. upgrading second batch to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10323.13 txn/s, latency: 3177.72 ms, (p50: 3000 ms, p70: 3800, p90: 4300 ms, p99: 4500 ms), latency samples: 342860
5. check swarm health
Compatibility test for f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on fe80749788a534216dcb1c1414212733d4107a9e

Forge report malformed: Expecting ',' delimiter: line 8 column 6 (char 148)
'{\n  "metrics": [\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "submitted_txn",\n      "value": 5057160.0\n    },\n[2026-02-11T22:04:18Z INFO  aptos_forge::report] Test Ok\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "expired_txn",\n      "value": 0.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "avg_tps",\n      "value": 13609.558555399399\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "avg_latency",\n      "value": 2769.29146319278\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p50_latency",\n      "value": 2700.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p90_latency",\n      "value": 3000.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p99_latency",\n      "value": 3600.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "submitted_txn",\n      "value": 42740.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "expired_txn",\n      "value": 0.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "avg_tps",\n      "value": 100.01470442884742\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "avg_latency",\n      "value": 793.6168674698795\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p50_latency",\n      "value": 700.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p90_latency",\n      "value": 900.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p99_latency",\n      "value": 4300.0\n    }\n  ],\n  "text": "two traffics test: inner traffic : committed: 13609.56 txn/s, latency: 2769.29 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3600 ms), latency samples: 5057160\\ntwo traffics test : committed: 100.01 txn/s, latency: 793.62 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 4300 ms), latency samples: 1660\\nLatency breakdown for phase 0: [\\"MempoolToBlockCreation: max: 2.232, avg: 2.151\\", \\"ConsensusProposalToOrdered: max: 0.168, avg: 0.166\\", \\"ConsensusOrderedToCommit: max: 0.046, avg: 0.042\\", \\"ConsensusProposalToCommit: max: 0.214, avg: 0.208\\"]\\nMax non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.42s no progress at version 5596263 (avg 0.07s) [limit 15].\\nMax epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.31s no progress at version 2362455 (avg 0.31s) [limit 16].\\nTest Ok"\n}'
Trailing Log Lines:
networkchaos.chaos-mesh.org "4-gcp--as-southeast1-to-3-gcp--us-east4-netem" deleted from forge-e2e-pr-18646 namespace
test CompositeNetworkTest ... ok
Test Statistics: 
two traffics test: inner traffic : committed: 13609.56 txn/s, latency: 2769.29 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3600 ms), latency samples: 5057160
two traffics test : committed: 100.01 txn/s, latency: 793.62 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 4300 ms), latency samples: 1660
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.232, avg: 2.151", "ConsensusProposalToOrdered: max: 0.168, avg: 0.166", "ConsensusOrderedToCommit: max: 0.046, avg: 0.042", "ConsensusProposalToCommit: max: 0.214, avg: 0.208"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.42s no progress at version 5596263 (avg 0.07s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.31s no progress at version 2362455 (avg 0.31s) [limit 16].
Test Ok

=== BEGIN JUNIT ===
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="forge" tests="1" failures="0" errors="0" uuid="591d6ba6-20b3-4223-9581-2823c7418023">
    <testsuite name="local" tests="1" disabled="0" errors="0" failures="0">
        <testcase name="CompositeNetworkTest(network:multi-region-network-emulation(two traffics test)) with ">
        </testcase>
    </testsuite>
</testsuites>
=== END JUNIT ===
[2026-02-11T22:04:18Z INFO  aptos_forge::backend::k8s::cluster_helper] Deleting namespace forge-e2e-pr-18646: Some(NamespaceStatus { conditions: None, phase: Some("Terminating") })
[2026-02-11T22:04:18Z INFO  aptos_forge::backend::k8s::cluster_helper] aptos-node resources for Forge removed in namespace: forge-e2e-pr-18646

test result: ok. 1 passed; 0 soft failed; 0 hard failed; 0 filtered out

Debugging output:
NAME                                         READY   STATUS      RESTARTS   AGE
aptos-node-0-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-0-validator-0                     1/1     Running     0          12m
aptos-node-1-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-1-validator-0                     1/1     Running     0          12m
aptos-node-2-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-2-validator-0                     1/1     Running     0          12m
aptos-node-3-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-3-validator-0                     1/1     Running     0          12m
aptos-node-4-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-4-validator-0                     1/1     Running     0          12m
aptos-node-5-validator-0                     1/1     Running     0          12m
aptos-node-6-validator-0                     1/1     Running     0          12m
forge-testnet-deployer-46wgb                 0/1     Completed   0          13m
genesis-aptos-genesis-eforgea9c006e2-24fsh   0/1     Completed   0          12m

…le parallel pass

Use pairing bilinearity to swap G2 scalar muls for cheaper G1 scalar
muls, and fuse the scalar mul + to_affine + miller_loop into a single
rayon par_iter to avoid double thread-pool scheduling overhead.

Reduces pre_aggregate_verify latency by ~12% for 150 nodes (8.9ms → 7.8ms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
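
The identity being exploited is bilinearity of the pairing: a scalar can be moved from the expensive G2 side to the cheaper G1 side without changing the result,

```latex
e(s \cdot P,\ Q) \;=\; e(P,\ Q)^{s} \;=\; e(P,\ s \cdot Q),
\qquad P \in \mathbb{G}_1,\quad Q \in \mathbb{G}_2,\quad s \in \mathbb{Z}_r .
```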

Labels

CICD:run-e2e-tests, v1.41

5 participants