
Conversation

@danielxiangzl (Contributor) commented on Feb 6, 2026

Summary

Adds optimistic randomness share verification to reduce per-share cryptographic overhead. Instead of individually verifying each share on receipt (an expensive pairing per share), shares now pass only cheap structural checks on receipt and are batch-verified at aggregation time with a single multi-pairing check. If the batch check fails, the node falls back to parallel individual verification to identify and discard bad shares.

Motivation

Previously, every inbound RandShare was individually verified with a full pairing check on receipt. For N validators, this means N expensive pairing operations per round. With optimistic batch verification, the happy path (no bad shares) requires only a single multi-pairing check at aggregation time — significantly reducing CPU overhead.

Design

Optimistic verification flow

  1. On share receipt: Only cheap structural checks via optimistic_verify() — valid author, certified APK, correct metadata. No cryptographic verification.

  2. At aggregation time (threshold reached): pre_aggregate_verify() performs batch verification (see the sketch after this list):

    • Aggregate all shares and run WVUF::verify_proof() (single multi-pairing check)
    • If batch passes → proceed to aggregation (happy path)
    • If batch fails → fall back to parallel individual verification via rayon, identify bad shares, remove them, add authors to pessimistic set

  3. After bad share removal: If remaining valid shares still meet threshold → proceed to aggregation. If below threshold → stay in PendingDecision and wait for more shares.

  4. Pessimistic set: Authors whose shares failed verification are tracked. Future shares from these authors are individually verified on receipt, bypassing the optimistic path.
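
A rough Rust sketch of steps 2-4 above (the helper names batch_verify, verify_share, and pessimistic_set are illustrative stand-ins, not the actual TShare / RandStore APIs):

```rust
use rayon::prelude::*;
use std::collections::HashSet;

/// Illustrative share type; the real type is Share in consensus/types.rs.
struct Share {
    author: u64,
}

/// Try one cheap batch (multi-pairing) verification first; only on failure
/// verify shares individually in parallel to find and evict the offenders.
fn pre_aggregate_verify_sketch(
    shares: &mut Vec<Share>,
    pessimistic_set: &mut HashSet<u64>,
    batch_verify: impl Fn(&[Share]) -> bool,
    verify_share: impl Fn(&Share) -> bool + Sync,
) {
    if batch_verify(shares.as_slice()) {
        return; // happy path: a single multi-pairing check, nothing to remove
    }
    // Fallback: verify each share individually in parallel, collect bad authors.
    let bad_authors: HashSet<u64> = shares
        .par_iter()
        .filter_map(|share| {
            if verify_share(share) {
                None
            } else {
                Some(share.author)
            }
        })
        .collect();
    // Future shares from these authors bypass the optimistic path on receipt.
    pessimistic_set.extend(bad_authors.iter().copied());
    shares.retain(|share| !bad_authors.contains(&share.author));
    // The caller re-checks the weight threshold afterwards; if it is no longer
    // met, the aggregator stays in PendingDecision and waits for more shares.
}
```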

Design tradeoff: blocking pre-verification vs. async verification with retry

Two approaches were considered for handling verification failures:

Approach A — Async verification with retry (rejected, see #18698): Run verification inside the async aggregation spawn. On failure, introduce a new Aggregating pending state, send the result back through a channel, and retry aggregation after removing bad shares. This requires new state machine transitions, a recovery handler, and re-broadcast logic for items stuck in the Aggregating state.

Approach B — Blocking pre-verification (chosen): Run pre_aggregate_verify() synchronously before spawning the async aggregation task. Nodes know upfront whether aggregation will succeed. If not enough valid shares remain after removing bad ones, the ShareAggregator simply stays in PendingDecision — no new states, no channels, no retry logic needed.

Approach B was chosen because it is significantly simpler (no new state machine states or recovery paths) while having minimal performance impact — pre_aggregate_verify runs on the non-exe thread pool and only adds latency when bad shares are present (the fallback path). In the happy case (no bad shares), the batch verification is fast and the overhead of running it synchronously is negligible.
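
A minimal sketch of where the blocking call sits under Approach B (struct and method names mirror the description above but are illustrative, not the exact signatures):

```rust
/// Illustrative aggregator; the real type is ShareAggregator in rand_store.rs.
struct ShareAggregatorSketch {
    total_weight: u64,
    threshold: u64,
}

impl ShareAggregatorSketch {
    /// Batch-verify collected shares; on failure, drop bad ones (weight shrinks).
    fn pre_aggregate_verify(&mut self) { /* see the flow sketch above */ }

    /// Hand the actual WVUF aggregation off to a background thread pool.
    fn spawn_aggregation(&self) { /* async spawn in the real code */ }

    /// Verify synchronously before spawning, so the node knows up front
    /// whether aggregation can succeed; otherwise it simply keeps waiting.
    fn try_aggregate(&mut self) {
        if self.total_weight < self.threshold {
            return; // below threshold: stay in PendingDecision
        }
        self.pre_aggregate_verify();
        if self.total_weight < self.threshold {
            return; // bad shares dropped us below threshold: wait for more
        }
        self.spawn_aggregation();
    }
}
```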

Parallelized WVUF verification

The Pinkas WVUF::verify_proof() implementation now accepts a rayon ThreadPool parameter and parallelizes both G2 scalar multiplications and the multi-pairing check. The non-exe CPU pool (16-32 threads) is used to avoid blocking the execution thread pool.
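
A hedged sketch of the parallelization pattern (placeholder arithmetic stands in for the actual group operations and multi-pairing):

```rust
use rayon::prelude::*;
use rayon::ThreadPool;

// Placeholder types: the real code operates on G1/G2 points and field scalars.
type Scalar = u64;
type Point = u64;

fn scalar_mul(point: Point, scalar: Scalar) -> Point {
    point.wrapping_mul(scalar) // stand-in for an elliptic-curve scalar mul
}

/// The caller passes a rayon ThreadPool (the non-exe CPU pool in practice),
/// and the per-share scalar multiplications run inside pool.install so the
/// execution thread pool is never blocked.
fn verify_proof_sketch(pool: &ThreadPool, bases: &[Point], scalars: &[Scalar]) -> bool {
    let terms: Vec<Point> = pool.install(|| {
        bases
            .par_iter()
            .zip(scalars.par_iter())
            .map(|(base, scalar)| scalar_mul(*base, *scalar))
            .collect()
    });
    // In the real implementation these terms feed a single multi-pairing check.
    !terms.is_empty()
}
```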

Changes

| File | Change |
| --- | --- |
| config/consensus_config.rs | Add optimistic_rand_share_verification flag (default true) |
| consensus/types.rs | Add pre_aggregate_verify to TShare trait; implement batch verify + parallel fallback for Share; remove redundant verification from Share::aggregate |
| consensus/rand_store.rs | Call pre_aggregate_verify before spawning; remove bad shares and re-check threshold; use non-exe thread pool for aggregation |
| consensus/network_messages.rs | Switch share receipt to optimistic_verify() |
| consensus/reliable_broadcast_state.rs | Switch RB share receipt to optimistic_verify() |
| consensus/epoch_manager.rs | Thread aggregate PK and config flag into RandConfig |
| aptos-dkg/weighted_vuf/traits.rs | Add thread_pool param to verify_proof |
| aptos-dkg/weighted_vuf/pinkas/mod.rs | Parallelize G2 scalar muls + multi-pairing in verify_proof |
| aptos-dkg/weighted_vuf/bls/mod.rs | Accept thread_pool param (unused) |
| aptos-dkg/benches/weighted_vuf.rs | Benchmark verify_proof across thread counts |
| smoke-test/optimistic_verification.rs | Combined smoke test: happy path, corrupt share fallback, chain stall, recovery |
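
For illustration, the new flag could look roughly like this in the consensus config (field name from the table above; the surrounding struct and serde details are assumptions):

```rust
#[derive(Clone, Debug)]
pub struct RandShareVerificationConfigSketch {
    /// When true, incoming randomness shares pass only cheap structural checks
    /// on receipt and are batch-verified at aggregation time.
    pub optimistic_rand_share_verification: bool,
}

impl Default for RandShareVerificationConfigSketch {
    fn default() -> Self {
        Self {
            optimistic_rand_share_verification: true, // default true per this PR
        }
    }
}
```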

Test plan

  • 9 unit tests covering: happy path, bad share fallback, optimistic skip, pessimistic set, disabled mode, insufficient valid shares, pre_aggregate_verify (happy/bad/non-optimistic)
  • 1 combined smoke test with 4 phases: happy path → corrupt share with sufficient shares → chain stall (below threshold) → recovery after validator restart
  • All existing randomness tests pass
  • cargo check / clippy / fmt clean

🤖 Generated with Claude Code


Note

Medium Risk
Touches consensus randomness verification and aggregation plus underlying cryptographic verification APIs; correctness/performance regressions could stall randomness-dependent progress if the new optimistic/batch path misbehaves.

Overview
Adds an optimistic randomness-share verification path: incoming RandShares now do only cheap structural validation on receipt and defer cryptographic verification to a batch check when enough shares exist to aggregate.

On aggregation, the share store runs a new pre_aggregate_verify step to batch-verify shares (and, on failure, fall back to parallel individual verification), drops invalid authors, tracks offenders in a pessimistic set for future per-share verification, and adds new metrics for verification/aggregation latency. This also threads an aggregate public key + config flag through RandConfig, updates WVUF verify_proof to accept a rayon thread pool with a more parallel Pinkas implementation, and adds unit + smoke tests (including failpoint-based corrupt-share scenarios) to cover happy-path, fallback, stall, and recovery behavior.

Written by Cursor Bugbot for commit 05906c9. This will update automatically on new commits.

@danielxiangzl danielxiangzl added the CICD:run-e2e-tests label on Feb 7, 2026

@zekun000 zekun000 force-pushed the danielxiangzl/rand-optimistic-verify branch from a3cf572 to 85cd15a on February 7, 2026 01:46

@danielxiangzl danielxiangzl force-pushed the danielxiangzl/rand-optimistic-verify branch from 5c9a772 to 69cf922 on February 10, 2026 23:38
Combine the happy path, corrupt share, and unhappy path scenarios
into a single test to avoid redundant swarm startups. The merged
test covers: baseline randomness, corrupt share with sufficient
valid shares, stall when below threshold, and recovery after
restarting a stopped validator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@danielxiangzl danielxiangzl force-pushed the danielxiangzl/rand-optimistic-verify branch from 69cf922 to e8cc7bd on February 10, 2026 23:48

Use V1 randomness config instead of V2 (default_enabled) so the fast
randomness path is disabled and the slow path is always exercised.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

```rust
}
if self.total_weight < rand_config.threshold() {
    return Either::Left(self);
}
```

Blocking crypto verification holds RandStore mutex for milliseconds

Medium Severity

pre_aggregate_verify performs expensive multi-pairing cryptographic checks synchronously inside ShareAggregator::try_aggregate, which is called from RandStore::add_share and add_rand_metadata while the RandStore Mutex is held. Before this change, try_aggregate only did a cheap threshold check before spawning. Now the mutex is held for the duration of the batch verification (~1-5ms happy path, much longer on fallback), blocking concurrent store access from the reliable broadcast callback path (ShareAggregateState::add) and the main event loop. The prior code did all crypto verification outside the mutex (at message receipt time).
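
A minimal sketch of the locking pattern being described (type and method names are illustrative, not the actual RandStore types):

```rust
use std::sync::Mutex;

/// Illustrative stand-in for the share store guarded by a mutex.
struct StoreSketch {
    shares: Vec<u64>,
}

impl StoreSketch {
    /// Stand-in for the expensive multi-pairing batch verification.
    fn pre_aggregate_verify(&mut self) {}
}

fn add_share_sketch(store: &Mutex<StoreSketch>, share: u64) {
    // The lock is taken here and is still held while the batch verification
    // runs, so concurrent callers (reliable broadcast callback, main event
    // loop) block for the duration of the crypto work.
    let mut guard = store.lock().unwrap();
    guard.shares.push(share);
    guard.pre_aggregate_verify(); // crypto work under the mutex
    // The lock is released only when `guard` goes out of scope here.
}
```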

```rust
    THREAD_MANAGER.get_non_exe_cpu_pool(),
)
.map_err(|e| anyhow!("Share::aggregate failed with WVUF derive_eval error: {e}"))?;
debug!("WVUF derivation time: {} ms", timer.elapsed().as_millis());
```

Collaborator comment: replace debug print with proper counter?

```rust
let batch_ok = Self::build_apks_and_proofs(&shares_vec, rand_config)
    .and_then(|apks_and_proofs| {
        let proof = WVUF::aggregate_shares(&rand_config.wconfig, &apks_and_proofs);
        let metadata_serialized = bcs::to_bytes(rand_metadata)
```

Collaborator comment: nit: this probably can be moved out and shared with the aggregate function.

Add two new Prometheus histograms to measure randomness verification
and aggregation performance:
- aptos_consensus_rand_pre_aggregate_verify_seconds
- aptos_consensus_rand_aggregation_seconds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
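
For reference, a hedged sketch of how such histograms are typically registered with the prometheus crate (metric names from the commit message; help strings, buckets, and the aptos-metrics-core wrappers are assumptions):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

pub static RAND_PRE_AGGREGATE_VERIFY_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "aptos_consensus_rand_pre_aggregate_verify_seconds",
        "Latency of pre-aggregation batch verification of randomness shares"
    )
    .unwrap()
});

pub static RAND_AGGREGATION_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "aptos_consensus_rand_aggregation_seconds",
        "Latency of randomness share aggregation"
    )
    .unwrap()
});

fn timed_pre_aggregate_verify_example() {
    // Observes wall-clock seconds around the verification call.
    let timer = RAND_PRE_AGGREGATE_VERIFY_SECONDS.start_timer();
    // ... run pre_aggregate_verify here ...
    timer.observe_duration();
}
```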

@github-actions

✅ Forge suite compat success on f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e

Compatibility test results for f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e (PR)
1. Check liveness of validators at old version: f7584aaea312be8840db1ba2e97a961af7c54f87
compatibility::simple-validator-upgrade::liveness-check : committed: 13508.49 txn/s, latency: 2569.98 ms, (p50: 2700 ms, p70: 2800, p90: 3300 ms, p99: 3700 ms), latency samples: 442800
2. Upgrading first Validator to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5918.00 txn/s, latency: 5704.35 ms, (p50: 6300 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 204120
3. Upgrading rest of first batch to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5931.85 txn/s, latency: 5654.87 ms, (p50: 6300 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 205080
4. upgrading second batch to new version: fe80749788a534216dcb1c1414212733d4107a9e
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10323.13 txn/s, latency: 3177.72 ms, (p50: 3000 ms, p70: 3800, p90: 4300 ms, p99: 4500 ms), latency samples: 342860
5. check swarm health
Compatibility test for f7584aaea312be8840db1ba2e97a961af7c54f87 ==> fe80749788a534216dcb1c1414212733d4107a9e passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on fe80749788a534216dcb1c1414212733d4107a9e

Forge report malformed: Expecting ',' delimiter: line 8 column 6 (char 148)
'{\n  "metrics": [\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "submitted_txn",\n      "value": 5057160.0\n    },\n[2026-02-11T22:04:18Z INFO  aptos_forge::report] Test Ok\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "expired_txn",\n      "value": 0.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "avg_tps",\n      "value": 13609.558555399399\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "avg_latency",\n      "value": 2769.29146319278\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p50_latency",\n      "value": 2700.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p90_latency",\n      "value": 3000.0\n    },\n    {\n      "test_name": "two traffics test: inner traffic",\n      "metric": "p99_latency",\n      "value": 3600.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "submitted_txn",\n      "value": 42740.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "expired_txn",\n      "value": 0.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "avg_tps",\n      "value": 100.01470442884742\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "avg_latency",\n      "value": 793.6168674698795\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p50_latency",\n      "value": 700.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p90_latency",\n      "value": 900.0\n    },\n    {\n      "test_name": "two traffics test",\n      "metric": "p99_latency",\n      "value": 4300.0\n    }\n  ],\n  "text": "two traffics test: inner traffic : committed: 13609.56 txn/s, latency: 2769.29 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3600 ms), latency samples: 5057160\\ntwo traffics test : committed: 100.01 txn/s, latency: 793.62 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 4300 ms), latency samples: 1660\\nLatency breakdown for phase 0: [\\"MempoolToBlockCreation: max: 2.232, avg: 2.151\\", \\"ConsensusProposalToOrdered: max: 0.168, avg: 0.166\\", \\"ConsensusOrderedToCommit: max: 0.046, avg: 0.042\\", \\"ConsensusProposalToCommit: max: 0.214, avg: 0.208\\"]\\nMax non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.42s no progress at version 5596263 (avg 0.07s) [limit 15].\\nMax epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.31s no progress at version 2362455 (avg 0.31s) [limit 16].\\nTest Ok"\n}'
Trailing Log Lines:
networkchaos.chaos-mesh.org "4-gcp--as-southeast1-to-3-gcp--us-east4-netem" deleted from forge-e2e-pr-18646 namespace
test CompositeNetworkTest ... ok
Test Statistics: 
two traffics test: inner traffic : committed: 13609.56 txn/s, latency: 2769.29 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3600 ms), latency samples: 5057160
two traffics test : committed: 100.01 txn/s, latency: 793.62 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 4300 ms), latency samples: 1660
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.232, avg: 2.151", "ConsensusProposalToOrdered: max: 0.168, avg: 0.166", "ConsensusOrderedToCommit: max: 0.046, avg: 0.042", "ConsensusProposalToCommit: max: 0.214, avg: 0.208"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.42s no progress at version 5596263 (avg 0.07s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.31s no progress at version 2362455 (avg 0.31s) [limit 16].
Test Ok

=== BEGIN JUNIT ===
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="forge" tests="1" failures="0" errors="0" uuid="591d6ba6-20b3-4223-9581-2823c7418023">
    <testsuite name="local" tests="1" disabled="0" errors="0" failures="0">
        <testcase name="CompositeNetworkTest(network:multi-region-network-emulation(two traffics test)) with ">
        </testcase>
    </testsuite>
</testsuites>
=== END JUNIT ===
[2026-02-11T22:04:18Z INFO  aptos_forge::backend::k8s::cluster_helper] Deleting namespace forge-e2e-pr-18646: Some(NamespaceStatus { conditions: None, phase: Some("Terminating") })
[2026-02-11T22:04:18Z INFO  aptos_forge::backend::k8s::cluster_helper] aptos-node resources for Forge removed in namespace: forge-e2e-pr-18646

test result: ok. 1 passed; 0 soft failed; 0 hard failed; 0 filtered out

Debugging output:
NAME                                         READY   STATUS      RESTARTS   AGE
aptos-node-0-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-0-validator-0                     1/1     Running     0          12m
aptos-node-1-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-1-validator-0                     1/1     Running     0          12m
aptos-node-2-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-2-validator-0                     1/1     Running     0          12m
aptos-node-3-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-3-validator-0                     1/1     Running     0          12m
aptos-node-4-fullnode-eforgea9c006e2-0       1/1     Running     0          12m
aptos-node-4-validator-0                     1/1     Running     0          12m
aptos-node-5-validator-0                     1/1     Running     0          12m
aptos-node-6-validator-0                     1/1     Running     0          12m
forge-testnet-deployer-46wgb                 0/1     Completed   0          13m
genesis-aptos-genesis-eforgea9c006e2-24fsh   0/1     Completed   0          12m

…le parallel pass

Use pairing bilinearity to swap G2 scalar muls for cheaper G1 scalar
muls, and fuse the scalar mul + to_affine + miller_loop into a single
rayon par_iter to avoid double thread-pool scheduling overhead.

Reduces pre_aggregate_verify latency by ~12% for 150 nodes (8.9ms → 7.8ms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
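
The identity being exploited is bilinearity of the pairing: a scalar can be moved from the expensive G2 side to the cheaper G1 side without changing the result,

```latex
e(s \cdot P,\ Q) \;=\; e(P,\ Q)^{s} \;=\; e(P,\ s \cdot Q),
\qquad P \in \mathbb{G}_1,\quad Q \in \mathbb{G}_2,\quad s \in \mathbb{Z}_r .
```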

Labels

CICD:run-e2e-tests, v1.41

5 participants