Add GPU microbenchmark mode with per-component timing #6
Add --bench flag to measure individual GPU kernel performance using OpenCL event profiling timestamps:

- PBKDF2: BIP39 seed derivation (2048 iterations)
- BIP32: Key derivation for m/44'/429'/0'/0 path
- secp256k1: Public key generation from private key
- Base58: Address encoding with Blake2b checksum

Features:

- Per-device and combined multi-GPU statistics
- Configurable iterations, warmup, batch size, and num_indices
- --bench-validate flag to verify kernels aren't optimized away
- Uses exact production codepaths for accurate measurements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a GPU benchmarking feature: CLI flags to configure benchmarks, new OpenCL benchmark kernels, a benchmark runner with profiling/validation/reporting, profiling-enabled GPU context creation, and program-building support for benchmark kernels. The CLI can run per-device benchmarks and print aggregated results.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant CLI as CLI (main.rs)
participant BenchMod as Bench Runner (bench.rs)
participant GpuCtx as GPU Context (context.rs)
participant KernProg as Kernel Program (kernel.rs)
participant OCL as OpenCL Runtime
User->>CLI: run with --bench and args
CLI->>CLI: parse args -> BenchConfig
CLI->>CLI: parse device list
loop per selected device
CLI->>BenchMod: run_bench_on_device(device_idx, cfg)
BenchMod->>GpuCtx: with_device_profiling(device_idx)
GpuCtx->>OCL: create context & profiling-enabled queue
OCL-->>GpuCtx: context ready
GpuCtx-->>BenchMod: GpuContext
BenchMod->>KernProg: bench(ctx)
KernProg->>OCL: compile bench.cl + deps
OCL-->>KernProg: program built
KernProg-->>BenchMod: GpuProgram
BenchMod->>OCL: allocate/upload buffers (salt, wordlist, checksums)
loop for each kernel (pbkdf2,bip32,secp256k1,base58)
BenchMod->>OCL: run warmup iterations
OCL-->>BenchMod: complete
loop timed iterations
BenchMod->>OCL: enqueue kernel (profiling)
OCL-->>BenchMod: profiling timestamps
BenchMod->>BenchMod: extract ns, accumulate ComponentStats
end
opt validate
BenchMod->>BenchMod: validate checksums
end
end
BenchMod-->>CLI: return DeviceBenchStats
end
CLI->>BenchMod: print_bench_results(results, cfg)
BenchMod->>User: formatted per-device & combined tables
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3925f1b6b
Actionable comments posted: 0
🧹 Nitpick comments (3)
crates/erg-vanity-gpu/kernels/bench.cl (1)
119-126: Silent `continue` on address derivation failure may mask kernel issues.

When `bip32_derive_address_index` fails, the loop silently continues without contributing to the checksum. This is fine for benchmarking valid paths, but if many derivations fail (e.g., due to a bug), the benchmark would appear faster than reality while producing misleading checksums. Consider tracking failure counts or using a different sentinel pattern for partial failures if validation accuracy is important.
crates/erg-vanity-gpu/src/kernel.rs (1)
204-258: Consider extracting shared kernel concatenation logic.

The `bench()` and `vanity()` functions share ~90% identical code for concatenating kernel sources. While the current approach is clear and explicit, a helper function could reduce maintenance burden.

🔎 Possible refactor approach:

```rust
fn build_combined_source(final_kernel: &str, final_kernel_name: &str) -> String {
    let mut combined = String::with_capacity(
        sources::SHA256.len()
            + sources::SHA512.len()
            // ... rest of sources ...
            + final_kernel.len()
            + 1024,
    );
    combined.push_str("// === sha256.cl ===\n");
    combined.push_str(sources::SHA256);
    // ... shared concatenation ...
    combined.push_str(&format!("\n\n// === {} ===\n", final_kernel_name));
    combined.push_str(final_kernel);
    combined.push('\n');
    combined
}
```

crates/erg-vanity-gpu/src/bench.rs (1)
297-320: Potential overflow in counter calculation for large configurations.

The counter calculation `counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64)` could overflow for extreme configurations (e.g., very large batch sizes with many iterations). While unlikely in practice with typical benchmark parameters, consider using `saturating_mul` or `checked_mul` for defensive coding.

🔎 Defensive overflow handling:

```diff
 let counter_start =
-    counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64);
+    counter_offset
+        .saturating_add(
+            ((cfg.warmup as u64) + (iter as u64))
+                .saturating_mul(cfg.batch_size as u64)
+        );
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- crates/erg-vanity-cli/src/main.rs
- crates/erg-vanity-gpu/kernels/bench.cl
- crates/erg-vanity-gpu/src/bench.rs
- crates/erg-vanity-gpu/src/context.rs
- crates/erg-vanity-gpu/src/kernel.rs
- crates/erg-vanity-gpu/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/erg-vanity-gpu/src/lib.rs (1)
crates/erg-vanity-gpu/src/kernel.rs (1)
`bench` (208-258)
crates/erg-vanity-cli/src/main.rs (1)
crates/erg-vanity-gpu/src/bench.rs (2)
`run_bench_on_device` (71-200), `print_bench_results` (329-343)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (19)
crates/erg-vanity-gpu/kernels/bench.cl (6)
1-9: Well-documented kernel dependencies and purpose. Clear header documenting the required kernel concatenation order and the design goal of using exact production codepaths.
13-37: LGTM - Deterministic entropy generation. Properly mixes `gid`, `counter_start`, and `salt` using Blake2b to produce unique 32-byte entropy per work item. The comment notes this is copied verbatim from `vanity.cl`, ensuring consistent behavior.
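For readers unfamiliar with the pattern, here is a host-side Rust sketch of the same mixing idea, assuming the `blake2` crate; the function name and byte layout are illustrative, and the real logic lives in the OpenCL kernel:

```rust
// Sketch only: derive per-work-item entropy by hashing (gid, counter_start, salt)
// with Blake2b-256, so every work item in a batch exercises distinct input data.
use blake2::digest::{Update, VariableOutput};
use blake2::Blake2bVar;

fn derive_entropy(gid: u64, counter_start: u64, salt: &[u8; 32]) -> [u8; 32] {
    let mut hasher = Blake2bVar::new(32).expect("32 bytes is a valid Blake2b output size");
    hasher.update(&gid.to_le_bytes());
    hasher.update(&counter_start.to_le_bytes());
    hasher.update(salt);
    let mut entropy = [0u8; 32];
    hasher
        .finalize_variable(&mut entropy)
        .expect("output buffer length matches");
    entropy
}

fn main() {
    let salt = [0u8; 32];
    // Different (gid, counter) pairs yield different entropy.
    assert_ne!(derive_entropy(0, 0, &salt), derive_entropy(1, 0, &salt));
    println!("per-work-item entropy differs as expected");
}
```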
41-53: Seed generation approach is sound. Generates a 64-byte seed by hashing the entropy twice with XOR perturbation. This provides deterministic, varied seeds for benchmarking BIP32/secp/base58 without PBKDF2 overhead.
62-87: PBKDF2 benchmark kernel looks correct. Uses the exact production `bip39_entropy_to_seed` call and XOR-folds the 64-byte seed into a checksum to prevent dead-code elimination.
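The XOR-fold itself is simple but worth spelling out; a minimal Rust sketch of the idea (illustrative only — the kernel does this in OpenCL C):

```rust
// Illustrative XOR-fold: collapse a 64-byte seed into a u32 so the result
// depends on every byte, which keeps the compiler from eliding the work.
fn xor_fold_seed(seed: &[u8; 64]) -> u32 {
    seed.chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .fold(0u32, |acc, word| acc ^ word)
}

fn main() {
    let mut seed = [0u8; 64];
    seed[0] = 0xAB;
    // Any change to the seed changes the checksum, so the kernel output
    // cannot be optimized away as dead code.
    assert_ne!(xor_fold_seed(&seed), xor_fold_seed(&[0u8; 64]));
}
```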
153-159: Private key variation approach is reasonable for isolation. XOR-modifying `base_privkey` with `addr_idx` provides unique keys per iteration without BIP32 overhead. This isolates secp256k1 timing from other components.
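A minimal Rust sketch of that variation idea (the exact byte offsets and names are assumptions, not the kernel's code):

```rust
// Illustrative: derive a distinct private key per address index by XOR-ing the
// index into the base key, so each timed iteration does real, different work
// without paying for BIP32 derivation.
fn vary_privkey(base: &[u8; 32], addr_idx: u32) -> [u8; 32] {
    let mut key = *base;
    for (i, b) in addr_idx.to_le_bytes().iter().enumerate() {
        key[28 + i] ^= b; // perturb the low bytes; exact offset is illustrative
    }
    key
}

fn main() {
    let base = [0x11u8; 32];
    assert_ne!(vary_privkey(&base, 1), vary_privkey(&base, 2));
}
```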
188-231: Base58 benchmark kernel correctly exercises the full encoding path. Uses `ergo_checksum` and `base58_encode_address`, matching production behavior. XOR-folding the encoded bytes prevents optimization.
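For reference, a host-side Rust sketch of the address layout being exercised, assuming the standard Ergo mainnet P2PK format (prefix byte `0x01`, 33-byte compressed key, 4-byte Blake2b-256 checksum) and the `blake2`/`bs58` crates; the kernel's `ergo_checksum` and `base58_encode_address` remain the authoritative implementation:

```rust
// Sketch of the address layout only; names and format details here are stated
// assumptions, not taken from the kernel source.
use blake2::digest::{Update, VariableOutput};
use blake2::Blake2bVar;

fn ergo_p2pk_address(compressed_pubkey: &[u8; 33]) -> String {
    let mut body = Vec::with_capacity(1 + 33 + 4);
    body.push(0x01); // assumed mainnet network id (0x00) + P2PK address type (0x01)
    body.extend_from_slice(compressed_pubkey);

    // Checksum: first 4 bytes of Blake2b-256 over prefix || pubkey.
    let mut hasher = Blake2bVar::new(32).expect("valid output size");
    hasher.update(&body);
    let mut digest = [0u8; 32];
    hasher.finalize_variable(&mut digest).expect("length matches");
    body.extend_from_slice(&digest[..4]);

    bs58::encode(body).into_string()
}

fn main() {
    let pubkey = [0x02u8; 33]; // placeholder compressed public key
    println!("{}", ergo_p2pk_address(&pubkey));
}
```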
crates/erg-vanity-gpu/src/lib.rs (1)

3-3: LGTM - Bench module correctly exposed. Follows the existing module pattern and makes the benchmarking API publicly accessible.
crates/erg-vanity-gpu/src/kernel.rs (1)
21-21: `BENCH` kernel source correctly embedded. Follows the existing pattern for kernel source constants.
crates/erg-vanity-gpu/src/context.rs (2)
71-78: Clean API extension for profiling support. The refactor maintains backward compatibility while adding `with_device_profiling()` for benchmark mode. Existing callers of `with_device()` are unaffected.
107-112: Profiling flag correctly applied to the command queue. The `PROFILING_ENABLE` property is conditionally set based on the `enable_profiling` parameter, which is required for OpenCL event timestamp extraction in the benchmark runner.
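A minimal sketch of the backward-compatible layering described in these two comments; the struct body, error type, and inner constructor are placeholders, and only the `with_device`/`with_device_profiling` names come from the diff:

```rust
// Illustrative only: with_device() keeps its old behaviour, while
// with_device_profiling() threads the flag through to where the command queue
// is created. The real code sets PROFILING_ENABLE on the OpenCL queue when
// the flag is true.
struct GpuContext {
    device_idx: usize,
    profiling: bool,
}

impl GpuContext {
    /// Existing entry point: unchanged behaviour, no profiling.
    fn with_device(device_idx: usize) -> Result<Self, String> {
        Self::with_device_inner(device_idx, false)
    }

    /// New entry point used by the benchmark runner.
    fn with_device_profiling(device_idx: usize) -> Result<Self, String> {
        Self::with_device_inner(device_idx, true)
    }

    fn with_device_inner(device_idx: usize, enable_profiling: bool) -> Result<Self, String> {
        // In the real implementation this is where the command queue is created,
        // with the profiling property only if enable_profiling is set.
        Ok(Self { device_idx, profiling: enable_profiling })
    }
}

fn main() {
    let bench_ctx = GpuContext::with_device_profiling(0).unwrap();
    let normal_ctx = GpuContext::with_device(0).unwrap();
    assert!(bench_ctx.profiling && !normal_ctx.profiling);
    let _ = (bench_ctx.device_idx, normal_ctx.device_idx);
}
```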
crates/erg-vanity-cli/src/main.rs (2)

62-84: Well-structured CLI arguments for benchmark mode. Clear argument names with sensible defaults. The `bench_num_indices` default falling back to the `--index` value is a good UX choice for consistency.
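The defaulting behaviour could look roughly like this clap sketch; apart from `--bench` and `--index`, the flag and field names are assumptions for illustration and may differ from main.rs:

```rust
// Sketch only: bench_num_indices falls back to the regular --index value when
// it is not given explicitly.
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Run GPU microbenchmarks instead of searching.
    #[arg(long)]
    bench: bool,

    /// Address index used by the normal search path.
    #[arg(long, default_value_t = 1)]
    index: u32,

    /// Indices per seed used only by the benchmark; defaults to --index.
    #[arg(long)]
    bench_num_indices: Option<u32>,
}

fn main() {
    let cli = Cli::parse();
    let bench_num_indices = cli.bench_num_indices.unwrap_or(cli.index);
    if cli.bench {
        println!("benchmarking with {bench_num_indices} indices per seed");
    }
}
```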
554-585: Benchmark mode implementation is correct. The benchmark path correctly (a minimal sketch follows this list):
- Runs before pattern validation (as intended)
- Reuses device list parsing
- Handles errors with appropriate exit codes
- Aggregates and prints results for all devices
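Put together, the control flow reads roughly like the sketch below; only `run_bench_on_device` and `print_bench_results` are real entry points from `erg_vanity_gpu::bench`, and everything else (types, error handling, exit code) is simplified:

```rust
// Illustrative control flow for the --bench path in main.rs.
#[derive(Default)]
struct BenchConfig; // fields elided here; see the fuller sketch under the BenchConfig comment below

struct DeviceBenchStats;

fn run_bench_on_device(_device_idx: usize, _cfg: &BenchConfig) -> Result<DeviceBenchStats, String> {
    Ok(DeviceBenchStats) // stand-in for the real GPU benchmark runner in bench.rs
}

fn print_bench_results(results: &[DeviceBenchStats], _cfg: &BenchConfig) {
    println!("{} device(s) benchmarked", results.len());
}

fn main() {
    let cfg = BenchConfig::default();
    let devices = vec![0usize]; // reuses the same device-list parsing as search mode
    let mut results = Vec::new();

    // The bench path runs before pattern validation, so no vanity pattern is required.
    for device_idx in devices {
        match run_bench_on_device(device_idx, &cfg) {
            Ok(stats) => results.push(stats),
            Err(e) => {
                eprintln!("benchmark failed on device {device_idx}: {e}");
                std::process::exit(1); // non-zero exit code on failure
            }
        }
    }
    print_bench_results(&results, &cfg);
}
```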
crates/erg-vanity-gpu/src/bench.rs (7)
12-37: `BenchConfig` with sensible defaults. The default batch size of 262,144 matches production, and 100 iterations with 5 warmup provides statistically meaningful results.
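Those defaults would look roughly like this; the field names are assumptions, and only the values come from the review:

```rust
// Sketch of the default benchmark parameters mentioned above. Field names are
// illustrative; the real struct lives in bench.rs.
struct BenchConfig {
    iterations: u32,
    warmup: u32,
    batch_size: u32,
    num_indices: u32,
    validate: bool,
}

impl Default for BenchConfig {
    fn default() -> Self {
        Self {
            iterations: 100,     // timed iterations per kernel
            warmup: 5,           // untimed warmup runs
            batch_size: 262_144, // matches the production dispatch size
            num_indices: 1,      // address indices derived per seed (assumed default)
            validate: false,     // --bench-validate turns this on
        }
    }
}

fn main() {
    let cfg = BenchConfig::default();
    println!(
        "iterations={} warmup={} batch_size={} num_indices={} validate={}",
        cfg.iterations, cfg.warmup, cfg.batch_size, cfg.num_indices, cfg.validate
    );
}
```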
70-105: Benchmark initialization is well-structured.
- Profiling-enabled context correctly created
- Buffer allocation upfront avoids allocation overhead during timing
- Conditional read_write vs write_only flags based on validation mode is a good optimization
111-161: Kernels built once and reused - good practice. Building kernels upfront avoids JIT compilation overhead during timed iterations. The uniform signature across all kernels simplifies the benchmarking loop.
164-184: Validation phase provides useful sanity checking. Running each kernel once with unique counter offsets and checking for degenerate checksums helps detect optimization artifacts that would invalidate benchmark results.
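A minimal sketch of the degenerate-checksum check being described (the real `validate_checksums` in `bench.rs` may sample and report differently):

```rust
// Illustrative validation: reject results that look like the kernel was
// optimized away (all-zero or all-identical checksums). Mirrors the intent,
// not the exact implementation, of the check in bench.rs.
fn validate_checksums(checksums: &[u32]) -> Result<(), String> {
    if checksums.is_empty() {
        return Err("no checksums to validate".into());
    }
    if checksums.iter().all(|&c| c == 0) {
        return Err("all checksums are zero; kernel output may be optimized away".into());
    }
    // With a single work item, "all identical" is trivially true, so skip it.
    if checksums.len() > 1 && checksums.windows(2).all(|w| w[0] == w[1]) {
        return Err("all checksums identical; work items may not be varying".into());
    }
    Ok(())
}

fn main() {
    assert!(validate_checksums(&[1, 2, 3]).is_ok());
    assert!(validate_checksums(&[0, 0, 0]).is_err());
    assert!(validate_checksums(&[7]).is_ok()); // n == 1: identical check skipped
}
```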
283-292: Warmup loop correctly varies the counter to avoid caching. The counter variation during warmup ensures each iteration processes different data, preventing unrealistic cache hit rates.
313-317: Good defensive validation of profiling timestamps. Checking for zero or invalid timestamps catches cases where profiling isn't properly enabled, providing a clear error message.
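For reference, turning the `CL_PROFILING_COMMAND_START`/`CL_PROFILING_COMMAND_END` timestamps into an elapsed time with the same defensive checks might look like this crate-agnostic sketch (timestamps are in nanoseconds):

```rust
// Illustrative: convert OpenCL profiling timestamps into an elapsed duration,
// rejecting the degenerate values the review mentions.
fn kernel_elapsed_ns(start_ns: u64, end_ns: u64) -> Result<u64, String> {
    if start_ns == 0 || end_ns == 0 {
        return Err("zero profiling timestamp; was the queue created with PROFILING_ENABLE?".into());
    }
    if end_ns < start_ns {
        return Err("end timestamp precedes start; profiling data is invalid".into());
    }
    Ok(end_ns - start_ns)
}

fn main() {
    assert_eq!(kernel_elapsed_ns(1_000, 4_500), Ok(3_500));
    assert!(kernel_elapsed_ns(0, 4_500).is_err());
}
```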
345-397: Clear per-device output with useful metrics. The table format shows total time, percentage breakdown, average per iteration, and per-unit cost (ns/seed or ns/addr) - all valuable for understanding performance characteristics.
- Fix uninitialized byte 32 in bench_base58 base_pubkey array
- Fix validate_checksums to handle batch_size < 16
- Skip all_identical check when n == 1 (trivially true)
- Use cfg.batch_size instead of buffer API for robustness
- Run cargo fmt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>