
Add GPU microbenchmark mode with per-component timing #6

Merged: arkadianet merged 2 commits into main from testing, Jan 5, 2026

Conversation

arkadianet (Owner) commented Jan 3, 2026

Add --bench flag to measure individual GPU kernel performance using OpenCL event profiling timestamps:

  • PBKDF2: BIP39 seed derivation (2048 iterations)
  • BIP32: Key derivation for m/44'/429'/0'/0 path
  • secp256k1: Public key generation from private key
  • Base58: Address encoding with Blake2b checksum

Features:

  • Per-device and combined multi-GPU statistics
  • Configurable iterations, warmup, batch size, and num_indices
  • --bench-validate flag to verify kernels aren't optimized away
  • Uses exact production codepaths for accurate measurements

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a GPU benchmark mode in the CLI to measure cryptographic pipeline performance across devices.
    • New benchmark options: iterations, warmup runs, batch size, per-index count, and optional result validation.
    • Per-device and combined reports with timing breakdowns for PBKDF2, BIP32, secp256k1, and Base58 stages; supports multi-device aggregation and profiling-enabled device selection.

coderabbitai bot commented Jan 3, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a GPU benchmarking feature: CLI flags to configure benchmarks, new OpenCL benchmark kernels, a benchmark runner with profiling/validation/reporting, profiling-enabled GPU context creation, and program-building support for benchmark kernels. The CLI can run per-device benchmarks and print aggregated results.

Changes

Cohort / File(s) | Summary
CLI Benchmark Configuration
crates/erg-vanity-cli/src/main.rs
Added benchmark args to Args (bench, bench_iters, bench_warmup, bench_batch_size, bench_num_indices, bench_validate); main now runs per-device benchmark flow when --bench is set and exits before pattern parsing.
GPU Benchmark Kernels
crates/erg-vanity-gpu/kernels/bench.cl
New OpenCL source with deterministic entropy/seed helpers and four benchmark kernels (bench_pbkdf2, bench_bip32, bench_secp256k1, bench_base58) that produce per-item checksums to avoid dead-code elimination.
GPU Benchmark Runner
crates/erg-vanity-gpu/src/bench.rs
New public BenchConfig, ComponentStats, DeviceBenchStats; run_bench_on_device() runs warmup/timed iterations with profiling, optional validation, and returns per-component timings; print_bench_results() prints per-device and combined summaries.
GPU Context Profiling
crates/erg-vanity-gpu/src/context.rs
Added GpuContext::with_device_profiling(global_selection) and internal with_device_impl(..., enable_profiling); command queue is created with profiling enabled when requested.
GPU Kernel Program Builder
crates/erg-vanity-gpu/src/kernel.rs
Added BENCH source constant and GpuProgram::bench(ctx) to compile a benchmarking program by composing required kernel sources (bench.cl + dependencies).
GPU Module Export
crates/erg-vanity-gpu/src/lib.rs
Exported new bench module (pub mod bench;) to expose the benchmarking API.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI (main.rs)
    participant BenchMod as Bench Runner (bench.rs)
    participant GpuCtx as GPU Context (context.rs)
    participant KernProg as Kernel Program (kernel.rs)
    participant OCL as OpenCL Runtime

    User->>CLI: run with --bench and args
    CLI->>CLI: parse args -> BenchConfig
    CLI->>CLI: parse device list

    loop per selected device
        CLI->>BenchMod: run_bench_on_device(device_idx, cfg)

        BenchMod->>GpuCtx: with_device_profiling(device_idx)
        GpuCtx->>OCL: create context & profiling-enabled queue
        OCL-->>GpuCtx: context ready
        GpuCtx-->>BenchMod: GpuContext

        BenchMod->>KernProg: bench(ctx)
        KernProg->>OCL: compile bench.cl + deps
        OCL-->>KernProg: program built
        KernProg-->>BenchMod: GpuProgram

        BenchMod->>OCL: allocate/upload buffers (salt, wordlist, checksums)
        loop for each kernel (pbkdf2, bip32, secp256k1, base58)
            BenchMod->>OCL: run warmup iterations
            OCL-->>BenchMod: complete
            loop timed iterations
                BenchMod->>OCL: enqueue kernel (profiling)
                OCL-->>BenchMod: profiling timestamps
                BenchMod->>BenchMod: extract ns, accumulate ComponentStats
            end
            opt validate
                BenchMod->>BenchMod: validate checksums
            end
        end
        BenchMod-->>CLI: return DeviceBenchStats
    end

    CLI->>BenchMod: print_bench_results(results, cfg)
    BenchMod->>User: formatted per-device & combined tables
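In code terms, the flow above composes roughly as follows. This is a hedged Rust sketch: run_bench_on_device and print_bench_results are named in the walkthrough, but the crate path, signatures, and error handling shown here are assumptions.

// Hypothetical glue code mirroring the sequence diagram; the real main.rs
// may differ in signatures and error handling.
use erg_vanity_gpu::bench::{print_bench_results, run_bench_on_device, BenchConfig};

fn run_benchmarks(device_indices: &[usize]) {
    // BenchConfig has documented defaults; its field names are not shown
    // in this summary, so the sketch relies on Default rather than guessing.
    let cfg = BenchConfig::default();

    let mut results = Vec::new();
    for &idx in device_indices {
        // Each device gets its own profiling-enabled context, a freshly
        // built bench program, and warmup + timed iterations per kernel.
        match run_bench_on_device(idx, &cfg) {
            Ok(stats) => results.push(stats),
            Err(e) => eprintln!("device {idx}: benchmark failed: {e}"),
        }
    }

    // Per-device tables plus the combined multi-GPU summary.
    print_bench_results(&results, &cfg);
}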

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels: codex

Poem

🐰 I hopped on silicon light,
Benchmarks humming through the night,
Kernels clack and timers sing,
Checksums dance—results I bring,
Tiny paws applaud the flight.

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check (Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): the title 'Add GPU microbenchmark mode with per-component timing' accurately summarizes the main objective of the pull request, which adds benchmarking functionality to measure individual GPU kernel performance with timing breakdowns.
  • Docstring Coverage (Passed): docstring coverage is 100.00%, which meets the required threshold of 80.00%.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3925f1 and 5e6225c.

📒 Files selected for processing (2)
  • crates/erg-vanity-gpu/kernels/bench.cl
  • crates/erg-vanity-gpu/src/bench.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (13)
crates/erg-vanity-gpu/kernels/bench.cl (6)

10-37: LGTM!

The entropy generation correctly mixes salt, counter, and gid via Blake2b-256, ensuring deterministic and well-distributed output across work items.


39-53: LGTM!

Simple and effective approach to generate a deterministic 64-byte seed from 32-byte entropy using XOR perturbation.


62-87: LGTM!

The kernel correctly exercises the production bip39_entropy_to_seed codepath, and the checksum folding properly distributes bytes across the uint to prevent optimization.
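For reference, the folding technique in host-side Rust (an illustration of the idea, not the kernel's exact code):

// XOR-fold a byte buffer into a u32 so the result depends on every byte;
// the compiler cannot discard the computation that produced the bytes.
fn fold_checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .enumerate()
        .fold(0u32, |acc, (i, &b)| acc ^ ((b as u32) << ((i % 4) * 8)))
}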


96-128: LGTM!

Correctly isolates BIP32 derivation timing by skipping PBKDF2. The sentinel value 0xDEAD0001u on chain derivation failure aids debugging.


138-175: LGTM!

The kernel correctly exercises the production secp256k1 codepath (scalar parsing, generator multiplication, pubkey compression). The per-index key variation via XOR is sufficient for benchmark purposes.
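In Rust terms, the per-index variation amounts to something like the sketch below; which bytes the kernel actually perturbs is not shown here, so the positions are illustrative.

// Derive a distinct private key per address index by XORing the index
// into the tail of a fixed base key: cheap, deterministic, no BIP32 cost.
fn vary_privkey(base: &[u8; 32], addr_idx: u32) -> [u8; 32] {
    let mut key = *base;
    for (i, b) in addr_idx.to_le_bytes().iter().enumerate() {
        key[28 + i] ^= *b;
    }
    key
}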


203-232: LGTM!

The kernel correctly exercises the production Base58 encoding path with Blake2b checksum computation. The output folding prevents compiler optimization.

crates/erg-vanity-gpu/src/bench.rs (7)

12-37: LGTM!

Reasonable default configuration for GPU benchmarking. The defaults provide a good balance between measurement accuracy and execution time.


39-68: LGTM!

Clean data structures for collecting benchmark statistics with appropriate derives.


70-203: LGTM!

Well-structured benchmark runner with efficient buffer reuse and proper separation of warmup from timed iterations. The counter offsets effectively prevent cross-component caching artifacts.


273-282: LGTM!

Clean extraction of profiling timestamps from OpenCL events.
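A minimal sketch of that extraction, assuming the opencl3 crate (the OpenCL binding actually used by bench.rs is not shown in this review):

use opencl3::error_codes::ClError;
use opencl3::event::Event;

// CL_PROFILING_COMMAND_START/END are device-clock timestamps in
// nanoseconds; their difference is the kernel's GPU execution time.
// Zero timestamps usually mean the queue lacked PROFILING_ENABLE.
fn kernel_duration_ns(event: &Event) -> Result<u64, ClError> {
    let start = event.profiling_command_start()?;
    let end = event.profiling_command_end()?;
    Ok(end.saturating_sub(start))
}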


284-335: LGTM!

Solid benchmark timing implementation with proper validation of profiling timestamps and counter variation to prevent caching effects.


337-406: LGTM!

Clear and informative benchmark output with proper sorting by total time to highlight bottlenecks. The count.max(1) guards against division by zero for average calculations.
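The guard in isolation (a trivial sketch; parameter names are not lifted from bench.rs):

// Average nanoseconds per iteration; max(1) turns a zero-count component
// into a harmless 0 average instead of a division-by-zero panic.
fn avg_ns(total_ns: u64, count: u64) -> u64 {
    total_ns / count.max(1)
}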


408-474: LGTM!

Combined statistics aggregation correctly sums across devices and maintains consistent formatting with the per-device output.



chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3925f1b6b


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
crates/erg-vanity-gpu/kernels/bench.cl (1)

119-126: Silent continue on address derivation failure may mask kernel issues.

When bip32_derive_address_index fails, the loop silently continues without contributing to the checksum. This is fine for benchmarking valid paths, but if many derivations fail (e.g., due to a bug), the benchmark would appear faster than reality while producing misleading checksums.

Consider tracking failure counts or using a different sentinel pattern for partial failures if validation accuracy is important.

crates/erg-vanity-gpu/src/kernel.rs (1)

204-258: Consider extracting shared kernel concatenation logic.

The bench() and vanity() functions share ~90% identical code for concatenating kernel sources. While the current approach is clear and explicit, a helper function could reduce maintenance burden.

🔎 Possible refactor approach
fn build_combined_source(final_kernel: &str, final_kernel_name: &str) -> String {
    let mut combined = String::with_capacity(
        sources::SHA256.len()
            + sources::SHA512.len()
            // ... rest of sources ...
            + final_kernel.len()
            + 1024,
    );
    
    combined.push_str("// === sha256.cl ===\n");
    combined.push_str(sources::SHA256);
    // ... shared concatenation ...
    
    combined.push_str(&format!("\n\n// === {} ===\n", final_kernel_name));
    combined.push_str(final_kernel);
    combined.push('\n');
    combined
}
crates/erg-vanity-gpu/src/bench.rs (1)

297-320: Potential overflow in counter calculation for large configurations.

The counter calculation counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64) could overflow for extreme configurations (e.g., very large batch sizes with many iterations). While unlikely in practice with typical benchmark parameters, consider using saturating_mul or checked_mul for defensive coding.

🔎 Defensive overflow handling
         let counter_start =
-            counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64);
+            counter_offset
+                .saturating_add(
+                    ((cfg.warmup as u64) + (iter as u64))
+                        .saturating_mul(cfg.batch_size as u64)
+                );
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd8ffc5 and c3925f1.

📒 Files selected for processing (6)
  • crates/erg-vanity-cli/src/main.rs
  • crates/erg-vanity-gpu/kernels/bench.cl
  • crates/erg-vanity-gpu/src/bench.rs
  • crates/erg-vanity-gpu/src/context.rs
  • crates/erg-vanity-gpu/src/kernel.rs
  • crates/erg-vanity-gpu/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/erg-vanity-gpu/src/lib.rs (1)
crates/erg-vanity-gpu/src/kernel.rs (1)
  • bench (208-258)
crates/erg-vanity-cli/src/main.rs (1)
crates/erg-vanity-gpu/src/bench.rs (2)
  • run_bench_on_device (71-200)
  • print_bench_results (329-343)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (19)
crates/erg-vanity-gpu/kernels/bench.cl (6)

1-9: Well-documented kernel dependencies and purpose.

Clear header documenting required kernel concatenation order and design goal of using exact production codepaths.


13-37: LGTM - Deterministic entropy generation.

Properly mixes gid, counter_start, and salt using Blake2b to produce unique 32-byte entropy per work item. The comment notes this is copied verbatim from vanity.cl, ensuring consistent behavior.


41-53: Seed generation approach is sound.

Generates 64-byte seed by hashing entropy twice with XOR perturbation. This provides deterministic, varied seeds for benchmarking BIP32/secp/base58 without PBKDF2 overhead.


62-87: PBKDF2 benchmark kernel looks correct.

Uses the exact production bip39_entropy_to_seed call and XOR-folds the 64-byte seed into a checksum to prevent dead-code elimination.


153-159: Private key variation approach is reasonable for isolation.

XOR-modifying base_privkey with addr_idx provides unique keys per iteration without BIP32 overhead. This isolates secp256k1 timing from other components.


188-231: Base58 benchmark kernel correctly exercises full encoding path.

Uses ergo_checksum and base58_encode_address matching production behavior. XOR-folding encoded bytes prevents optimization.

crates/erg-vanity-gpu/src/lib.rs (1)

3-3: LGTM - Bench module correctly exposed.

Follows existing module pattern and makes the benchmarking API publicly accessible.

crates/erg-vanity-gpu/src/kernel.rs (1)

21-21: BENCH kernel source correctly embedded.

Follows the existing pattern for kernel source constants.

crates/erg-vanity-gpu/src/context.rs (2)

71-78: Clean API extension for profiling support.

The refactor maintains backward compatibility while adding with_device_profiling() for benchmark mode. Existing callers of with_device() are unaffected.


107-112: Profiling flag correctly applied to command queue.

The PROFILING_ENABLE property is conditionally set based on the enable_profiling parameter, which is required for OpenCL event timestamp extraction in the benchmark runner.

crates/erg-vanity-cli/src/main.rs (2)

62-84: Well-structured CLI arguments for benchmark mode.

Clear argument names with sensible defaults. Having bench_num_indices default to the --index value is a good UX choice for consistency.
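A plausible shape for these flags, assuming clap's derive API; field names mirror the walkthrough, while the defaults and types are guesses (100 iterations and 5 warmup runs match the defaults cited later in this review):

use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Run GPU microbenchmarks and exit instead of searching for patterns.
    #[arg(long)]
    bench: bool,

    /// Timed iterations per kernel.
    #[arg(long, default_value_t = 100)]
    bench_iters: u32,

    /// Warmup launches before timing starts.
    #[arg(long, default_value_t = 5)]
    bench_warmup: u32,

    /// Addresses per seed; when omitted, falls back to the regular --index value.
    #[arg(long)]
    bench_num_indices: Option<u32>,
}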


554-585: Benchmark mode implementation is correct.

The benchmark path correctly:

  • Runs before pattern validation (as intended)
  • Reuses device list parsing
  • Handles errors with appropriate exit codes
  • Aggregates and prints results for all devices
crates/erg-vanity-gpu/src/bench.rs (7)

12-37: BenchConfig with sensible defaults.

Default batch size of 262,144 matches production, and 100 iterations with 5 warmup provides statistically meaningful results.


70-105: Benchmark initialization is well-structured.

  • Profiling-enabled context correctly created
  • Buffer allocation upfront avoids allocation overhead during timing
  • Choosing read_write vs write_only buffer flags based on validation mode is a good optimization

111-161: Kernels built once and reused - good practice.

Building kernels upfront avoids JIT compilation overhead during timed iterations. The uniform signature across all kernels simplifies the benchmarking loop.


164-184: Validation phase provides useful sanity checking.

Running each kernel once with unique counter offsets and checking for degenerate checksums helps detect optimization artifacts that would invalidate benchmark results.
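A hedged sketch of such a degeneracy check; the helper name is illustrative, and the n == 1 special case reflects the fix noted in commit 5e6225c below:

// Reject result buffers that look optimized away: all zeros, or every
// work item producing the identical checksum.
fn checksums_look_valid(checksums: &[u32]) -> bool {
    if checksums.is_empty() {
        return false;
    }
    let all_zero = checksums.iter().all(|&c| c == 0);
    // With one element, "all identical" is trivially true, so only apply
    // that test when there is more than one checksum.
    let all_identical =
        checksums.len() > 1 && checksums.windows(2).all(|w| w[0] == w[1]);
    !(all_zero || all_identical)
}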


283-292: Warmup loop correctly varies counter to avoid caching.

The counter variation during warmup ensures each iteration processes different data, preventing unrealistic cache hit rates.


313-317: Good defensive validation of profiling timestamps.

Checking for zero or invalid timestamps catches cases where profiling isn't properly enabled, providing a clear error message.


345-397: Clear per-device output with useful metrics.

The table format shows total time, percentage breakdown, average per iteration, and per-unit cost (ns/seed or ns/addr) - all valuable for understanding performance characteristics.
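Illustrative only; the exact column layout of print_bench_results is not reproduced here:

// One such row per component, sorted by total time descending.
fn print_row(name: &str, total_ms: f64, percent: f64, avg_ms: f64, ns_per_addr: u64) {
    println!(
        "{name:<12} {total_ms:>10.2} ms {percent:>5.1}% {avg_ms:>9.3} ms/iter {ns_per_addr:>8} ns/addr"
    );
}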

Commit 5e6225c:

- Fix uninitialized byte 32 in bench_base58 base_pubkey array
- Fix validate_checksums to handle batch_size < 16
- Skip all_identical check when n == 1 (trivially true)
- Use cfg.batch_size instead of buffer API for robustness
- Run cargo fmt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
arkadianet merged commit 9f1c9ce into main on Jan 5, 2026
5 checks passed
arkadianet deleted the testing branch on January 5, 2026 at 01:05