Add GPU microbenchmark mode with per-component timing #6
Add --bench flag to measure individual GPU kernel performance using OpenCL event profiling timestamps:

- PBKDF2: BIP39 seed derivation (2048 iterations)
- BIP32: Key derivation for m/44'/429'/0'/0 path
- secp256k1: Public key generation from private key
- Base58: Address encoding with Blake2b checksum

Features:

- Per-device and combined multi-GPU statistics
- Configurable iterations, warmup, batch size, and num_indices
- --bench-validate flag to verify kernels aren't optimized away
- Uses exact production codepaths for accurate measurements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a GPU benchmarking feature: CLI flags to configure benchmarks, new OpenCL benchmark kernels, a benchmark runner with profiling/validation/reporting, profiling-enabled GPU context creation, and program-building support for benchmark kernels. The CLI can run per-device benchmarks and print aggregated results.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant CLI as CLI (main.rs)
participant BenchMod as Bench Runner (bench.rs)
participant GpuCtx as GPU Context (context.rs)
participant KernProg as Kernel Program (kernel.rs)
participant OCL as OpenCL Runtime
User->>CLI: run with --bench and args
CLI->>CLI: parse args -> BenchConfig
CLI->>CLI: parse device list
loop per selected device
CLI->>BenchMod: run_bench_on_device(device_idx, cfg)
BenchMod->>GpuCtx: with_device_profiling(device_idx)
GpuCtx->>OCL: create context & profiling-enabled queue
OCL-->>GpuCtx: context ready
GpuCtx-->>BenchMod: GpuContext
BenchMod->>KernProg: bench(ctx)
KernProg->>OCL: compile bench.cl + deps
OCL-->>KernProg: program built
KernProg-->>BenchMod: GpuProgram
BenchMod->>OCL: allocate/upload buffers (salt, wordlist, checksums)
loop for each kernel (pbkdf2,bip32,secp256k1,base58)
BenchMod->>OCL: run warmup iterations
OCL-->>BenchMod: complete
loop timed iterations
BenchMod->>OCL: enqueue kernel (profiling)
OCL-->>BenchMod: profiling timestamps
BenchMod->>BenchMod: extract ns, accumulate ComponentStats
end
opt validate
BenchMod->>BenchMod: validate checksums
end
end
BenchMod-->>CLI: return DeviceBenchStats
end
CLI->>BenchMod: print_bench_results(results, cfg)
BenchMod->>User: formatted per-device & combined tables
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3925f1b6b
Actionable comments posted: 0
🧹 Nitpick comments (3)
crates/erg-vanity-gpu/kernels/bench.cl (1)
119-126: Silent `continue` on address derivation failure may mask kernel issues.

When `bip32_derive_address_index` fails, the loop silently continues without contributing to the checksum. This is fine for benchmarking valid paths, but if many derivations fail (e.g., due to a bug), the benchmark would appear faster than reality while producing misleading checksums. Consider tracking failure counts or using a different sentinel pattern for partial failures if validation accuracy is important.
crates/erg-vanity-gpu/src/kernel.rs (1)
204-258: Consider extracting shared kernel concatenation logic.

The `bench()` and `vanity()` functions share ~90% identical code for concatenating kernel sources. While the current approach is clear and explicit, a helper function could reduce maintenance burden.

🔎 Possible refactor approach:

```rust
fn build_combined_source(final_kernel: &str, final_kernel_name: &str) -> String {
    let mut combined = String::with_capacity(
        sources::SHA256.len()
            + sources::SHA512.len()
            // ... rest of sources ...
            + final_kernel.len()
            + 1024,
    );
    combined.push_str("// === sha256.cl ===\n");
    combined.push_str(sources::SHA256);
    // ... shared concatenation ...
    combined.push_str(&format!("\n\n// === {} ===\n", final_kernel_name));
    combined.push_str(final_kernel);
    combined.push('\n');
    combined
}
```

crates/erg-vanity-gpu/src/bench.rs (1)
297-320: Potential overflow in counter calculation for large configurations.

The counter calculation `counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64)` could overflow for extreme configurations (e.g., very large batch sizes with many iterations). While unlikely in practice with typical benchmark parameters, consider using `saturating_mul` or `checked_mul` for defensive coding.

🔎 Defensive overflow handling:

```diff
 let counter_start =
-    counter_offset + ((cfg.warmup as u64) + (iter as u64)) * (cfg.batch_size as u64);
+    counter_offset
+        .saturating_add(
+            ((cfg.warmup as u64) + (iter as u64))
+                .saturating_mul(cfg.batch_size as u64)
+        );
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- crates/erg-vanity-cli/src/main.rs
- crates/erg-vanity-gpu/kernels/bench.cl
- crates/erg-vanity-gpu/src/bench.rs
- crates/erg-vanity-gpu/src/context.rs
- crates/erg-vanity-gpu/src/kernel.rs
- crates/erg-vanity-gpu/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/erg-vanity-gpu/src/lib.rs (1)
crates/erg-vanity-gpu/src/kernel.rs (1)
`bench` (208-258)
crates/erg-vanity-cli/src/main.rs (1)
crates/erg-vanity-gpu/src/bench.rs (2)
`run_bench_on_device` (71-200), `print_bench_results` (329-343)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (19)
crates/erg-vanity-gpu/kernels/bench.cl (6)
1-9: Well-documented kernel dependencies and purpose. Clear header documenting the required kernel concatenation order and the design goal of using exact production codepaths.
13-37: LGTM - Deterministic entropy generation. Properly mixes `gid`, `counter_start`, and `salt` using Blake2b to produce unique 32-byte entropy per work item. The comment notes this is copied verbatim from `vanity.cl`, ensuring consistent behavior.
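For readers unfamiliar with the pattern, here is a host-side Rust sketch of the same mixing idea, assuming the `blake2` crate; the function name and byte layout are illustrative, and the real logic lives in the OpenCL kernel:

```rust
// Sketch only: derive per-work-item entropy by hashing (gid, counter_start, salt)
// with Blake2b-256, so every work item in a batch exercises distinct input data.
use blake2::digest::{Update, VariableOutput};
use blake2::Blake2bVar;

fn derive_entropy(gid: u64, counter_start: u64, salt: &[u8; 32]) -> [u8; 32] {
    let mut hasher = Blake2bVar::new(32).expect("32 bytes is a valid Blake2b output size");
    hasher.update(&gid.to_le_bytes());
    hasher.update(&counter_start.to_le_bytes());
    hasher.update(salt);
    let mut entropy = [0u8; 32];
    hasher
        .finalize_variable(&mut entropy)
        .expect("output buffer length matches");
    entropy
}

fn main() {
    let salt = [0u8; 32];
    // Different (gid, counter) pairs yield different entropy.
    assert_ne!(derive_entropy(0, 0, &salt), derive_entropy(1, 0, &salt));
    println!("per-work-item entropy differs as expected");
}
```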
41-53: Seed generation approach is sound. Generates a 64-byte seed by hashing the entropy twice with XOR perturbation. This provides deterministic, varied seeds for benchmarking BIP32/secp/base58 without PBKDF2 overhead.
62-87: PBKDF2 benchmark kernel looks correct. Uses the exact production `bip39_entropy_to_seed` call and XOR-folds the 64-byte seed into a checksum to prevent dead-code elimination.
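The XOR-fold itself is simple but worth spelling out; a minimal Rust sketch of the idea (illustrative only — the kernel does this in OpenCL C):

```rust
// Illustrative XOR-fold: collapse a 64-byte seed into a u32 so the result
// depends on every byte, which keeps the compiler from eliding the work.
fn xor_fold_seed(seed: &[u8; 64]) -> u32 {
    seed.chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .fold(0u32, |acc, word| acc ^ word)
}

fn main() {
    let mut seed = [0u8; 64];
    seed[0] = 0xAB;
    // Any change to the seed changes the checksum, so the kernel output
    // cannot be optimized away as dead code.
    assert_ne!(xor_fold_seed(&seed), xor_fold_seed(&[0u8; 64]));
}
```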
153-159: Private key variation approach is reasonable for isolation. XOR-modifying `base_privkey` with `addr_idx` provides unique keys per iteration without BIP32 overhead. This isolates secp256k1 timing from other components.
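A minimal Rust sketch of that variation idea (the exact byte offsets and names are assumptions, not the kernel's code):

```rust
// Illustrative: derive a distinct private key per address index by XOR-ing the
// index into the base key, so each timed iteration does real, different work
// without paying for BIP32 derivation.
fn vary_privkey(base: &[u8; 32], addr_idx: u32) -> [u8; 32] {
    let mut key = *base;
    for (i, b) in addr_idx.to_le_bytes().iter().enumerate() {
        key[28 + i] ^= b; // perturb the low bytes; exact offset is illustrative
    }
    key
}

fn main() {
    let base = [0x11u8; 32];
    assert_ne!(vary_privkey(&base, 1), vary_privkey(&base, 2));
}
```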
188-231: Base58 benchmark kernel correctly exercises the full encoding path. Uses `ergo_checksum` and `base58_encode_address`, matching production behavior. XOR-folding the encoded bytes prevents optimization.
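For reference, a host-side Rust sketch of the address layout being exercised, assuming the standard Ergo mainnet P2PK format (prefix byte `0x01`, 33-byte compressed key, 4-byte Blake2b-256 checksum) and the `blake2`/`bs58` crates; the kernel's `ergo_checksum` and `base58_encode_address` remain the authoritative implementation:

```rust
// Sketch of the address layout only; names and format details here are stated
// assumptions, not taken from the kernel source.
use blake2::digest::{Update, VariableOutput};
use blake2::Blake2bVar;

fn ergo_p2pk_address(compressed_pubkey: &[u8; 33]) -> String {
    let mut body = Vec::with_capacity(1 + 33 + 4);
    body.push(0x01); // assumed mainnet network id (0x00) + P2PK address type (0x01)
    body.extend_from_slice(compressed_pubkey);

    // Checksum: first 4 bytes of Blake2b-256 over prefix || pubkey.
    let mut hasher = Blake2bVar::new(32).expect("valid output size");
    hasher.update(&body);
    let mut digest = [0u8; 32];
    hasher.finalize_variable(&mut digest).expect("length matches");
    body.extend_from_slice(&digest[..4]);

    bs58::encode(body).into_string()
}

fn main() {
    let pubkey = [0x02u8; 33]; // placeholder compressed public key
    println!("{}", ergo_p2pk_address(&pubkey));
}
```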
crates/erg-vanity-gpu/src/lib.rs (1)

3-3: LGTM - Bench module correctly exposed. Follows the existing module pattern and makes the benchmarking API publicly accessible.
crates/erg-vanity-gpu/src/kernel.rs (1)
21-21: `BENCH` kernel source correctly embedded. Follows the existing pattern for kernel source constants.
crates/erg-vanity-gpu/src/context.rs (2)
71-78: Clean API extension for profiling support. The refactor maintains backward compatibility while adding `with_device_profiling()` for benchmark mode. Existing callers of `with_device()` are unaffected.
107-112: Profiling flag correctly applied to the command queue. The `PROFILING_ENABLE` property is conditionally set based on the `enable_profiling` parameter, which is required for OpenCL event timestamp extraction in the benchmark runner.
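A minimal sketch of the backward-compatible layering described in these two comments; the struct body, error type, and inner constructor are placeholders, and only the `with_device`/`with_device_profiling` names come from the diff:

```rust
// Illustrative only: with_device() keeps its old behaviour, while
// with_device_profiling() threads the flag through to where the command queue
// is created. The real code sets PROFILING_ENABLE on the OpenCL queue when
// the flag is true.
struct GpuContext {
    device_idx: usize,
    profiling: bool,
}

impl GpuContext {
    /// Existing entry point: unchanged behaviour, no profiling.
    fn with_device(device_idx: usize) -> Result<Self, String> {
        Self::with_device_inner(device_idx, false)
    }

    /// New entry point used by the benchmark runner.
    fn with_device_profiling(device_idx: usize) -> Result<Self, String> {
        Self::with_device_inner(device_idx, true)
    }

    fn with_device_inner(device_idx: usize, enable_profiling: bool) -> Result<Self, String> {
        // In the real implementation this is where the command queue is created,
        // with the profiling property only if enable_profiling is set.
        Ok(Self { device_idx, profiling: enable_profiling })
    }
}

fn main() {
    let bench_ctx = GpuContext::with_device_profiling(0).unwrap();
    let normal_ctx = GpuContext::with_device(0).unwrap();
    assert!(bench_ctx.profiling && !normal_ctx.profiling);
    let _ = (bench_ctx.device_idx, normal_ctx.device_idx);
}
```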
crates/erg-vanity-cli/src/main.rs (2)

62-84: Well-structured CLI arguments for benchmark mode. Clear argument names with sensible defaults. The `bench_num_indices` default falling back to the `--index` value is a good UX choice for consistency.
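The defaulting behaviour could look roughly like this clap sketch; apart from `--bench` and `--index`, the flag and field names are assumptions for illustration and may differ from main.rs:

```rust
// Sketch only: bench_num_indices falls back to the regular --index value when
// it is not given explicitly.
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Run GPU microbenchmarks instead of searching.
    #[arg(long)]
    bench: bool,

    /// Address index used by the normal search path.
    #[arg(long, default_value_t = 1)]
    index: u32,

    /// Indices per seed used only by the benchmark; defaults to --index.
    #[arg(long)]
    bench_num_indices: Option<u32>,
}

fn main() {
    let cli = Cli::parse();
    let bench_num_indices = cli.bench_num_indices.unwrap_or(cli.index);
    if cli.bench {
        println!("benchmarking with {bench_num_indices} indices per seed");
    }
}
```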
554-585: Benchmark mode implementation is correct. The benchmark path correctly (a minimal sketch follows this list):
- Runs before pattern validation (as intended)
- Reuses device list parsing
- Handles errors with appropriate exit codes
- Aggregates and prints results for all devices
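Put together, the control flow reads roughly like the sketch below; only `run_bench_on_device` and `print_bench_results` are real entry points from `erg_vanity_gpu::bench`, and everything else (types, error handling, exit code) is simplified:

```rust
// Illustrative control flow for the --bench path in main.rs.
#[derive(Default)]
struct BenchConfig; // fields elided here; see the fuller sketch under the BenchConfig comment below

struct DeviceBenchStats;

fn run_bench_on_device(_device_idx: usize, _cfg: &BenchConfig) -> Result<DeviceBenchStats, String> {
    Ok(DeviceBenchStats) // stand-in for the real GPU benchmark runner in bench.rs
}

fn print_bench_results(results: &[DeviceBenchStats], _cfg: &BenchConfig) {
    println!("{} device(s) benchmarked", results.len());
}

fn main() {
    let cfg = BenchConfig::default();
    let devices = vec![0usize]; // reuses the same device-list parsing as search mode
    let mut results = Vec::new();

    // The bench path runs before pattern validation, so no vanity pattern is required.
    for device_idx in devices {
        match run_bench_on_device(device_idx, &cfg) {
            Ok(stats) => results.push(stats),
            Err(e) => {
                eprintln!("benchmark failed on device {device_idx}: {e}");
                std::process::exit(1); // non-zero exit code on failure
            }
        }
    }
    print_bench_results(&results, &cfg);
}
```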
crates/erg-vanity-gpu/src/bench.rs (7)
12-37: `BenchConfig` with sensible defaults. The default batch size of 262,144 matches production, and 100 iterations with 5 warmup provides statistically meaningful results.
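Those defaults would look roughly like this; the field names are assumptions, and only the values come from the review:

```rust
// Sketch of the default benchmark parameters mentioned above. Field names are
// illustrative; the real struct lives in bench.rs.
struct BenchConfig {
    iterations: u32,
    warmup: u32,
    batch_size: u32,
    num_indices: u32,
    validate: bool,
}

impl Default for BenchConfig {
    fn default() -> Self {
        Self {
            iterations: 100,     // timed iterations per kernel
            warmup: 5,           // untimed warmup runs
            batch_size: 262_144, // matches the production dispatch size
            num_indices: 1,      // address indices derived per seed (assumed default)
            validate: false,     // --bench-validate turns this on
        }
    }
}

fn main() {
    let cfg = BenchConfig::default();
    println!(
        "iterations={} warmup={} batch_size={} num_indices={} validate={}",
        cfg.iterations, cfg.warmup, cfg.batch_size, cfg.num_indices, cfg.validate
    );
}
```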
70-105: Benchmark initialization is well-structured.
- Profiling-enabled context correctly created
- Buffer allocation upfront avoids allocation overhead during timing
- Conditional read_write vs write_only flags based on validation mode is a good optimization
111-161: Kernels built once and reused - good practice. Building kernels upfront avoids JIT compilation overhead during timed iterations. The uniform signature across all kernels simplifies the benchmarking loop.
164-184: Validation phase provides useful sanity checking. Running each kernel once with unique counter offsets and checking for degenerate checksums helps detect optimization artifacts that would invalidate benchmark results.
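A minimal sketch of the degenerate-checksum check being described (the real `validate_checksums` in `bench.rs` may sample and report differently):

```rust
// Illustrative validation: reject results that look like the kernel was
// optimized away (all-zero or all-identical checksums). Mirrors the intent,
// not the exact implementation, of the check in bench.rs.
fn validate_checksums(checksums: &[u32]) -> Result<(), String> {
    if checksums.is_empty() {
        return Err("no checksums to validate".into());
    }
    if checksums.iter().all(|&c| c == 0) {
        return Err("all checksums are zero; kernel output may be optimized away".into());
    }
    // With a single work item, "all identical" is trivially true, so skip it.
    if checksums.len() > 1 && checksums.windows(2).all(|w| w[0] == w[1]) {
        return Err("all checksums identical; work items may not be varying".into());
    }
    Ok(())
}

fn main() {
    assert!(validate_checksums(&[1, 2, 3]).is_ok());
    assert!(validate_checksums(&[0, 0, 0]).is_err());
    assert!(validate_checksums(&[7]).is_ok()); // n == 1: identical check skipped
}
```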
283-292: Warmup loop correctly varies the counter to avoid caching. The counter variation during warmup ensures each iteration processes different data, preventing unrealistic cache hit rates.
313-317: Good defensive validation of profiling timestamps. Checking for zero or invalid timestamps catches cases where profiling isn't properly enabled, providing a clear error message.
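For reference, turning the `CL_PROFILING_COMMAND_START`/`CL_PROFILING_COMMAND_END` timestamps into an elapsed time with the same defensive checks might look like this crate-agnostic sketch (timestamps are in nanoseconds):

```rust
// Illustrative: convert OpenCL profiling timestamps into an elapsed duration,
// rejecting the degenerate values the review mentions.
fn kernel_elapsed_ns(start_ns: u64, end_ns: u64) -> Result<u64, String> {
    if start_ns == 0 || end_ns == 0 {
        return Err("zero profiling timestamp; was the queue created with PROFILING_ENABLE?".into());
    }
    if end_ns < start_ns {
        return Err("end timestamp precedes start; profiling data is invalid".into());
    }
    Ok(end_ns - start_ns)
}

fn main() {
    assert_eq!(kernel_elapsed_ns(1_000, 4_500), Ok(3_500));
    assert!(kernel_elapsed_ns(0, 4_500).is_err());
}
```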
345-397: Clear per-device output with useful metrics. The table format shows total time, percentage breakdown, average per iteration, and per-unit cost (ns/seed or ns/addr) - all valuable for understanding performance characteristics.
- Fix uninitialized byte 32 in bench_base58 base_pubkey array
- Fix validate_checksums to handle batch_size < 16
- Skip all_identical check when n == 1 (trivially true)
- Use cfg.batch_size instead of buffer API for robustness
- Run cargo fmt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>