feat(gpu): implement 4-bit windowed scalar multiplication for generator by arkadianet · Pull Request #8 · arkadianet/erg-vanity-gpu

arkadianet · 2026-01-04T23:39:04Z

Replace naive bit-by-bit double-and-add with MSB-first fixed-window method using precomputed generator multiples.

Changes:

- Add G_TABLE[16][24] with precomputed 0·G through 15·G in Jacobian coordinates
- Replace pt_mul_generator() with a windowed implementation
- Add gen_g_table binary to generate and verify the table constants

Performance:

- Overall throughput: 276k → 311k addr/s (+12.6%)

All 30 tests pass including CPU/GPU consistency check.
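
The fixed-window method described above can be sketched on a stand-in group. The following is a minimal Rust illustration, not the kernel code: integers modulo a small prime under addition take the place of secp256k1 Jacobian points, and the names `M`, `G`, `build_table`, `mul_windowed`, and `mul_naive` are hypothetical. It only shows that consuming a 32-byte big-endian scalar one nibble at a time gives the same result as bit-by-bit double-and-add.

```rust
// Stand-in group (assumption for illustration): integers mod M under addition.
// The real kernel operates on secp256k1 points in Jacobian coordinates.
const M: u64 = 1_000_000_007;
const G: u64 = 5; // hypothetical "generator" of the stand-in group

fn add(a: u64, b: u64) -> u64 {
    (a + b) % M
}

fn double(a: u64) -> u64 {
    (2 * a) % M
}

/// Precompute 0*G through 15*G, mirroring the role of G_TABLE.
fn build_table() -> [u64; 16] {
    let mut table = [0u64; 16];
    for i in 1..16 {
        table[i] = add(table[i - 1], G);
    }
    table
}

/// MSB-first fixed-window multiplication with a 4-bit window.
fn mul_windowed(scalar_be: &[u8; 32], table: &[u64; 16]) -> u64 {
    let mut acc = 0u64; // group identity
    for &byte in scalar_be.iter() {
        // High nibble first, then low nibble.
        for nibble in [byte >> 4, byte & 0x0f] {
            for _ in 0..4 {
                acc = double(acc); // acc = 16 * acc after four doublings
            }
            acc = add(acc, table[nibble as usize]);
        }
    }
    acc
}

/// Reference: bit-by-bit double-and-add, MSB first.
fn mul_naive(scalar_be: &[u8; 32]) -> u64 {
    let mut acc = 0u64;
    for &byte in scalar_be.iter() {
        for bit in (0..8).rev() {
            acc = double(acc);
            if (byte >> bit) & 1 == 1 {
                acc = add(acc, G);
            }
        }
    }
    acc
}

fn main() {
    let table = build_table();
    let mut k = [0u8; 32];
    for (i, b) in k.iter_mut().enumerate() {
        *b = (i as u8).wrapping_mul(37).wrapping_add(11);
    }
    assert_eq!(mul_windowed(&k, &table), mul_naive(&k));
    println!("windowed and naive results agree");
}
```

Both routines perform 256 doublings. The windowed version always performs 64 table additions, while bit-by-bit double-and-add performs one addition per set bit, which is about 128 on average for a uniformly random 256-bit scalar.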

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Faster GPU secp256k1 scalar multiplication using precomputed lookup tables for generator operations.
- Added a utility to generate and verify the precomputed table used by GPU kernels.
Documentation
- Added explanatory metadata and comments describing the table layout and generation/verification process.


coderabbitai · 2026-01-04T23:39:15Z

Walkthrough

Adds a precomputed G_TABLE constant storing 16 windowed multiples of the secp256k1 generator in Jacobian coordinates to the GPU kernel, and introduces a new Rust binary tool that generates this table by computing i·G for i = 0..15 and formatting the output for kernel integration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GPU Kernel: Generator Windowing `crates/erg-vanity-gpu/kernels/secp256k1_point.cl` | Added public constant `G_TABLE[16][24]` storing 16 precomputed multiples of generator G in Jacobian form (X, Y, Z each 8 limbs). Replaced the `pt_mul_generator` implementation with a 4-bit windowed algorithm: it converts the scalar to bytes, processes the 32 bytes MSB-first one nibble at a time, doubles the accumulator four times per nibble, and adds the selected precomputed point, using ping-pong buffers. |
| Table Generation Tool `crates/erg-vanity-gpu/src/bin/gen_g_table.rs` | New Rust binary that generates `G_TABLE` by computing i·G (i = 0..15), converting coordinates to the kernel's 8-limb little-endian representation, and printing a C-style static initializer. Includes a `bytes_to_limbs` helper and validation against the kernel's GX/GY constants. |
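
The conversion step summarized in the table can be sketched as follows. This is a guess at the shape of such a helper, not the code from `gen_g_table.rs`: the signature of `bytes_to_limbs`, the `format_row` helper, and the `0x…u` literal style of the printed initializer are all assumptions made for illustration.

```rust
/// Convert a 32-byte big-endian value into eight 32-bit limbs,
/// least-significant limb first (assumed kernel layout).
fn bytes_to_limbs(bytes_be: &[u8; 32]) -> [u32; 8] {
    let mut limbs = [0u32; 8];
    for i in 0..8 {
        // Limb i covers bytes [28 - 4*i, 32 - 4*i) of the big-endian input.
        let start = 28 - 4 * i;
        limbs[i] = u32::from_be_bytes([
            bytes_be[start],
            bytes_be[start + 1],
            bytes_be[start + 2],
            bytes_be[start + 3],
        ]);
    }
    limbs
}

/// Format eight limbs as one brace-delimited row of a C-style initializer.
/// The exact output format of the real tool is not shown in this PR page.
fn format_row(limbs: &[u32; 8]) -> String {
    let parts: Vec<String> = limbs.iter().map(|l| format!("0x{:08x}u", l)).collect();
    format!("{{ {} }}", parts.join(", "))
}

fn main() {
    // Test pattern 0x00, 0x01, ..., 0x1f so the byte positions are easy to follow.
    let mut bytes = [0u8; 32];
    for (i, b) in bytes.iter_mut().enumerate() {
        *b = i as u8;
    }
    let limbs = bytes_to_limbs(&bytes);
    println!("{}", format_row(&limbs));
}
```

With the test pattern above, the last four input bytes land in limb 0 and the first four in limb 7, which is the ordering the walkthrough describes as 8-limb little-endian.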

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through bits and nibbles bright,

Sixteen Gs I carried into night,
Limbs aligned, in Jacobian song,
Ping-pong adds make scalars strong,
A rabbit cheers — the table's right!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing a 4-bit windowed scalar multiplication approach for the generator point, which is the core objective of the PR. |


Replace naive bit-by-bit double-and-add with MSB-first fixed-window method using precomputed generator multiples.

Changes:
- Add G_TABLE[16][24] with precomputed 0*G through 15*G in Jacobian coords
- Replace pt_mul_generator() with windowed implementation
- Add gen_g_table binary to generate/verify table constants

Performance:
- Overall throughput: 276k → 311k addr/s (+12.6%)

All 30 tests pass including CPU/GPU consistency check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

crates/erg-vanity-gpu/kernels/secp256k1_point.cl (1)
317-340: Optional: Consider adding a comment to clarify the ping-pong buffer swap pattern.

The manual pointer swaps achieve zero-copy ping-ponging, which is excellent for performance. However, the swap logic (especially the difference between the loop swaps and the single braced swaps) could benefit from a brief comment explaining that after processing both nibbles, acc returns to its original buffer.

This is a minor readability suggestion—the implementation is correct and efficiently avoids unnecessary copying.
Example clarifying comment
     // Process 32 bytes MSB-first, high nibble then low nibble per byte
+    // Ping-pong buffers: after each nibble (4 doubles + 1 add), pointers swap
+    // After both nibbles per byte, acc returns to original buffer (2 total swaps)
     for (int byte_idx = 0; byte_idx < 32; byte_idx++) {
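
The swap parity mentioned in this nitpick can be illustrated with a small model. The sketch below is an assumption about the pattern, not the kernel's code: every operation writes its result into the other buffer and then flips which buffer is current, so a nibble (4 doubles + 1 add) flips an odd number of times and a full byte flips an even number. Integers modulo a small prime stand in for curve points, and `PingPong`, `step`, `process_byte`, `build_table`, `M`, and `G` are hypothetical names.

```rust
const M: u64 = 1_000_000_007; // stand-in modulus (illustration only)
const G: u64 = 5; // hypothetical stand-in generator

/// Two buffers plus an index saying which one currently holds the accumulator.
struct PingPong {
    bufs: [u64; 2],
    cur: usize,
}

impl PingPong {
    fn new() -> Self {
        PingPong { bufs: [0, 0], cur: 0 }
    }

    fn acc(&self) -> u64 {
        self.bufs[self.cur]
    }

    /// Write f(acc) into the other buffer, then flip; nothing is copied back.
    fn step(&mut self, f: impl Fn(u64) -> u64) {
        let next = 1 - self.cur;
        self.bufs[next] = f(self.bufs[self.cur]);
        self.cur = next;
    }
}

fn build_table() -> [u64; 16] {
    let mut table = [0u64; 16];
    for i in 1..16 {
        table[i] = (table[i - 1] + G) % M;
    }
    table
}

/// One byte = two nibbles = 2 * (4 doubles + 1 add) = 10 flips in this model.
fn process_byte(pp: &mut PingPong, byte: u8, table: &[u64; 16]) {
    for nibble in [byte >> 4, byte & 0x0f] {
        for _ in 0..4 {
            pp.step(|a| (2 * a) % M);
        }
        let t = table[nibble as usize];
        pp.step(|a| (a + t) % M);
    }
}

fn main() {
    let table = build_table();
    let mut pp = PingPong::new();
    for byte in [0x00u8, 0x2a] {
        process_byte(&mut pp, byte, &table);
        // After every full byte the accumulator is back in buffer 0.
        assert_eq!(pp.cur, 0);
    }
    println!("acc = {}", pp.acc());
}
```

In this model the four flips of the doubling loop cancel out, so the net effect per nibble is the single flip after the add, and two nibbles return the accumulator to its starting buffer.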

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 19ac415 and a419894.

📒 Files selected for processing (2)

crates/erg-vanity-gpu/kernels/secp256k1_point.cl
crates/erg-vanity-gpu/src/bin/gen_g_table.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/erg-vanity-gpu/src/bin/gen_g_table.rs

🔇 Additional comments (2)

crates/erg-vanity-gpu/kernels/secp256k1_point.cl (2)

25-110: G_TABLE[1] (generator point) correctly matches GX_BYTES and GY_BYTES constants.

Verified that the precomputed entry for 1*G matches the generator coordinates when converted from big-endian bytes to little-endian 32-bit limbs, confirming correct encoding in the table. The Z-coordinate is correctly set to 1 (affine point in Jacobian representation). The gen_g_table.rs tool exists and was used to generate the table as documented.
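
A check of this kind could look like the sketch below. It assumes the row layout described in the walkthrough (24 limbs per entry: X in limbs 0..8, Y in 8..16, Z in 16..24, each least-significant limb first) and uses the standard secp256k1 generator coordinates for `GX_BYTES` and `GY_BYTES`. The helper names `limbs_from_be`, `one_g_entry`, and `matches_generator` are hypothetical and do not come from the repository.

```rust
// Standard secp256k1 generator coordinates, big-endian.
const GX_BYTES: [u8; 32] = [
    0x79, 0xbe, 0x66, 0x7e, 0xf9, 0xdc, 0xbb, 0xac, 0x55, 0xa0, 0x62, 0x95, 0xce, 0x87, 0x0b, 0x07,
    0x02, 0x9b, 0xfc, 0xdb, 0x2d, 0xce, 0x28, 0xd9, 0x59, 0xf2, 0x81, 0x5b, 0x16, 0xf8, 0x17, 0x98,
];
const GY_BYTES: [u8; 32] = [
    0x48, 0x3a, 0xda, 0x77, 0x26, 0xa3, 0xc4, 0x65, 0x5d, 0xa4, 0xfb, 0xfc, 0x0e, 0x11, 0x08, 0xa8,
    0xfd, 0x17, 0xb4, 0x48, 0xa6, 0x85, 0x54, 0x19, 0x9c, 0x47, 0xd0, 0x8f, 0xfb, 0x10, 0xd4, 0xb8,
];

/// Big-endian bytes to eight 32-bit limbs, least-significant limb first.
fn limbs_from_be(bytes: &[u8; 32]) -> [u32; 8] {
    let mut limbs = [0u32; 8];
    // Walk the 4-byte chunks from the end so limb 0 gets the last chunk.
    for (limb, chunk) in limbs.iter_mut().zip(bytes.chunks_exact(4).rev()) {
        *limb = u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    limbs
}

/// Build the expected table row for 1*G under the assumed X | Y | Z layout.
fn one_g_entry() -> [u32; 24] {
    let mut entry = [0u32; 24];
    entry[0..8].copy_from_slice(&limbs_from_be(&GX_BYTES));
    entry[8..16].copy_from_slice(&limbs_from_be(&GY_BYTES));
    entry[16] = 1; // Z = 1: an affine point expressed in Jacobian form
    entry
}

/// Check that a row encodes the generator with Z = 1.
fn matches_generator(entry: &[u32; 24]) -> bool {
    entry[0..8] == limbs_from_be(&GX_BYTES)[..]
        && entry[8..16] == limbs_from_be(&GY_BYTES)[..]
        && entry[16] == 1
        && entry[17..24].iter().all(|&limb| limb == 0)
}

fn main() {
    let entry = one_g_entry();
    assert!(matches_generator(&entry));
    println!("X limb 0 = 0x{:08x}, X limb 7 = 0x{:08x}", entry[0], entry[7]);
}
```

To apply the same comparison to the real table, the row `G_TABLE[1]` from `secp256k1_point.cl` would be pasted in place of `one_g_entry()`.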

308-344: The byte ordering assumption is correct. The sc_to_bytes function in secp256k1_scalar.cl (lines 159-167) explicitly produces big-endian bytes by mapping limb 0 (LSB) to bytes[28-31] and limb 7 (MSB) to bytes[0-3]. This means pt_mul_generator correctly processes the scalar MSB-first by iterating from byte index 0 to 31, and the windowed multiplication algorithm computes the intended scalar value.
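
The limb-to-byte mapping described here can be mirrored in a few lines. This is a Rust re-statement of the behavior the comment attributes to `sc_to_bytes`, written from that description rather than from the OpenCL source; the `nibbles_msb_first` helper is hypothetical and only shows the order in which a windowed loop would consume the bytes.

```rust
/// Eight 32-bit limbs (limb 0 least significant) to 32 big-endian bytes:
/// limb 0 fills bytes[28..32], limb 7 fills bytes[0..4].
fn sc_to_bytes(limbs: &[u32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..8 {
        let start = 28 - 4 * i;
        out[start..start + 4].copy_from_slice(&limbs[i].to_be_bytes());
    }
    out
}

/// Split big-endian bytes into 64 nibbles, most significant first
/// (high nibble of byte 0, low nibble of byte 0, high nibble of byte 1, ...).
fn nibbles_msb_first(bytes: &[u8; 32]) -> [u8; 64] {
    let mut nibbles = [0u8; 64];
    for (i, &b) in bytes.iter().enumerate() {
        nibbles[2 * i] = b >> 4;
        nibbles[2 * i + 1] = b & 0x0f;
    }
    nibbles
}

fn main() {
    // Scalar with a recognizable top limb and a small bottom limb.
    let mut limbs = [0u32; 8];
    limbs[0] = 0x0000_002a;
    limbs[7] = 0xdead_beef;
    let bytes = sc_to_bytes(&limbs);
    let nibbles = nibbles_msb_first(&bytes);
    println!("first byte = 0x{:02x}, last byte = 0x{:02x}", bytes[0], bytes[31]);
    println!("first nibble = 0x{:x}, last nibble = 0x{:x}", nibbles[0], nibbles[63]);
}
```

Iterating byte index 0 to 31 over this output visits the most significant limb first, which is the property the windowed loop relies on.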

arkadianet force-pushed the feat/windowed-scalar-mul branch from 19ac415 to a419894 on January 4, 2026 23:41

coderabbitai bot reviewed Jan 4, 2026


arkadianet merged commit 1d9896c into main Jan 4, 2026
5 checks passed

arkadianet deleted the feat/windowed-scalar-mul branch January 4, 2026 23:48
