feat(gpu): implement 4-bit windowed scalar multiplication for generator#8
arkadianet merged 1 commit into main
📝 Walkthrough
Adds a precomputed G_TABLE constant storing 16 windowed multiples of the secp256k1 generator in Jacobian coordinates to the GPU kernel, and introduces a new Rust binary tool that generates this table by computing i·G for i = 0..15 and formatting the output for kernel integration.
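To make the table layout concrete, here is a minimal sketch of how one table entry could be formatted for pasting into the kernel. It assumes each `G_TABLE` row is laid out as X, Y, Z with 8 little-endian 32-bit limbs per coordinate (3 × 8 = 24 words, matching `G_TABLE[16][24]`). The function name `format_entry` and the exact output syntax are hypothetical and not taken from `gen_g_table.rs`.

```rust
/// Hypothetical helper: render one Jacobian point as a 24-word initializer row.
/// Assumed layout (not confirmed by the PR): X limbs, then Y limbs, then Z limbs.
fn format_entry(x: &[u32; 8], y: &[u32; 8], z: &[u32; 8]) -> String {
    let limbs: Vec<String> = x
        .iter()
        .chain(y.iter())
        .chain(z.iter())
        .map(|limb| format!("0x{:08x}u", limb)) // fixed-width hex with a `u` suffix
        .collect();
    // Doubled braces are literal braces inside format!.
    format!("    {{ {} }},", limbs.join(", "))
}
```

For an affine point stored in Jacobian form, `z` would be `[1, 0, 0, 0, 0, 0, 0, 0]` under this limb order, which is consistent with the review's note that the Z-coordinate of the 1*G entry is 1.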
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Replace naive bit-by-bit double-and-add with MSB-first fixed-window method using precomputed generator multiples.

Changes:
- Add G_TABLE[16][24] with precomputed 0*G through 15*G in Jacobian coords
- Replace pt_mul_generator() with windowed implementation
- Add gen_g_table binary to generate/verify table constants

Performance:
- Overall throughput: 276k → 311k addr/s (+12.6%)

All 30 tests pass including CPU/GPU consistency check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
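The difference between the two methods can be sketched in a toy setting. The code below is not the kernel code: it replaces secp256k1 points with integers modulo a prime (an additive group where "doubling" is `2a mod M`), and the names `M`, `G`, `build_table`, `mul_generator_windowed`, and `mul_generator_naive` are all illustrative assumptions. What it does show faithfully is the MSB-first 4-bit fixed-window loop: four doublings, then one table lookup and add per nibble.

```rust
// Toy additive group standing in for curve points (assumption for illustration).
const M: u64 = 1_000_000_007;
const G: u64 = 123_456_789;

fn add(a: u64, b: u64) -> u64 { (a + b) % M }
fn double(a: u64) -> u64 { (2 * a) % M }

/// Precompute table[i] = i*G for i = 0..15 (table[0] is the identity).
fn build_table() -> [u64; 16] {
    let mut table = [0u64; 16];
    for i in 1..16 {
        table[i] = add(table[i - 1], G);
    }
    table
}

/// MSB-first 4-bit fixed-window multiplication of G by a big-endian scalar.
fn mul_generator_windowed(scalar_be: &[u8; 32], table: &[u64; 16]) -> u64 {
    let mut acc = 0u64; // identity element
    for &byte in scalar_be.iter() {
        // High nibble first, then low nibble.
        for nibble in [byte >> 4, byte & 0x0f] {
            for _ in 0..4 {
                acc = double(acc); // shift the accumulator by one window
            }
            acc = add(acc, table[nibble as usize]);
        }
    }
    acc
}

/// Reference: bit-by-bit double-and-add, one conditional add per bit.
fn mul_generator_naive(scalar_be: &[u8; 32]) -> u64 {
    let mut acc = 0u64;
    for &byte in scalar_be.iter() {
        for bit in (0..8).rev() {
            acc = double(acc);
            if (byte >> bit) & 1 == 1 {
                acc = add(acc, G);
            }
        }
    }
    acc
}
```

Both loops perform 256 doublings. The windowed version performs exactly 64 table additions, while the naive version performs one addition per set bit, about 128 on average for a random scalar. On real curve points the addition routine must also handle a zero nibble, since `table[0]` is the point at infinity.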
19ac415 to a419894
Actionable comments posted: 0
🧹 Nitpick comments (1)
crates/erg-vanity-gpu/kernels/secp256k1_point.cl (1)
317-340: Optional: Consider adding a comment to clarify the ping-pong buffer swap pattern.
The manual pointer swaps achieve zero-copy ping-ponging, which is excellent for performance. However, the swap logic (especially the difference between the loop swaps and the single braced swaps) could benefit from a brief comment explaining that after processing both nibbles, acc returns to its original buffer.
This is a minor readability suggestion; the implementation is correct and efficiently avoids unnecessary copying.
Example clarifying comment
  // Process 32 bytes MSB-first, high nibble then low nibble per byte
+ // Ping-pong buffers: after each nibble (4 doubles + 1 add), pointers swap
+ // After both nibbles per byte, acc returns to original buffer (2 total swaps)
  for (int byte_idx = 0; byte_idx < 32; byte_idx++) {
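The ping-pong idea itself can be shown outside the kernel. The sketch below is a simplified illustration rather than the kernel's logic: `step` is a hypothetical stand-in for an operation that reads one buffer and writes its result into the other, and `run` swaps the two references after every step instead of copying data back. It confirms the property the suggested comment describes: after an even number of swaps, `acc` refers to its original buffer again.

```rust
/// Hypothetical out-of-place operation: reads `src`, writes the result to `dst`.
fn step(src: &[u32; 4], dst: &mut [u32; 4]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = s.wrapping_mul(2).wrapping_add(1);
    }
}

/// Runs `rounds` steps with ping-pong buffers.
/// Returns the final data and whether `acc` points at its starting buffer.
fn run(rounds: usize) -> ([u32; 4], bool) {
    let mut buf_a = [1u32, 2, 3, 4];
    let mut buf_b = [0u32; 4];
    // Remember where `acc` starts; the raw pointer is only compared, never read.
    let start = &buf_a as *const [u32; 4];

    let mut acc: &mut [u32; 4] = &mut buf_a;
    let mut tmp: &mut [u32; 4] = &mut buf_b;
    for _ in 0..rounds {
        step(acc, tmp); // result lands in the scratch buffer
        std::mem::swap(&mut acc, &mut tmp); // swap references, not contents
    }
    let data = *acc;
    let back_home = std::ptr::eq(&*acc as *const [u32; 4], start);
    (data, back_home)
}
```

With an odd number of steps the result lives in the other buffer, which is why the parity of swaps matters when the caller expects the output at a fixed location.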
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
crates/erg-vanity-gpu/kernels/secp256k1_point.cl
crates/erg-vanity-gpu/src/bin/gen_g_table.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/erg-vanity-gpu/src/bin/gen_g_table.rs
🔇 Additional comments (2)
crates/erg-vanity-gpu/kernels/secp256k1_point.cl (2)
25-110: G_TABLE[1] (generator point) correctly matches GX_BYTES and GY_BYTES constants.
Verified that the precomputed entry for 1*G matches the generator coordinates when converted from big-endian bytes to little-endian 32-bit limbs, confirming correct encoding in the table. The Z-coordinate is correctly set to 1 (affine point in Jacobian representation). The gen_g_table.rs tool exists and was used to generate the table as documented.
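The conversion used in that check can be sketched as follows. This is an illustrative reimplementation, not code from the repository: `be_bytes_to_le_limbs` is a hypothetical name, and it assumes each limb is read as a big-endian 32-bit word with limb 0 taken from the last four bytes. The `GX_BYTES` value below is the standard secp256k1 generator x-coordinate, redeclared here so the example is self-contained.

```rust
/// Standard secp256k1 generator x-coordinate, big-endian.
const GX_BYTES: [u8; 32] = [
    0x79, 0xbe, 0x66, 0x7e, 0xf9, 0xdc, 0xbb, 0xac,
    0x55, 0xa0, 0x62, 0x95, 0xce, 0x87, 0x0b, 0x07,
    0x02, 0x9b, 0xfc, 0xdb, 0x2d, 0xce, 0x28, 0xd9,
    0x59, 0xf2, 0x81, 0x5b, 0x16, 0xf8, 0x17, 0x98,
];

/// Convert 32 big-endian bytes into 8 limbs, least significant limb first.
/// Assumed layout: limb i comes from bytes[28 - 4*i .. 32 - 4*i].
fn be_bytes_to_le_limbs(bytes: &[u8; 32]) -> [u32; 8] {
    let mut limbs = [0u32; 8];
    for i in 0..8 {
        let off = 28 - 4 * i;
        limbs[i] = u32::from_be_bytes([
            bytes[off],
            bytes[off + 1],
            bytes[off + 2],
            bytes[off + 3],
        ]);
    }
    limbs
}
```

Under this layout the first word of the X coordinate of 1*G would be `0x16f81798` and the last `0x79be667e`, which is the kind of comparison the verification above describes.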
308-344: The byte ordering assumption is correct.
The sc_to_bytes function in secp256k1_scalar.cl (lines 159-167) explicitly produces big-endian bytes by mapping limb 0 (LSB) to bytes[28-31] and limb 7 (MSB) to bytes[0-3]. This means pt_mul_generator correctly processes the scalar MSB-first by iterating from byte index 0 to 31, and the windowed multiplication algorithm computes the intended scalar value.
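A small sketch of that mapping, under the assumption that bytes within each limb are also written most significant first: `sc_to_bytes_sketch` mirrors the described behaviour of `sc_to_bytes` but is not the kernel source, and `nibbles_msb_first` is a hypothetical helper showing the order in which a 4-bit windowed loop would consume the scalar.

```rust
/// Limbs are least significant first; output is 32 big-endian bytes.
/// Limb 0 lands in bytes[28..32], limb 7 in bytes[0..4].
fn sc_to_bytes_sketch(limbs: &[u32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..8 {
        let off = 28 - 4 * i;
        out[off..off + 4].copy_from_slice(&limbs[i].to_be_bytes());
    }
    out
}

/// Window order for an MSB-first 4-bit method:
/// byte 0 high nibble, byte 0 low nibble, byte 1 high nibble, and so on.
fn nibbles_msb_first(bytes: &[u8; 32]) -> Vec<u8> {
    bytes.iter().flat_map(|&b| [b >> 4, b & 0x0f]).collect()
}
```

Because the most significant limb ends up at byte index 0, iterating bytes from 0 to 31 yields the 64 windows from most to least significant, which is what the fixed-window loop requires.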