gf256: precompute full multiplication tables #33
mplekh wants to merge 1 commit into itzmeanjan:main
…scalar fallback

Replace per-call GF(256) multiplication with a compile-time generated 256×256 lookup table and use it in the scalar gf256_mul_vec_by_scalar_then_add_into_vec fallback path. This removes runtime table construction and per-element GF arithmetic, reducing the hot loop to indexed loads and XORs.

SIMD dispatch remains unchanged; SIMD backends continue to use their specialized implementations.

Performance impact: significant improvements on non-SIMD targets (e.g. WASM, legacy CPUs).

Benchmark on AMD Phenom II X6 (encode, 1 MB):
16 pieces: −56–58% time, +128–136% throughput
32 pieces: −55–56% time, +123–126% throughput
64 pieces: −55–56% time, +124–130% throughput
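For readers who want to see the idea in code, here is a minimal sketch of a compile-time 256×256 product table and a scalar loop that uses it. It is not the code from this PR: the reduction polynomial (x^8 + x^4 + x^3 + x + 1), the table name `MUL_TABLE`, and the function name `mul_vec_by_scalar_then_add` are all assumptions made for illustration, and the crate may build its table differently.

```rust
/// Reduction constant for x^8 + x^4 + x^3 + x + 1 (0x11B with the x^8 bit dropped).
/// Assumption for this sketch; the crate may use a different irreducible polynomial.
const REDUCTION: u8 = 0x1B;

/// Builds the full 256x256 product table during constant evaluation.
/// Uses a*b = x*(a*(b >> 1)) + (b & 1)*a, so each entry needs one earlier entry.
const fn build_mul_table() -> [[u8; 256]; 256] {
    let mut table = [[0u8; 256]; 256];
    let mut a = 0usize;
    while a < 256 {
        // Column 0 stays zero because a * 0 = 0.
        let mut b = 1usize;
        while b < 256 {
            let half = table[a][b >> 1];
            // Multiply by x: shift left, then reduce if the x^7 term was set.
            let mut doubled = half << 1;
            if half & 0x80 != 0 {
                doubled ^= REDUCTION;
            }
            // Add `a` when the lowest bit of `b` is set.
            table[a][b] = if b & 1 != 0 { doubled ^ (a as u8) } else { doubled };
            b += 1;
        }
        a += 1;
    }
    table
}

/// 256 * 256 = 65,536 bytes of read-only data, computed at compile time.
static MUL_TABLE: [[u8; 256]; 256] = build_mul_table();

/// Hypothetical scalar fallback: dst[i] ^= scalar * src[i] for every byte pair.
pub fn mul_vec_by_scalar_then_add(dst: &mut [u8], src: &[u8], scalar: u8) {
    // Select the 256-byte row for this scalar once, outside the hot loop.
    let row = &MUL_TABLE[scalar as usize];
    for (d, &s) in dst.iter_mut().zip(src) {
        *d ^= row[s as usize];
    }
}
```

Selecting the row once means the inner loop touches only 256 bytes of the table per call, and each element costs one indexed load and one XOR, which matches the description above.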
Hello @mplekh, thanks for the PR. As far as I can see, the non-SIMD path does not actually perform a GF(2^8) multiplication; it looks values up in two tables, and those tables are much smaller. Your optimization needs an extra 64 kB of storage for constants. Still, it is a good speedup. How did you run this benchmark? I'm curious to check it myself.
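To make the size trade-off concrete, here is a sketch of one common way to multiply by a fixed scalar using two small tables: split each byte into its low and high nibble and look each half up in a 16-entry table. This is an assumption about what a "two tables" scheme can look like, not a confirmed description of the crate's existing code; the function names and the polynomial (x^8 + x^4 + x^3 + x + 1) are hypothetical.

```rust
/// Bitwise ("Russian peasant") multiplication in GF(2^8).
/// Assumes the polynomial x^8 + x^4 + x^3 + x + 1; the crate's choice may differ.
fn gf256_mul_bitwise(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            acc ^= a;
        }
        // Multiply `a` by x and reduce when the x^7 term was set.
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1B;
        }
        b >>= 1;
    }
    acc
}

/// Builds two 16-entry tables for one scalar: its products with every low
/// nibble and with every high nibble. Together they occupy 32 bytes.
fn nibble_tables(scalar: u8) -> ([u8; 16], [u8; 16]) {
    let mut lo = [0u8; 16];
    let mut hi = [0u8; 16];
    for i in 0..16u8 {
        lo[i as usize] = gf256_mul_bitwise(scalar, i);
        hi[i as usize] = gf256_mul_bitwise(scalar, i << 4);
    }
    (lo, hi)
}

/// dst[i] ^= scalar * src[i], using scalar * x = lo[x & 0x0F] ^ hi[x >> 4].
/// The identity holds because multiplication distributes over XOR.
pub fn mul_add_with_nibble_tables(dst: &mut [u8], src: &[u8], scalar: u8) {
    let (lo, hi) = nibble_tables(scalar);
    for (d, &s) in dst.iter_mut().zip(src) {
        *d ^= lo[(s & 0x0F) as usize] ^ hi[(s >> 4) as usize];
    }
}
```

Under these assumptions the per-scalar tables take 32 bytes but must be rebuilt for each scalar and cost two loads per element, while a full 256×256 table takes 256 × 256 = 65,536 bytes once and costs a single load per element.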
Hi Anjan,
use table lookup in scalar fallback
Replace per-call GF(256) multiplication with a compile-time generated 256×256 lookup table
and use it in the scalar gf256_mul_vec_by_scalar_then_add_into_vec fallback path.
This removes runtime table construction and per-element GF arithmetic, reducing the hot loop to indexed loads and XORs.
SIMD dispatch remains unchanged; SIMD backends continue to use their specialized implementations.
Performance impact:
Significant improvements on non-SIMD targets (e.g. WASM, legacy CPUs):
AMD Phenom II X6:
encode/1.0MB/16-pieces time: [1.1116 ms 1.1245 ms 1.1343 ms]
thrpt: [936.70 MiB/s 944.93 MiB/s 955.83 MiB/s]
change:
time: [−57.581% −56.812% −56.132%] (p = 0.00 < 0.05)
thrpt: [+127.96% +131.55% +135.74%]
Performance has improved.
WASM (Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz):
cargo bench --target wasm32-wasip1 --bench full_rlnc_encoder
before:
encode/1.0MB/16-pieces time: [2.1627 ms 2.1706 ms 2.1798 ms]
thrpt: [487.44 MiB/s 489.52 MiB/s 491.29 MiB/s]
encode/1.0MB/32-pieces time: [2.3975 ms 2.5320 ms 2.6811 ms]
thrpt: [384.66 MiB/s 407.31 MiB/s 430.16 MiB/s]
after:
encode/1.0MB/16-pieces time: [1.1017 ms 1.1120 ms 1.1242 ms]
thrpt: [945.11 MiB/s 955.56 MiB/s 964.42 MiB/s]
encode/1.0MB/32-pieces time: [1.1202 ms 1.1278 ms 1.1372 ms]
thrpt: [906.85 MiB/s 914.45 MiB/s 920.67 MiB/s]
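The WASM numbers are given as raw before/after figures without a "change" line. The small sketch below shows how a relative time change can be computed from two timings, and how a time change maps to a throughput change for a fixed amount of data. The helper names are hypothetical and this is plain arithmetic, not Criterion's statistical comparison.

```rust
/// Relative change in time, in percent (negative means faster).
pub fn time_change_pct(before_ms: f64, after_ms: f64) -> f64 {
    (after_ms - before_ms) / before_ms * 100.0
}

/// Throughput change, in percent, implied by a time change for the same
/// amount of data: throughput scales with 1 / time.
pub fn throughput_change_pct(time_pct: f64) -> f64 {
    (1.0 / (1.0 + time_pct / 100.0) - 1.0) * 100.0
}
```

Applied to the median estimates above, `time_change_pct(2.1706, 1.1120)` gives roughly −49% for 16 pieces and `time_change_pct(2.5320, 1.1278)` gives roughly −55% for 32 pieces. As a cross-check against the Phenom II output, `throughput_change_pct(-56.812)` comes out at about +131.5%, in line with the reported +131.55%.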