
gf256: precompute full multiplication tables #33

Open
mplekh wants to merge 1 commit into itzmeanjan:main from mplekh:gf256-precomputed-mul-table

Conversation

mplekh commented Jan 29, 2026

use table lookup in scalar fallback

Replace per-call GF(256) multiplication with a compile-time generated 256×256 lookup table
and use it in the scalar gf256_mul_vec_by_scalar_then_add_into_vec fallback path.

This removes runtime table construction and per-element GF arithmetic, reducing the hot loop to indexed loads and XORs.
SIMD dispatch remains unchanged; SIMD backends continue to use their specialized implementations.
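
For illustration, here is a minimal sketch of the approach: a const-generated full product table plus the scalar fallback loop. It assumes the AES reduction polynomial 0x11b (the one GF2p8 SIMD instructions use); the actual table generation and function signature in the PR may differ.

```rust
// Sketch only: compile-time 256x256 GF(256) product table over the
// assumed polynomial x^8 + x^4 + x^3 + x + 1 (0x11b).

/// Shift-and-reduce GF(2^8) multiply, evaluable in const context.
const fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc: u8 = 0;
    let mut i = 0;
    while i < 8 {
        if b & 1 != 0 {
            acc ^= a;
        }
        let carry = (a & 0x80) != 0;
        a <<= 1;
        if carry {
            a ^= 0x1b; // fold the dropped x^8 term back in (0x11b without its x^8 bit)
        }
        b >>= 1;
        i += 1;
    }
    acc
}

/// Full 64 kB product table, built once at compile time.
static GF256_MUL_TABLE: [[u8; 256]; 256] = {
    let mut t = [[0u8; 256]; 256];
    let mut a = 0;
    while a < 256 {
        let mut b = 0;
        while b < 256 {
            t[a][b] = gf256_mul(a as u8, b as u8);
            b += 1;
        }
        a += 1;
    }
    t
};

/// Scalar fallback: dst[i] ^= scalar * src[i]. One indexed load plus one
/// XOR per byte; the selected 256-byte row stays resident in L1.
fn gf256_mul_vec_by_scalar_then_add_into_vec(dst: &mut [u8], src: &[u8], scalar: u8) {
    let row = &GF256_MUL_TABLE[scalar as usize];
    for (d, &s) in dst.iter_mut().zip(src.iter()) {
        *d ^= row[s as usize];
    }
}
```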

Performance impact:
Significant improvements on non-SIMD targets (e.g. WASM, legacy CPUs):
AMD Phenom II X6:
encode/1.0MB/16-pieces time: [1.1116 ms 1.1245 ms 1.1343 ms]
thrpt: [936.70 MiB/s 944.93 MiB/s 955.83 MiB/s]
change:
time: [−57.581% −56.812% −56.132%] (p = 0.00 < 0.05)
thrpt: [+127.96% +131.55% +135.74%]
Performance has improved.

WASM (Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz):
cargo bench --target wasm32-wasip1 --bench full_rlnc_encoder
before:
encode/1.0MB/16-pieces time: [2.1627 ms 2.1706 ms 2.1798 ms]
thrpt: [487.44 MiB/s 489.52 MiB/s 491.29 MiB/s]
encode/1.0MB/32-pieces time: [2.3975 ms 2.5320 ms 2.6811 ms]
thrpt: [384.66 MiB/s 407.31 MiB/s 430.16 MiB/s]
after:
encode/1.0MB/16-pieces time: [1.1017 ms 1.1120 ms 1.1242 ms]
thrpt: [945.11 MiB/s 955.56 MiB/s 964.42 MiB/s]
encode/1.0MB/32-pieces time: [1.1202 ms 1.1278 ms 1.1372 ms]
thrpt: [906.85 MiB/s 914.45 MiB/s 920.67 MiB/s]

Commit: gf256: precompute full multiplication tables, use table lookup in scalar fallback

Replace per-call GF(256) multiplication with a compile-time generated 256×256 lookup table and use it in the scalar gf256_mul_vec_by_scalar_then_add_into_vec fallback path.

This removes runtime table construction and per-element GF arithmetic, reducing the hot loop to indexed loads and XORs. SIMD dispatch remains unchanged; SIMD backends continue to use their specialized implementations.

Performance impact:
Significant improvements on non-SIMD targets (e.g. WASM, legacy CPUs):

Benchmark on AMD Phenom II X6 (encode, 1 MB):
16 pieces: −56–58% time, +128–136% throughput
32 pieces: −55–56% time, +123–126% throughput
64 pieces: −55–56% time, +124–130% throughput
itzmeanjan (Owner) commented

Hello @mplekh, thanks for the PR.

I see we are not doing GF2p8 multiplication in the non-SIMD case; we are actually looking up from two tables, and those tables are much smaller. With your optimization we need to store an extra 64 kB of constants. Anyway, good speedup.
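
For context, a minimal sketch of one common two-table scheme (log/exp tables, again assuming polynomial 0x11b); the actual pair of tables in the repo may be laid out differently:

```rust
// Sketch only: log/exp tables for GF(256), ~0.75 kB total versus 64 kB.

/// Multiply by x in GF(2^8), reducing by the assumed polynomial 0x11b.
const fn xtime(a: u8) -> u8 {
    let r = a << 1;
    if a & 0x80 != 0 { r ^ 0x1b } else { r }
}

/// log[v] = discrete log of v base 0x03 (log[0] is unused);
/// exp holds two periods so index sums never need a modulo.
const fn build_tables() -> ([u8; 256], [u8; 510]) {
    let mut log = [0u8; 256];
    let mut exp = [0u8; 510];
    let mut x: u8 = 1;
    let mut i = 0;
    while i < 255 {
        exp[i] = x;
        exp[i + 255] = x;
        log[x as usize] = i as u8;
        x ^= xtime(x); // x *= 0x03, a generator of the multiplicative group
        i += 1;
    }
    (log, exp)
}

static TABLES: ([u8; 256], [u8; 510]) = build_tables();

fn gf256_mul_log_exp(a: u8, b: u8) -> u8 {
    if a == 0 || b == 0 {
        return 0;
    }
    // Two dependent loads per multiply, versus one with the full table.
    TABLES.1[TABLES.0[a as usize] as usize + TABLES.0[b as usize] as usize]
}
```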

How did you do this benchmark evaluation? I'm curious to check it myself.

mplekh (Author) commented Jan 29, 2026

Hi Anjan,
I benchmarked on an old PC (AMD Phenom II X6, no SIMD) by running "make bench" on the clean repo and again after applying this change. For WASM, I ran the benchmark on a newer PC with SIMD just to make sure WASM does not benefit from the SIMD specializations and uses the fallback; the command used was "cargo bench --target wasm32-wasip1 --bench full_rlnc_encoder". The results are very similar to the benchmark on the legacy CPU.
Valgrind instruction counts also give insight into the performance change; for profiling I used "valgrind --tool=callgrind --dump-instr=yes ./target/optimized/examples/full_rlnc". Before the change: 22M IR; after: 12M.
I'll look into the SIMD half-order tables. Currently I assume they will need two loads per byte, while the full table needs only one load, and it is cache-local: only 256 bytes per call are touched, which fits comfortably in L1.
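
For reference, this is the shape of the nibble-split ("half") table scheme behind my two-loads-per-byte estimate; table names and layout here are hypothetical:

```rust
// Sketch only: GF(256) multiplication is linear over GF(2), so
// s * x = s * (x & 0x0f) ^ s * (x & 0xf0). Two 256x16 tables (8 kB total)
// then replace the 64 kB full table at the cost of a second load per byte.
fn mul_via_nibble_tables(
    mul_lo: &[[u8; 16]; 256], // mul_lo[s][n] = s * n        (n = low nibble)
    mul_hi: &[[u8; 16]; 256], // mul_hi[s][n] = s * (n << 4) (n = high nibble)
    scalar: u8,
    x: u8,
) -> u8 {
    let s = scalar as usize;
    mul_lo[s][(x & 0x0f) as usize] ^ mul_hi[s][(x >> 4) as usize]
}
```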

