Skip to content

perf: various optimizations to eliminate branch misprediction in hash_utils#20168

Draft
notashes wants to merge 2 commits intoapache:mainfrom
notashes:feat/brunch-prediction
Draft

perf: various optimizations to eliminate branch misprediction in hash_utils#20168
notashes wants to merge 2 commits intoapache:mainfrom
notashes:feat/brunch-prediction

Conversation

@notashes
Copy link

@notashes notashes commented Feb 5, 2026

Which issue does this PR close?

Rationale for this change

Compile time monomorphization helps bring rehash outside the hot loop where it's not required.

What changes are included in this PR?

Currently the PR adds a specialized hash_dictionary_inner() function with const generic parameters that check for nulls in keys, values. It also handles specific edge cases of just nulls in keys or values.

Are these changes tested?

There are no additional tests yet. But I will add 'em as I continue. The benchmark results seem promising.
here's cargo bench --bench with_hashes -- dictionary for

origin/main
Gnuplot not found, using plotters backend
Benchmarking dictionary_utf8_int32: single, no nulls
Benchmarking dictionary_utf8_int32: single, no nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: single, no nulls: Collecting 100 samples in estimated 5.0461 s (470k iterations)
Benchmarking dictionary_utf8_int32: single, no nulls: Analyzing
dictionary_utf8_int32: single, no nulls
                        time:   [10.668 µs 10.700 µs 10.734 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Benchmarking dictionary_utf8_int32: single, nulls
Benchmarking dictionary_utf8_int32: single, nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: single, nulls: Collecting 100 samples in estimated 5.0428 s (409k iterations)
Benchmarking dictionary_utf8_int32: single, nulls: Analyzing
dictionary_utf8_int32: single, nulls
                        time:   [12.269 µs 12.293 µs 12.322 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking dictionary_utf8_int32: multiple, no nulls
Benchmarking dictionary_utf8_int32: multiple, no nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: multiple, no nulls: Collecting 100 samples in estimated 5.0864 s (162k iterations)
Benchmarking dictionary_utf8_int32: multiple, no nulls: Analyzing
dictionary_utf8_int32: multiple, no nulls
                        time:   [31.357 µs 31.426 µs 31.506 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking dictionary_utf8_int32: multiple, nulls
Benchmarking dictionary_utf8_int32: multiple, nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: multiple, nulls: Collecting 100 samples in estimated 5.0842 s (141k iterations)
Benchmarking dictionary_utf8_int32: multiple, nulls: Analyzing
dictionary_utf8_int32: multiple, nulls
                        time:   [36.060 µs 36.135 µs 36.220 µs]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe
feat/brunch-prediction
Gnuplot not found, using plotters backend
Benchmarking dictionary_utf8_int32: single, no nulls
Benchmarking dictionary_utf8_int32: single, no nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: single, no nulls: Collecting 100 samples in estimated 5.0176 s (1.1M iterations)
Benchmarking dictionary_utf8_int32: single, no nulls: Analyzing
dictionary_utf8_int32: single, no nulls
                        time:   [4.7186 µs 4.7496 µs 4.7821 µs]
                        change: [−55.829% −55.537% −55.240%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking dictionary_utf8_int32: single, nulls
Benchmarking dictionary_utf8_int32: single, nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: single, nulls: Collecting 100 samples in estimated 5.0295 s (712k iterations)
Benchmarking dictionary_utf8_int32: single, nulls: Analyzing
dictionary_utf8_int32: single, nulls
                        time:   [6.9647 µs 7.0426 µs 7.1281 µs]
                        change: [−43.806% −43.445% −42.993%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

Benchmarking dictionary_utf8_int32: multiple, no nulls
Benchmarking dictionary_utf8_int32: multiple, no nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: multiple, no nulls: Collecting 100 samples in estimated 5.0600 s (348k iterations)
Benchmarking dictionary_utf8_int32: multiple, no nulls: Analyzing
dictionary_utf8_int32: multiple, no nulls
                        time:   [13.365 µs 13.384 µs 13.404 µs]
                        change: [−57.610% −57.464% −57.313%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

Benchmarking dictionary_utf8_int32: multiple, nulls
Benchmarking dictionary_utf8_int32: multiple, nulls: Warming up for 3.0000 s
Benchmarking dictionary_utf8_int32: multiple, nulls: Collecting 100 samples in estimated 5.0569 s (242k iterations)
Benchmarking dictionary_utf8_int32: multiple, nulls: Analyzing
dictionary_utf8_int32: multiple, nulls
                        time:   [20.785 µs 20.962 µs 21.173 µs]
                        change: [−42.370% −42.001% −41.579%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  14 (14.00%) high severe

Are there any user-facing changes?

@notashes notashes marked this pull request as draft February 5, 2026 12:58
@github-actions github-actions bot added the common Related to common crate label Feb 5, 2026
@adriangb
Copy link
Contributor

adriangb commented Feb 5, 2026

it would be great to first merge a benchmark into main (if one doesn't already exist) to show an improvement

@notashes notashes marked this pull request as ready for review February 5, 2026 16:17
@notashes notashes marked this pull request as draft February 5, 2026 16:49
@notashes
Copy link
Author

notashes commented Feb 5, 2026

hey @adriangb, the benchmarks to test changes already exist in main. It was merged with the PR #19373
The benchmark (with_hashes) already measure two key scenarios:

  • single vs multiple columns (for rehash/combine logic)
  • with nulls vs without nulls (in keys)

here are some numbers:

group                                                main                                   optimized
-----                                                ----                                   ---------
dictionary_utf8_int32: multiple, no nulls            2.73     31.1±0.46µs        ? ?/sec    1.00     11.4±0.83µs        ? ?/sec
dictionary_utf8_int32: multiple, nulls               1.92     39.4±0.61µs        ? ?/sec    1.00     20.5±0.56µs        ? ?/sec
dictionary_utf8_int32: single, no nulls              2.48     10.5±0.19µs        ? ?/sec    1.00      4.2±0.57µs        ? ?/sec
dictionary_utf8_int32: single, nulls                 1.74     12.1±0.18µs        ? ?/sec    1.00      7.0±0.31µs        ? ?/sec

But I'm curious do we also want to benchmark cases where the values contain nulls? let me know what you think

@Dandandan
Copy link
Contributor

run benchmark with_hashes

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing feat/brunch-prediction (2f35e0e) to 639971a diff
BENCH_NAME=with_hashes
BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes
BENCH_FILTER=
BENCH_BRANCH_NAME=feat_brunch-prediction
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                        feat_brunch-prediction                 main
-----                                        ----------------------                 ----
dictionary_utf8_int32: multiple, no nulls    1.00     25.1±0.05µs        ? ?/sec    2.95     74.0±0.61µs        ? ?/sec
dictionary_utf8_int32: multiple, nulls       1.00     66.5±0.12µs        ? ?/sec    1.63    108.4±0.29µs        ? ?/sec
dictionary_utf8_int32: single, no nulls      1.00      8.2±0.03µs        ? ?/sec    3.48     28.6±0.10µs        ? ?/sec
dictionary_utf8_int32: single, nulls         1.00     23.1±0.07µs        ? ?/sec    1.64     38.0±0.59µs        ? ?/sec
int64: multiple, no nulls                    1.00     38.7±0.06µs        ? ?/sec    1.00     38.8±0.39µs        ? ?/sec
int64: multiple, nulls                       1.24     74.2±0.47µs        ? ?/sec    1.00     59.8±0.61µs        ? ?/sec
int64: single, no nulls                      1.00     11.3±0.03µs        ? ?/sec    1.00     11.4±0.22µs        ? ?/sec
int64: single, nulls                         1.08     23.9±0.19µs        ? ?/sec    1.00     22.2±0.64µs        ? ?/sec
large_utf8: multiple, no nulls               1.00    221.9±1.42µs        ? ?/sec    1.01    223.8±5.79µs        ? ?/sec
large_utf8: multiple, nulls                  1.00   270.8±15.07µs        ? ?/sec    1.02   276.6±11.15µs        ? ?/sec
large_utf8: single, no nulls                 1.00     67.8±2.11µs        ? ?/sec    1.01     68.5±5.63µs        ? ?/sec
large_utf8: single, nulls                    1.00     77.8±0.27µs        ? ?/sec    1.00     78.0±0.26µs        ? ?/sec
utf8: multiple, no nulls                     1.00    236.5±3.39µs        ? ?/sec    1.02    240.9±1.63µs        ? ?/sec
utf8: multiple, nulls                        1.00    258.0±0.45µs        ? ?/sec    1.06    274.7±2.21µs        ? ?/sec
utf8: single, no nulls                       1.00     68.0±0.22µs        ? ?/sec    1.00     68.1±0.63µs        ? ?/sec
utf8: single, nulls                          1.00     76.9±0.28µs        ? ?/sec    1.02     78.5±3.26µs        ? ?/sec
utf8_view (small): multiple, no nulls        1.00     47.6±0.27µs        ? ?/sec    1.00     47.7±1.26µs        ? ?/sec
utf8_view (small): multiple, nulls           1.00     78.4±0.88µs        ? ?/sec    1.00     78.2±0.33µs        ? ?/sec
utf8_view (small): single, no nulls          1.00     13.9±0.25µs        ? ?/sec    1.00     13.9±0.44µs        ? ?/sec
utf8_view (small): single, nulls             1.00     23.7±0.19µs        ? ?/sec    1.00     23.7±0.43µs        ? ?/sec
utf8_view: multiple, no nulls                1.00    229.9±3.76µs        ? ?/sec    1.01    232.9±1.88µs        ? ?/sec
utf8_view: multiple, nulls                   1.00    225.7±1.43µs        ? ?/sec    1.00    224.7±1.63µs        ? ?/sec
utf8_view: single, no nulls                  1.01     74.8±6.15µs        ? ?/sec    1.00     74.2±0.90µs        ? ?/sec
utf8_view: single, nulls                     1.01     72.0±0.21µs        ? ?/sec    1.00     71.3±0.29µs        ? ?/sec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize rehash/null branching in hash util functions

4 participants