Contributor

MaodiMa commented Dec 30, 2025

This pull request contains 3 modifications:

Part 1: Add a fast path for short data, as part of https://github.com/intel/isa-l/pull/375 does. The threshold is changed to 8 bytes based on measured performance.

Part 2: Change the entry alignment of crc32_iscsi_by16_10 from 16 to 64 bytes. This improves bandwidth by about 5%-17% on our Intel(R) Xeon(R) Platinum 8480+ platform for short data under 16 bytes. It has no impact on the AMD or Hygon platforms.

Part 3: Reorder some instructions in 128_done. This brings a ~5% improvement on the Hygon platform for scenarios that depend heavily on it, such as 17-31 bytes. It has no impact on the Intel or AMD platforms.

This commit has no effect on long data processing.
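
To illustrate the idea behind Part 1, here is a minimal C sketch of a short-data fast path. It is not the assembly from this pull request: the function names are hypothetical, the CRC update is a portable bitwise CRC-32C rather than the hardware instruction, and the `len < 8` cutoff is an assumption based on the 8-byte threshold mentioned above. The point is only the dispatch shape, where each set bit of the length selects one 4-, 2-, or 1-byte step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one byte into a reflected CRC-32C state (polynomial 0x82F63B78). */
static uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Fold n consecutive bytes; stands in for a wider hardware CRC step. */
static uint32_t crc32c_bytes(uint32_t crc, const uint8_t *p, size_t n)
{
    while (n--)
        crc = crc32c_byte(crc, *p++);
    return crc;
}

/* Hypothetical fast path for len < 8: one step per set bit of len. */
static uint32_t crc32c_short(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(len < 8);
    if (len & 4) { crc = crc32c_bytes(crc, p, 4); p += 4; }
    if (len & 2) { crc = crc32c_bytes(crc, p, 2); p += 2; }
    if (len & 1) { crc = crc32c_byte(crc, *p); }
    return crc;
}

/* Dispatcher: short inputs skip the general routine entirely. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
    return (len < 8) ? crc32c_short(crc, p, len) : crc32c_bytes(crc, p, len);
}
```

Because the length is never decremented, each step has to look at its own bit of `len`; the fast path and the general path return the same CRC state for every length below 8.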

Performance data are listed below (iteration/s):

| byte(s) | Hygon baseline | Hygon opt | Hygon speedup | Intel(R) Xeon(R) Platinum 8480+ baseline | Intel(R) Xeon(R) Platinum 8480+ opt | Intel(R) Xeon(R) Platinum 8480+ speedup | AMD Ryzen 7 9700X (Zen5) baseline | AMD Ryzen 7 9700X (Zen5) opt | AMD Ryzen 7 9700X (Zen5) speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 138.78 | 156.11 | 12.49% | 158.66 | 176.37 | 11.16% | 549.02 | 613.71 | 11.78% |
| 2 | 138.77 | 166.52 | 20.00% | 159.21 | 187.70 | 17.90% | 549.45 | 682.67 | 24.25% |
| 3 | 131.47 | 166.54 | 26.67% | 166.99 | 187.85 | 12.49% | 577.76 | 681.54 | 17.96% |
| 4 | 138.77 | 166.49 | 19.98% | 159.93 | 187.82 | 17.44% | 547.13 | 680.96 | 24.46% |
| 5 | 138.77 | 166.51 | 19.99% | 159.85 | 187.78 | 17.48% | 546.41 | 681.42 | 24.71% |
| 6 | 138.80 | 178.42 | 28.55% | 161.03 | 201.47 | 25.12% | 546.91 | 692.49 | 26.62% |
| 7 | 138.78 | 156.11 | 12.49% | 161.28 | 202.22 | 25.38% | 547.06 | 692.48 | 26.58% |
| 8 | 138.77 | 146.93 | 5.88% | 161.71 | 168.60 | 4.26% | 546.87 | 576.47 | 5.41% |
| 9 | 138.77 | 146.93 | 5.88% | 161.78 | 170.56 | 5.42% | 546.67 | 577.29 | 5.60% |
| 10 | 138.77 | 156.14 | 12.51% | 161.96 | 177.43 | 9.55% | 547.19 | 613.67 | 12.15% |
| 11 | 138.76 | 146.94 | 5.89% | 161.65 | 177.20 | 9.62% | 547.05 | 613.85 | 12.21% |
| 12 | 138.77 | 156.13 | 12.50% | 161.15 | 177.44 | 10.10% | 546.54 | 613.98 | 12.34% |
| 13 | 138.78 | 146.94 | 5.88% | 161.03 | 177.67 | 10.33% | 546.33 | 614.06 | 12.40% |
| 14 | 138.77 | 146.95 | 5.89% | 161.30 | 188.50 | 16.86% | 546.09 | 670.60 | 22.80% |
| 15 | 138.74 | 124.90 | -9.98% | 161.29 | 188.91 | 17.12% | 546.62 | 668.72 | 22.34% |
| 16 | 146.93 | 192.16 | 30.79% | 167.09 | 216.45 | 29.54% | 554.33 | 692.55 | 24.93% |
| 17 | 113.53 | 124.88 | 10.00% | 158.29 | 177.64 | 12.22% | 426.22 | 469.70 | 10.20% |
| 18 | 113.54 | 124.88 | 9.99% | 157.90 | 178.06 | 12.77% | 425.06 | 470.97 | 10.80% |
| 19 | 113.54 | 124.90 | 10.00% | 158.30 | 178.32 | 12.65% | 426.31 | 472.10 | 10.74% |
| 20 | 113.54 | 124.90 | 10.00% | 158.33 | 178.22 | 12.56% | 427.78 | 470.23 | 9.92% |
| 21 | 113.54 | 124.90 | 10.00% | 158.37 | 178.70 | 12.84% | 425.41 | 468.79 | 10.20% |
| 22 | 113.53 | 124.89 | 10.01% | 158.40 | 178.21 | 12.50% | 425.39 | 469.31 | 10.32% |
| 23 | 113.54 | 124.90 | 10.00% | 158.38 | 178.25 | 12.55% | 427.54 | 471.33 | 10.24% |
| 24 | 113.53 | 124.91 | 10.02% | 158.31 | 177.99 | 12.43% | 425.99 | 465.73 | 9.33% |
| 25 | 113.53 | 124.88 | 10.00% | 158.55 | 177.91 | 12.21% | 425.58 | 470.15 | 10.47% |
| 26 | 113.53 | 124.90 | 10.02% | 158.14 | 178.21 | 12.69% | 427.32 | 468.44 | 9.62% |
| 27 | 113.54 | 124.89 | 10.00% | 157.84 | 178.75 | 13.25% | 426.95 | 468.28 | 9.68% |
| 28 | 113.55 | 124.89 | 9.99% | 157.80 | 178.12 | 12.88% | 425.31 | 468.46 | 10.15% |
| 29 | 113.53 | 124.89 | 10.00% | 157.82 | 178.38 | 13.02% | 425.65 | 470.07 | 10.43% |
| 30 | 113.55 | 124.90 | 10.00% | 157.81 | 177.99 | 12.79% | 424.60 | 469.40 | 10.55% |
| 31 | 113.54 | 124.91 | 10.02% | 157.80 | 177.58 | 12.53% | 426.31 | 469.33 | 10.09% |
| 32 | 131.47 | 143.09 | 8.84% | 152.36 | 187.85 | 23.29% | 507.03 | 554.00 | 9.26% |
| 33 | 99.91 | 99.92 | 0.01% | 152.96 | 180.36 | 17.92% | 329.83 | 330.85 | 0.31% |
| 34 | 99.91 | 99.92 | 0.01% | 152.88 | 180.02 | 17.75% | 328.65 | 331.69 | 0.92% |
| ... | | | | | | | | | |
| 1024 | 37.67 | 37.71 | 0.12% | 78.34 | 78.33 | -0.01% | 86.60 | 86.59 | -0.02% |
| 2048 | 19.68 | 19.67 | -0.06% | 40.90 | 40.90 | 0.00% | 43.30 | 43.29 | -0.02% |
| 3072 | 13.27 | 13.26 | -0.08% | 28.70 | 28.70 | 0.00% | 28.71 | 28.86 | 0.52% |
| 4096 | 10.02 | 10.02 | -0.01% | 21.79 | 21.79 | 0.02% | 21.56 | 21.65 | 0.39% |
| 5120 | 8.04 | 8.04 | 0.00% | 17.57 | 17.57 | 0.00% | 17.26 | 17.32 | 0.33% |
| 6144 | 6.72 | 6.72 | 0.07% | 14.71 | 14.71 | 0.01% | 14.39 | 14.43 | 0.26% |
| 7168 | 5.77 | 5.77 | -0.01% | 12.65 | 12.66 | 0.04% | 12.34 | 12.37 | 0.23% |
| 8192 | 5.06 | 5.06 | 0.01% | 11.10 | 11.11 | 0.06% | 10.80 | 10.82 | 0.20% |

Only the 15-byte scenario on the Hygon platform is negatively affected. Overall, performance improves.
Besides, only the Intel platform shows an improvement for data longer than 32 bytes. That is the effect of the 64-byte alignment.
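
For readers checking the numbers, the speedup column appears to be the relative gain of opt over baseline, that is opt / baseline - 1 expressed as a percentage. This formula is inferred from the reported figures rather than stated in the pull request, and the helper name below is hypothetical.

```c
#include <assert.h>

/* Relative gain of opt over baseline, in percent (assumed definition). */
static double speedup_pct(double baseline, double opt)
{
    assert(baseline > 0.0);
    return (opt / baseline - 1.0) * 100.0;
}
```

With the Hygon 1-byte figures, `speedup_pct(138.78, 156.11)` gives roughly 12.49, and with the Hygon 15-byte figures, `speedup_pct(138.74, 124.90)` gives roughly -9.98, both in line with the reported values.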

Contributor Author

MaodiMa commented Jan 23, 2026

@pablodelara Hi, sorry to bother you. This pull request has been open for some time. I was wondering whether there is anything we could do to improve this work?

Contributor

pablodelara left a comment

Two minor comments, thanks!

je .exact_16_left
jl .less_than_16_left

vmovdqu xmm7, [arg2] ; load the plaintext
Contributor

Could you add a comment saying there are between 17 and 32 bytes at this stage?

Contributor Author

Sure. Done

add arg2,2

.less_than_2:
test arg3,1 ; check if 1 byte remaining
Contributor

Might be shorter to do "or arg3,arg3"

Contributor Author

Sorry, but the test instructions in the earlier labels do not change the value of arg3. If arg3 is 0b10, it still holds 0b10 after being processed in less_than_4. An "or arg3, arg3" here would then lead to an incorrect not-taken branch, so we have to check only the least significant bit.

Contributor

Fair enough, thanks!
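
The exchange above can be made concrete with a small C model. This is only a sketch of the flag logic being discussed, not code from the pull request: the function names are hypothetical, and `len` plays the role of arg3, which the tail-handling steps test but never decrement.

```c
#include <assert.h>
#include <stdint.h>

/* Models `test arg3, 1`: only bit 0 of the remaining length is examined. */
static int tail_byte_by_test(uint64_t len) { return (len & 1u) != 0; }

/* Models `or arg3, arg3`: the result is nonzero whenever any bit is set. */
static int tail_byte_by_or(uint64_t len) { return (len | len) != 0; }

/* Bytes a 4/2/1 ladder would consume when len is never decremented. */
static unsigned bytes_consumed(uint64_t len, int (*last_step)(uint64_t))
{
    assert(len < 8);
    unsigned n = 0;
    if (len & 4u) n += 4;        /* 4-byte step */
    if (len & 2u) n += 2;        /* 2-byte step */
    if (last_step(len)) n += 1;  /* final 1-byte step */
    return n;
}
```

For a length of 2 (0b10), `bytes_consumed(2, tail_byte_by_test)` is 2, while `bytes_consumed(2, tail_byte_by_or)` is 3: the whole-register check sees the still-set bit 1 and processes one byte too many.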

- Add fastpath for short data
- Change code align from 16 to 64
- Code reordering when dealing final 16 bytes after folding

Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
@pablodelara
Contributor

This PR is merged now, thanks!
