Skip to content

Small micro-optimizations for raid algos.#359

Closed
Shark64 wants to merge 4 commits intointel:masterfrom
Shark64:macrofusion
Closed

Small micro-optimizations for raid algos.#359
Shark64 wants to merge 4 commits intointel:masterfrom
Shark64:macrofusion

Conversation

@Shark64
Copy link
Contributor

@Shark64 Shark64 commented Aug 25, 2025

Small patch to keep sub and jcc together to enable macrofusion. Also use shorter opcodes when possibile.

…usion. Also use shorter opcodes when possibile.

Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>
@Shark64
Copy link
Contributor Author

Shark64 commented Aug 26, 2025

@pablodelara sorry to ping you directly, but I'm not sure what to do about the CI failures. On my Linux machine all the tests pass without errors, and I don't have a windows machine to check. Is there something I need to do on my side to fix this? Thx :)

XLDR xs1, [ptr+pos] ; Get next vector (source data)
jg next_vect32 ; Loop for each vect except 0
sub tmp, 1 ;Inner loop for each source vector
jg next_vect32 ; Loop for each vect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be jge like in some other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, missed that, thanks. Fixed

vxorpd ymm1, ymm1, ymm5
vxorpd ymm2, ymm2, ymm6
vxorpd ymm3, ymm3, ymm7
vpxor ymm0, ymm0, ymm4 ;Add to xor parity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruction can only be executed on systems that support AVX2, where this function implements AVX only instructions. vxorpd is an AVX instruction supporting ymms (https://www.felixcloutier.com/x86/xorpd) , whereas vpxor only supports ymms if AVX2 is supported (https://www.felixcloutier.com/x86/pxor).

XSTR [ptr+pos], zmm0 ;Write parity xor vector
XSTR [ptr+pos+64], zmm1
add pos, 128
sub pos, -128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the point of this, but given that frontend is not an issue, it's not worth changing it.

@pablodelara
Copy link
Contributor

@pablodelara sorry to ping you directly, but I'm not sure what to do about the CI failures. On my Linux machine all the tests pass without errors, and I don't have a windows machine to check. Is there something I need to do on my side to fix this? Thx :)

Hi @Shark64. I assume you only had access to a machine that has access to AVX512 with GFNI instructions potentially, but you made changes in other implementations. The way you could test the other implementations is using "sde" tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html), targeting different architectures.
Said this, I don't think these optimizations are needed anyway, since (as said in one of the comments I left), there is no bottleneck in the frontend, especially for scalar instructions (I saw it being 1% front-end bound, compared to 62% back-end bound, on P+Q generation functions).
Lastly, please avoid "merge" commits like the one you have in the PR, for future contributions. Thanks!

Shark64 and others added 2 commits August 30, 2025 15:25
Fix condition code for macrofusion.
registers for AVX targets.

Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>
@Shark64
Copy link
Contributor Author

Shark64 commented Sep 1, 2025

Thank for the explainations @pablodelara. I installed SDE and I'll give it a try.
For sure this change isn't anything major but seems like a waste to do the same work with 2 m-ops when one is enough ;). I fixed the jcc and reverted to the xorpd. Let me know if you think it's ok to merge this or not.

BTW, the xor_gen_avx512 has a whole set of branchy code for the non aligned case. I think a partial load using the mask registers would work much better, do you think that is worth trying, even if it's not strictly in the hot loop?
Thanks :)

@pablodelara
Copy link
Contributor

Hi @Shark64. Apologies, I was on holidays and I have been catching up for the last few days. Thanks for taking the time to go through my comments. I still believe we need to be pragmatical here. Have you seen a significant performance improvement (that is not "noise", say consistent 3-5%, with any of these changes?
The main "problem" of this PR is that it mixes several changes, and some of those changes make the code a bit less readable (like the "add len, -128" or less maintainable/flexible (like changing from EVEX encoded to VEX encoded instructions, limiting the registers to use), whereas other changes make sense.
One last thing: we generally merge the commits as they are in the PR. This one has 4 commits that can be 1.
Thanks!

@Shark64
Copy link
Contributor Author

Shark64 commented Sep 17, 2025

Hi, no apologies needed, it's not something that's urgent or anything.
The changes are somewhat measurable with perf, with macrofusion you end up with less total instructions, and that is above the noise-level. The rest not so much.
If you prefer I can close this pull request and make another one with only the changes to enable macrofusion in the loop, dropping the other small optimizations, and with a single commit. Let me know if you think it's worth. It's nothing major but I do think that fusing instructions whenever possibile should be the rule with modern assembly coding.
Thanks :)

@pablodelara
Copy link
Contributor

Hi, no apologies needed, it's not something that's urgent or anything. The changes are somewhat measurable with perf, with macrofusion you end up with less total instructions, and that is above the noise-level. The rest not so much. If you prefer I can close this pull request and make another one with only the changes to enable macrofusion in the loop, dropping the other small optimizations, and with a single commit. Let me know if you think it's worth. It's nothing major but I do think that fusing instructions whenever possibile should be the rule with modern assembly coding. Thanks :)

Sure, do that, thanks!

@Shark64 Shark64 closed this Sep 17, 2025
@Shark64
Copy link
Contributor Author

Shark64 commented Sep 18, 2025

Sure, do that, thanks!

Quick question: can I also change the various "mov reg, 0" with the canonical "xor reg, reg"? Seems not worth a separate commit message, so I can fold them with the macrofusion changes, if it's ok with you.

@pablodelara
Copy link
Contributor

Sure, do that, thanks!

Quick question: can I also change the various "mov reg, 0" with the canonical "xor reg, reg"? Seems not worth a separate commit message, so I can fold them with the macrofusion changes, if it's ok with you.

Sure, if it's only that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments