Small micro-optimizations for raid algos. by Shark64 · Pull Request #359 · intel/isa-l

Shark64 · 2025-08-25T16:13:23Z

Small patch to keep sub and jcc together to enable macrofusion. Also use shorter opcodes when possibile.

…usion. Also use shorter opcodes when possibile. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>

Shark64 · 2025-08-26T16:18:00Z

@pablodelara sorry to ping you directly, but I'm not sure what to do about the CI failures. On my Linux machine all the tests pass without errors, and I don't have a windows machine to check. Is there something I need to do on my side to fix this? Thx :)

pablodelara · 2025-08-30T08:34:38Z

raid/pq_gen_avx2.asm

 	XLDR	xs1, [ptr+pos]		; Get next vector (source data)
-	jg	next_vect32		; Loop for each vect except 0
+	sub	tmp, 1		  	;Inner loop for each source vector
+	jg	next_vect32		; Loop for each vect


This should be jge like in some other places.

Yeah, missed that, thanks. Fixed

pablodelara · 2025-08-30T08:37:49Z

raid/xor_gen_avx.asm

-	vxorpd	ymm1, ymm1, ymm5
-	vxorpd	ymm2, ymm2, ymm6
-	vxorpd	ymm3, ymm3, ymm7
+	vpxor	ymm0, ymm0, ymm4	;Add to xor parity


This instruction can only be executed on systems that support AVX2, where this function implements AVX only instructions. vxorpd is an AVX instruction supporting ymms (https://www.felixcloutier.com/x86/xorpd) , whereas vpxor only supports ymms if AVX2 is supported (https://www.felixcloutier.com/x86/pxor).

pablodelara · 2025-08-30T08:38:54Z

raid/xor_gen_avx512.asm

 	XSTR	[ptr+pos], zmm0		;Write parity xor vector
 	XSTR	[ptr+pos+64], zmm1
-	add	pos, 128
+	sub	pos, -128


I understand the point of this, but given that frontend is not an issue, it's not worth changing it.

pablodelara · 2025-08-30T08:45:34Z

@pablodelara sorry to ping you directly, but I'm not sure what to do about the CI failures. On my Linux machine all the tests pass without errors, and I don't have a windows machine to check. Is there something I need to do on my side to fix this? Thx :)

Hi @Shark64. I assume you only had access to a machine that has access to AVX512 with GFNI instructions potentially, but you made changes in other implementations. The way you could test the other implementations is using "sde" tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html), targeting different architectures.
Said this, I don't think these optimizations are needed anyway, since (as said in one of the comments I left), there is no bottleneck in the frontend, especially for scalar instructions (I saw it being 1% front-end bound, compared to 62% back-end bound, on P+Q generation functions).
Lastly, please avoid "merge" commits like the one you have in the PR, for future contributions. Thanks!

Fix condition code for macrofusion.

registers for AVX targets. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>

Shark64 · 2025-09-01T11:43:30Z

Thank for the explainations @pablodelara. I installed SDE and I'll give it a try.
For sure this change isn't anything major but seems like a waste to do the same work with 2 m-ops when one is enough ;). I fixed the jcc and reverted to the xorpd. Let me know if you think it's ok to merge this or not.

BTW, the xor_gen_avx512 has a whole set of branchy code for the non aligned case. I think a partial load using the mask registers would work much better, do you think that is worth trying, even if it's not strictly in the hot loop?
Thanks :)

pablodelara · 2025-09-15T07:32:48Z

Hi @Shark64. Apologies, I was on holidays and I have been catching up for the last few days. Thanks for taking the time to go through my comments. I still believe we need to be pragmatical here. Have you seen a significant performance improvement (that is not "noise", say consistent 3-5%, with any of these changes?
The main "problem" of this PR is that it mixes several changes, and some of those changes make the code a bit less readable (like the "add len, -128" or less maintainable/flexible (like changing from EVEX encoded to VEX encoded instructions, limiting the registers to use), whereas other changes make sense.
One last thing: we generally merge the commits as they are in the PR. This one has 4 commits that can be 1.
Thanks!

Shark64 · 2025-09-17T00:13:29Z

Hi, no apologies needed, it's not something that's urgent or anything.
The changes are somewhat measurable with perf, with macrofusion you end up with less total instructions, and that is above the noise-level. The rest not so much.
If you prefer I can close this pull request and make another one with only the changes to enable macrofusion in the loop, dropping the other small optimizations, and with a single commit. Let me know if you think it's worth. It's nothing major but I do think that fusing instructions whenever possibile should be the rule with modern assembly coding.
Thanks :)

pablodelara · 2025-09-17T09:21:51Z

Hi, no apologies needed, it's not something that's urgent or anything. The changes are somewhat measurable with perf, with macrofusion you end up with less total instructions, and that is above the noise-level. The rest not so much. If you prefer I can close this pull request and make another one with only the changes to enable macrofusion in the loop, dropping the other small optimizations, and with a single commit. Let me know if you think it's worth. It's nothing major but I do think that fusing instructions whenever possibile should be the rule with modern assembly coding. Thanks :)

Sure, do that, thanks!

Shark64 · 2025-09-18T12:07:34Z

Sure, do that, thanks!

Quick question: can I also change the various "mov reg, 0" with the canonical "xor reg, reg"? Seems not worth a separate commit message, so I can fold them with the macrofusion changes, if it's ok with you.

pablodelara · 2025-09-18T12:10:41Z

Sure, do that, thanks!

Quick question: can I also change the various "mov reg, 0" with the canonical "xor reg, reg"? Seems not worth a separate commit message, so I can fold them with the macrofusion changes, if it's ok with you.

Sure, if it's only that :)

Small micro-optimizations: keep sub and jcc together to enable macrof…

08ef9fd

…usion. Also use shorter opcodes when possibile. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>

Merge branch 'intel:master' into macrofusion

1cac6ad

pablodelara reviewed Aug 30, 2025

View reviewed changes

Shark64 and others added 2 commits August 30, 2025 15:25

Update pq_gen_avx2.asm

a4036b7

Fix condition code for macrofusion.

Revert changes for xor_gen_avx. Only vxorps/pd is supported on ymm

008bd0b

registers for AVX targets. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>

Shark64 closed this Sep 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small micro-optimizations for raid algos.#359

Small micro-optimizations for raid algos.#359
Shark64 wants to merge 4 commits intointel:masterfrom
Shark64:macrofusion

Shark64 commented Aug 25, 2025

Uh oh!

Shark64 commented Aug 26, 2025

Uh oh!

pablodelara Aug 30, 2025

Uh oh!

Shark64 Aug 30, 2025

Uh oh!

pablodelara Aug 30, 2025

Uh oh!

pablodelara Aug 30, 2025

Uh oh!

pablodelara commented Aug 30, 2025

Uh oh!

Shark64 commented Sep 1, 2025

Uh oh!

pablodelara commented Sep 15, 2025

Uh oh!

Shark64 commented Sep 17, 2025

Uh oh!

pablodelara commented Sep 17, 2025

Uh oh!

Shark64 commented Sep 18, 2025 •

edited

Loading

Uh oh!

pablodelara commented Sep 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

Shark64 commented Aug 25, 2025

Uh oh!

Shark64 commented Aug 26, 2025

Uh oh!

pablodelara Aug 30, 2025

Choose a reason for hiding this comment

Uh oh!

Shark64 Aug 30, 2025

Choose a reason for hiding this comment

Uh oh!

pablodelara Aug 30, 2025

Choose a reason for hiding this comment

Uh oh!

pablodelara Aug 30, 2025

Choose a reason for hiding this comment

Uh oh!

pablodelara commented Aug 30, 2025

Uh oh!

Shark64 commented Sep 1, 2025

Uh oh!

pablodelara commented Sep 15, 2025

Uh oh!

Shark64 commented Sep 17, 2025

Uh oh!

pablodelara commented Sep 17, 2025

Uh oh!

Shark64 commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablodelara commented Sep 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Shark64 commented Sep 18, 2025 •

edited

Loading