Small micro-optimizations for raid algos.#359
Conversation
…usion. Also use shorter opcodes when possibile. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>
|
@pablodelara sorry to ping you directly, but I'm not sure what to do about the CI failures. On my Linux machine all the tests pass without errors, and I don't have a windows machine to check. Is there something I need to do on my side to fix this? Thx :) |
raid/pq_gen_avx2.asm
Outdated
| XLDR xs1, [ptr+pos] ; Get next vector (source data) | ||
| jg next_vect32 ; Loop for each vect except 0 | ||
| sub tmp, 1 ;Inner loop for each source vector | ||
| jg next_vect32 ; Loop for each vect |
There was a problem hiding this comment.
This should be jge like in some other places.
There was a problem hiding this comment.
Yeah, missed that, thanks. Fixed
raid/xor_gen_avx.asm
Outdated
| vxorpd ymm1, ymm1, ymm5 | ||
| vxorpd ymm2, ymm2, ymm6 | ||
| vxorpd ymm3, ymm3, ymm7 | ||
| vpxor ymm0, ymm0, ymm4 ;Add to xor parity |
There was a problem hiding this comment.
This instruction can only be executed on systems that support AVX2, where this function implements AVX only instructions. vxorpd is an AVX instruction supporting ymms (https://www.felixcloutier.com/x86/xorpd) , whereas vpxor only supports ymms if AVX2 is supported (https://www.felixcloutier.com/x86/pxor).
raid/xor_gen_avx512.asm
Outdated
| XSTR [ptr+pos], zmm0 ;Write parity xor vector | ||
| XSTR [ptr+pos+64], zmm1 | ||
| add pos, 128 | ||
| sub pos, -128 |
There was a problem hiding this comment.
I understand the point of this, but given that frontend is not an issue, it's not worth changing it.
Hi @Shark64. I assume you only had access to a machine that has access to AVX512 with GFNI instructions potentially, but you made changes in other implementations. The way you could test the other implementations is using "sde" tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html), targeting different architectures. |
Fix condition code for macrofusion.
registers for AVX targets. Signed-off-by: Nicola Torracca <nicola.torracca@gmail.com>
|
Thank for the explainations @pablodelara. I installed SDE and I'll give it a try. BTW, the xor_gen_avx512 has a whole set of branchy code for the non aligned case. I think a partial load using the mask registers would work much better, do you think that is worth trying, even if it's not strictly in the hot loop? |
|
Hi @Shark64. Apologies, I was on holidays and I have been catching up for the last few days. Thanks for taking the time to go through my comments. I still believe we need to be pragmatical here. Have you seen a significant performance improvement (that is not "noise", say consistent 3-5%, with any of these changes? |
|
Hi, no apologies needed, it's not something that's urgent or anything. |
Sure, do that, thanks! |
Quick question: can I also change the various "mov reg, 0" with the canonical "xor reg, reg"? Seems not worth a separate commit message, so I can fold them with the macrofusion changes, if it's ok with you. |
Sure, if it's only that :) |
Small patch to keep sub and jcc together to enable macrofusion. Also use shorter opcodes when possibile.