erasure_code: xgft loading optimization for AVX2 & dispatch optimization #365
MaodiMa wants to merge 2 commits into intel:master
Conversation
Hi @pablodelara, could you please take a look at this PR? Just wanted to gently check in on this. Is there anything we can do to help move this forward? Thanks for your time!
    vbroadcasti128 xgft1_hi, [tmp+16]           ; hi | hi
    add            vec_i, PS
    vbroadcasti128 xgft2_lo, [tmp+vec*(32/PS)]  ;Load array: lo | lo
    vbroadcasti128 xgft2_hi, [tmp5+vec*(32/PS)] ; hi | hi
You don't need to use tmp5. You can use tmp+16 directly. Same for other files
You are right, sorry about that. This is a legacy leftover: in an older version the displacement was already in use, so we went through tmp5. I have already made the change. Please let me know if you have any further review comments.
    mov   eax, 1
    cpuid
    and   eax, FLAG_CPUID1_EAX_STEP_MASK
    mov   ecx, FLAG_CPUID1_EAX_HYGON1
Is there any difference in CPU flags for Hygon 1, 2 or 3, so we could use that instead of the CPU model directly?
Here we want to distinguish between "Hygon 1/2/3" and other, newer Hygon platforms, especially Hygon 4. The reason "Hygon 1/2/3" and "Hygon 4" behave differently is uarch related. As far as we know, there is no CPU flag that reveals the feature causing the performance difference. Although we might be able to distinguish Hygon 1/2/3 through some unrelated flags, that would be logically inappropriate and unreliable.
There is a macro named FLAG_CPUID1_EAX_AVOTON for Intel Avoton. My original intention was to follow that approach.
I see... can you add a comment explaining the logic here? What's the interface used for Hygon 1/2/3 and 4?
Done. Is that clear?
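For illustration, the kind of detection discussed in this thread (vendor string plus family/model from CPUID leaf 1, in the spirit of the existing FLAG_CPUID1_EAX_AVOTON handling) could be sketched in C roughly as below. This is only a sketch: the helper name, the step mask, and the family/model signature value are assumptions, not the constants used in the PR, whose actual check is written in assembly.

```c
#include <cpuid.h>    /* GCC/Clang wrapper around the CPUID instruction */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return true on the older Hygon generations. */
static bool is_older_hygon(void)
{
	unsigned int eax, ebx, ecx, edx;
	char vendor[13];

	/* Leaf 0: the vendor string comes back in EBX, EDX, ECX ("HygonGenuine"). */
	if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
		return false;
	memcpy(vendor + 0, &ebx, 4);
	memcpy(vendor + 4, &edx, 4);
	memcpy(vendor + 8, &ecx, 4);
	vendor[12] = '\0';
	if (strcmp(vendor, "HygonGenuine") != 0)
		return false;

	/* Leaf 1: EAX holds stepping/model/family. Mask off the stepping
	 * (analogous to FLAG_CPUID1_EAX_STEP_MASK) and compare the remaining
	 * family+model signature against the generations of interest. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	unsigned int family_model = eax & 0xFFFFFFF0u;

	/* Placeholder signature; the real code compares against the specific
	 * Hygon 1/2/3 family/model encodings. */
	return family_model == 0x00900F00u;
}

int main(void)
{
	printf("older Hygon: %d\n", is_older_hygon());
	return 0;
}
```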
erasure_code/gf_6vect_mad_avx2.asm
Outdated
    XLDR xd3, [dest3+tmp]   ;Get next dest vector
    XLDR xd4, [dest4+tmp]   ;Get next dest vector
    XLDR xd5, [dest5+tmp]   ;Get next dest vector
    XLDR x0,  [src+tmp6]    ;Get next source vector
Try to minimize code changes. In this case, tmp can stay here.
Prior to this commit, 'tmp' is reloaded with 'len' at line 301. Here we want to keep the address in 'tmp' (original line 192: mul_array + vec_i) and continue to use it later (new lines 347, 360, etc.), because the loading pattern of the xgfts changed. That is to say, we stop reusing 'tmp' in this commit.
Or we could keep the usage of 'tmp' and reload 'mul_array + vec_i' into 'tmp6' to minimize the change. That might be slightly redundant in logic, because we lose the address we need in 'tmp'. Does that seem more reasonable to you?
Sure, since you are adding the use of more GP registers, it should be possible to reduce code changes, please.
I have kept the usage of tmp here and tried to minimize the change. The other MAD files are updated as well. Thanks.
    vpxor xgft3_hi, xgft3_lo   ;GF add high and low partials
    vpxor xp6, xgft3_hi        ;xp6 += partial


Remove extra blank lines
Sure. I will remove them when I address the other comments.
pablodelara left a comment
A few more comments from me, thanks.
erasure_code/gf_3vect_mad_avx2.asm
Outdated
    %define tmp   r11
    %define tmp.w r11d
    %define tmp.b r11b
    %define tmp2  r13
If you use r10, you don't need to worry about preserving an extra GP register.
That's true. I also fixed the same issue in gf_4vect_mad_avx2. Thanks.
pablodelara left a comment
Apologies for the late response; some comments from my side.
erasure_code/ec_multibinary.asm
Outdated
    mbin_dispatch_init5 gf_vect_mul, gf_vect_mul_base, gf_vect_mul_sse, gf_vect_mul_avx, gf_vect_mul_avx
    mbin_dispatch_init8 ec_encode_data, ec_encode_data_base, ec_encode_data_sse, ec_encode_data_avx, ec_encode_data_avx2, ec_encode_data_avx512, ec_encode_data_avx2_gfni, ec_encode_data_avx512_gfni
    mbin_dispatch_init8 ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni
    mbin_dispatch_init8_spclcase ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni
Can you rename this to "mbin_dispatch_init8_hygon"?
I named it this way because I thought it could be more general: if a similar problem shows up on any other platform (like Avoton?), we can handle them all in this one dispatch macro as a single special case and use it to dispatch the affected functions.
But I can rename it as you suggested. Maybe I was overthinking it.
erasure_code/gf_3vect_mad_avx2.asm
Outdated
    vperm2i128     xgft3_lo, xgft3_lo, xgft3_lo, 0x00 ; swapped to lo | lo
    vbroadcasti128 xgft2_lo, [tmp2+vec]      ; Load array: lo | lo
    vbroadcasti128 xtmph2, [tmp2+vec+16]     ; hi | hi
    vbroadcasti128 xgft3_lo, [tmp2+2*vec]    ; Load array: Lo | lo
Nitpick... lowercase: "lo"
erasure_code/gf_3vect_mad_avx2.asm
Outdated
    %define tmp.w  r11d
    %define tmp.b  r11b
    %define tmp2   r10
    %define tmp2.w r10d
tmp2.w and tmp2.b not needed, correct?
Done. I also removed return.w and pos.w; they are not in use either.
    @@ -66,6 +71,8 @@
    movdqa [rsp+16*9], xmm15
    save_reg r12, 10*16 + 0*8
Actually, could you fix the "movdqa" here? They should be AVX encoded ("vmovdqa").
Never mind, it's now fixed in mainline
erasure_code/gf_5vect_mad_avx2.asm
Outdated
    %define tmp.b  r11b
    %define tmp2   r10
    %define tmp3   r12 ; must be saved and restored
    %define tmp3.w r12d
Same here: the .w and .b defines are not needed.
    vbroadcasti128 xgft1_lo, [tmp]              ;Load array: lo | lo
    vbroadcasti128 xgft1_hi, [tmp+16]           ; hi | hi
    vbroadcasti128 xgft2_lo, [tmp+vec*(32/PS)]  ;Load array: lo | lo
Make sure comments are consistent (looking at "vec*3") below.
    vperm2i128 xgft_lo, xgft_lo, xgft_lo, 0x00  ; swapped to lo | lo

    XLDR x0, [ptr+pos]             ;Get next source vector
    vbroadcasti128 xgft_lo, [tmp]  ;Load array: hi | hi
@MaodiMa can you rebase on top of main? I have made changes that conflict with this PR.
To generate the side-by-side pattern of two 128-bit xgfts within a YMM reg, loading them with VBROADCASTI128 directly from memory can be faster than loading and then swapping them with VMOVDQU + VPERM2I128. Remove some out-of-date macros as well. Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
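As an illustration of the access-pattern change, here is a C-intrinsics sketch (compile with AVX2 enabled). This is not the PR's NASM code; it only assumes a table laid out as 16 bytes of low-nibble entries followed by 16 bytes of high-nibble entries:

```c
#include <immintrin.h>
#include <stdio.h>
#include <string.h>

/* Before: one 32-byte load of "lo | hi", then two cross-lane shuffles to
 * duplicate each half across the YMM register (vmovdqu + vperm2i128). */
static void load_xgft_permute(const unsigned char *tbl, __m256i *lo_lo, __m256i *hi_hi)
{
	__m256i both = _mm256_loadu_si256((const __m256i *)tbl);   /* lo | hi */
	*lo_lo = _mm256_permute2x128_si256(both, both, 0x00);      /* lo | lo */
	*hi_hi = _mm256_permute2x128_si256(both, both, 0x11);      /* hi | hi */
}

/* After: two 16-byte broadcast loads straight from memory (vbroadcasti128),
 * no cross-lane shuffle needed. */
static void load_xgft_broadcast(const unsigned char *tbl, __m256i *lo_lo, __m256i *hi_hi)
{
	*lo_lo = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)tbl));        /* lo | lo */
	*hi_hi = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)(tbl + 16))); /* hi | hi */
}

int main(void)
{
	unsigned char tbl[32];
	__m256i a_lo, a_hi, b_lo, b_hi;

	for (int i = 0; i < 32; i++)
		tbl[i] = (unsigned char)i;

	load_xgft_permute(tbl, &a_lo, &a_hi);
	load_xgft_broadcast(tbl, &b_lo, &b_hi);

	/* Both variants produce identical lo|lo and hi|hi registers. */
	printf("lo match: %d, hi match: %d\n",
	       memcmp(&a_lo, &b_lo, 32) == 0, memcmp(&a_hi, &b_hi, 32) == 0);
	return 0;
}
```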
Using the highest-level instruction set may not give the best performance on every platform. E.g. using the AVX implementation for ec update instead of the AVX2 one can be faster on Hygon 1/2/3 platforms. This commit identifies Hygon platforms and uses a special dispatch case for ec_encode_data_update to choose a specific instruction-set implementation. Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
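A rough C analogue of what this special dispatch case does at init time is sketched below. The real code is the mbin_dispatch_init8_spclcase NASM macro; the function names, pointer type, and capability checks here are simplified placeholders, not the library's actual API:

```c
#include <stdbool.h>
#include <stdio.h>

/* Placeholder implementations standing in for the real per-ISA routines. */
static void ec_update_base(void) { puts("base"); }
static void ec_update_avx(void)  { puts("avx");  }
static void ec_update_avx2(void) { puts("avx2"); }

/* Placeholder capability / platform checks; in the library these come from
 * CPUID (see the detection sketch in the review discussion above). */
static bool cpu_has_avx(void)    { return true;  }
static bool cpu_has_avx2(void)   { return true;  }
static bool is_older_hygon(void) { return false; }

typedef void (*ec_update_fn)(void);
static ec_update_fn ec_encode_data_update_ptr;

/* Run once at init: pick the highest supported implementation, then override
 * it for the special-cased platform where the AVX path beats AVX2. */
static void ec_update_dispatch_init(void)
{
	ec_update_fn fn = ec_update_base;

	if (cpu_has_avx())
		fn = ec_update_avx;
	if (cpu_has_avx2())
		fn = ec_update_avx2;

	/* Special case: Hygon 1/2/3 run the AVX version faster, so step back
	 * down even though AVX2 is available. */
	if (is_older_hygon() && cpu_has_avx())
		fn = ec_update_avx;

	ec_encode_data_update_ptr = fn;
}

int main(void)
{
	ec_update_dispatch_init();
	ec_encode_data_update_ptr();   /* calls the selected implementation */
	return 0;
}
```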
This is now merged, thanks!
Commit 1: erasure_code: load xgfts with VBROADCASTI128 in x86 AVX2 impl
This commit adjusts gf_vect_dot_prod_avx2 & gf_vect_mad_avx2 to improve performance.
Performance is checked on Hygon 7490 & AMD Zen 2. Both of them support up to the AVX2 instruction set. Data length = 4K, k+p = 10+1/2/3/4/5/6. The data in the table below is bandwidth opt/baseline - 100%:
encode:
update:
1/2vect mad doesn't show a performance improvement because the modification is not in the main loop body. An improvement can still be observed there if the data is very short (code outside the loop then accounts for a large proportion of the runtime).
Performance is not checked strictly on an Intel platform because we don't have an appropriate one. But we checked it on Intel Sapphire Rapids, forcing it to use the AVX2 routine:
encode:
update:
Commit 2: erasure_code: add special dispatch case
We observed that the most advanced instruction set may not perform best on certain platforms (e.g. Hygon 1/2/3). So we add a special dispatch case in order to identify the platform and use a customized routine.
Here we identify Hygon 1/2/3 platforms and use the ec_encode_data_update_avx implementation on them, even though they are able to execute AVX2 instructions. Identification is done with Vendor ID & Family+Model, so it won't affect any other platform.
Performance is checked on Hygon 7291. Data length = 4K, k+p = 10+1/2/3/4/5/6. The data in the table below is bandwidth opt/baseline - 100%.
update:
Note that we put two groups of data here, using upstream and commit 1 as the baseline respectively. It demonstrates that the AVX implementation does run faster than AVX2 on Hygon 1/2/3 platforms.