Skip to content

erasure_code: xgft loading optimization for AVX2 & dispatch opimization#365

Closed
MaodiMa wants to merge 2 commits intointel:masterfrom
MaodiMa:master
Closed

erasure_code: xgft loading optimization for AVX2 & dispatch opimization#365
MaodiMa wants to merge 2 commits intointel:masterfrom
MaodiMa:master

Conversation

@MaodiMa
Copy link
Contributor

@MaodiMa MaodiMa commented Sep 28, 2025

Commit 1: erasure_code: load xgfts with VBROADCASTI128 in x86 AVX2 impl

This commit adjust gf_vect_dot_prod_avx2 & gf_vect_mad_avx2 to improve performance

  • Using vbroadcasti128 instead of vmovdqu+vperm2i128 when loading xgfts.
  • Instruction order adjustment
  • Allocation of GPR takes compatibility on both win64 & elf64 into account.

Performance is checked on Hygon 7490 & AMD Zen 2. Both of them support up to AVX2 instruction set. Data lengh = 4K, k+p = 10+1/2/3/4/5/6. Data in the table below means bandwidth opt/baseline - 100%:
encode:

10+1 10+2 10+3 10+4 10+5 10+6
Hygon 7490 62.61% 99.74% 120.85% 81.70% 81.34% 71.56%
AMD Zen 2 21.70% 50.90% 64.24% 59.12% 50.17% 48.92%

update:

10+1 10+2 10+3 10+4 10+5 10+6
Hygon 7490 0.41% 1.27% 46.51% 63.21% 60.79% 60.13%
AMD Zen 2 -0.26% -0.27% 50.59% 44.42% 48.78% 34.13%

1/2vect mad doesn't show performance improvement because modification is not in the main loop body. Improvement can be observed too if data is really short (code outside the loop will also account for a large proportion).

Performance is not checked strictly on Intel platform because we don't have appropriate one. But we checked it on Intel Sapphire Rapids and force it to use AVX2 routine:
encode:

10+1 10+2 10+3 10+4 10+5 10+6
Intel Sapphire Rapids 15.10% 16.17% 21.28% 19.19% 19.54% 20.89%

update:

10+1 10+2 10+3 10+4 10+5 10+6
Intel Sapphire Rapids 0.27% 0.64% 9.26% -0.36% -0.74% -1.93%

Commit 2: erasure_code: add special dispatch case

We observed that the most advanced instruction set may not perform best on certain platform (e.g. Hygon 1/2/3). So we add a special dispatch case, in order to identify the platform and use a customized routine.
Here we identify Hygon 1/2/3 platform and use ec_encode_data_update_avx implemention on them, even if they have the ability to execute AVX2 instruction. Identification is done with Vendor ID & Family+Model so it won't affect any other platform.

Performance is checked on Hygon 7291. Data lengh = 4K, k+p = 10+1/2/3/4/5/6. Data in table below means bandwidth opt/baseline - 100%.
update:

10+1 10+2 10+3 10+4 10+5 10+6
Hygon 7291 (upstream as baseline) 5.05% 0.61% 38.91% 56.10% 46.16% 55.74%
Hygon 7291 (commit 1 as baseline) 4.61% 0.12% 1.1% 8.57% 3.08% 7.38%

Note that we put 2 group of data here, using upstream & commit 1 as baseline respectively. It proves that using AVX implement on Hygon 1/2/3 platform do run faster than AVX2.

@Elton-Yang
Copy link

Hi @pablodelara , could you please kindly take a look at this PR? just wanted to gently check in on this. Is there anything we can do to help move this forward? Thanks for your time!

vbroadcasti128 xgft1_hi, [tmp+16] ; hi | hi
add vec_i, PS
vbroadcasti128 xgft2_lo, [tmp +vec*(32/PS)] ;Load array: lo | lo
vbroadcasti128 xgft2_hi, [tmp5+vec*(32/PS)] ; hi | hi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to use tmp5. You can use tmp+16 directly. Same for other files

Copy link
Contributor Author

@MaodiMa MaodiMa Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Sorry about that. This is a legacy issue because we encountered a problem that the displacement is already occupied in an old version. I have already made the change. Please let me know if you have any other further review comment.

mov eax, 1
cpuid
and eax, FLAG_CPUID1_EAX_STEP_MASK
mov ecx, FLAG_CPUID1_EAX_HYGON1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference in CPU flags for Hygon 1, 2 or 3, so we could use it instead of directly the CPU model?

Copy link
Contributor Author

@MaodiMa MaodiMa Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we want to distinguish between "Hygon 1/2/3" and "other newer Hygon platforms", especially Hygon 4. The reason why "Hygon 1/2/3" and "Hygon 4" behave differently is uarch related. As far as we know, there is no flag can reveal the feature that cause the performance difference. Althought we may able to distinguish Hygon 1/2/3 according to some UNRELATED flags, that would be kind of inappropriate in logic and unreliable.
There is a macro named FLAG_CPUID1_EAX_AVOTON for Intel Avoton. My original intention was to follow this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... can you add a comment explaining the logic here? What's the interface used for Hygon 1/2/3 and 4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Is that clear?

XLDR xd3, [dest3+tmp] ;Get next dest vector
XLDR xd4, [dest4+tmp] ;Get next dest vector
XLDR xd5, [dest5+tmp] ;Get next dest vector
XLDR x0, [src+tmp6] ;Get next source vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to minimize code changes. In this case, tmp can stay here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this commit, 'tmp' is reloaded with 'len' in line 301. Here we want to keep the address in 'tmp' (original line 192: mul_array + vec_i) and continue to use it later (new line 347, 360, etc.), because the loading pattern of xgft changed. That is to say, we stop reusing 'tmp' in this commit.
Or we could keep the usage of 'tmp' and reload 'mul_array + vec_i' into 'tmp6' to minimize the change. That might be slightly redundant in logic, because we loose the address we need in 'tmp'. Is that seems more reasonable for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, since you are adding the use of more GP registers, it should be possible to reduce code changes, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept the usage of tmp here and tried to minimize the change. Other MAD codes are also changed as well. Thanks.

vpxor xgft3_hi, xgft3_lo ;GF add high and low partials
vpxor xp6, xgft3_hi ;xp6 += partial


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra blank lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will remove it together with other comments later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@pablodelara pablodelara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments from me, thanks.

%define tmp r11
%define tmp.w r11d
%define tmp.b r11b
%define tmp2 r13
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use r10, you don't need to worry about preserving an extra GP register.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ture. I also fixed the same issue in gf_4vect_mad_avx2. Thanks.

XLDR xd3, [dest3+tmp] ;Get next dest vector
XLDR xd4, [dest4+tmp] ;Get next dest vector
XLDR xd5, [dest5+tmp] ;Get next dest vector
XLDR x0, [src+tmp6] ;Get next source vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, since you are adding the use of more GP registers, it should be possible to reduce code changes, please.

mov eax, 1
cpuid
and eax, FLAG_CPUID1_EAX_STEP_MASK
mov ecx, FLAG_CPUID1_EAX_HYGON1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... can you add a comment explaining the logic here? What's the interface used for Hygon 1/2/3 and 4.

Copy link
Contributor

@pablodelara pablodelara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late response, some comments from my part.

mbin_dispatch_init5 gf_vect_mul, gf_vect_mul_base, gf_vect_mul_sse, gf_vect_mul_avx, gf_vect_mul_avx
mbin_dispatch_init8 ec_encode_data, ec_encode_data_base, ec_encode_data_sse, ec_encode_data_avx, ec_encode_data_avx2, ec_encode_data_avx512, ec_encode_data_avx2_gfni, ec_encode_data_avx512_gfni
mbin_dispatch_init8 ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni
mbin_dispatch_init8_spclcase ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to "mbin_dispatch_init8_hygon"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it in this way because I thought this could be more general. If any similar problems happen on any other platform (like Avoton?), we can put them all together in this dispatch macro as an entire special case, and use it to dispatch those functions affected.
But I can rename it as you said. Maybe I overthink about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

vperm2i128 xgft3_lo, xgft3_lo, xgft3_lo, 0x00 ; swapped to lo | lo
vbroadcasti128 xgft2_lo, [tmp2+vec] ; Load array: lo | lo
vbroadcasti128 xtmph2, [tmp2+vec+16] ; hi | hi
vbroadcasti128 xgft3_lo, [tmp2+2*vec] ; Load array: Lo | lo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick... lowercase: "lo"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

%define tmp.w r11d
%define tmp.b r11b
%define tmp2 r10
%define tmp2.w r10d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmp2.w and tmp2.b not needed, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also removed return.w and pos.w. They are not in use as well.

@@ -66,6 +71,8 @@
movdqa [rsp+16*9],xmm15
save_reg r12, 10*16 + 0*8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, could you fix the "movdqa" here? They should be AVX encoded ("vmovdqa").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, it's now fixed in mainline

%define tmp.b r11b
%define tmp2 r10
%define tmp3 r12 ; must be saved and restored
%define tmp3.w r12d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same: not needed .w and .b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


vbroadcasti128 xgft1_lo, [tmp] ;Load array: lo | lo
vbroadcasti128 xgft1_hi, [tmp+16] ; hi | hi
vbroadcasti128 xgft2_lo, [tmp+vec*(32/PS)] ;Load array: lo | lo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure comments are consistent (looking at "vec*3") below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

vperm2i128 xgft_lo, xgft_lo, xgft_lo, 0x00 ; swapped to lo | lo

XLDR x0, [ptr+pos] ;Get next source vector
vbroadcasti128 xgft_lo, [tmp] ;Load array: hi | hi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pablodelara
Copy link
Contributor

@MaodiMa can you rebase on top of main? I have made changes that are conflicting this PR.

To generate the side-by-side pattern of two 128-bit xgfts within a
YMM reg, loading them with VBROADCASTI128 from mem directly can be
faster than loading then swapping them with VMOVDQU + VPERM2i128.

Remove some out-of-date macro as well.

Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
Using highest-level instruction set may not reveal the best
performance on certain platform. E.g. using AVX impl for ec
updating instead of AVX2 impl can be faster on Hygon 1/2/3
platform.

This commit identifies Hygon platform and use a special
dispatch case for ec_encode_data_update to choose certain
instruction set impl.

Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
@pablodelara
Copy link
Contributor

This is now merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments