erasure_code: xgft loading optimization for AVX2 & dispatch opimization by MaodiMa · Pull Request #365 · intel/isa-l

MaodiMa · 2025-09-28T11:03:20Z

Commit 1: erasure_code: load xgfts with VBROADCASTI128 in x86 AVX2 impl

This commit adjust gf_vect_dot_prod_avx2 & gf_vect_mad_avx2 to improve performance

Using vbroadcasti128 instead of vmovdqu+vperm2i128 when loading xgfts.
Instruction order adjustment
Allocation of GPR takes compatibility on both win64 & elf64 into account.

Performance is checked on Hygon 7490 & AMD Zen 2. Both of them support up to AVX2 instruction set. Data lengh = 4K, k+p = 10+1/2/3/4/5/6. Data in the table below means bandwidth opt/baseline - 100%:
encode:

	10+1	10+2	10+3	10+4	10+5	10+6
Hygon 7490	62.61%	99.74%	120.85%	81.70%	81.34%	71.56%
AMD Zen 2	21.70%	50.90%	64.24%	59.12%	50.17%	48.92%

update:

	10+1	10+2	10+3	10+4	10+5	10+6
Hygon 7490	0.41%	1.27%	46.51%	63.21%	60.79%	60.13%
AMD Zen 2	-0.26%	-0.27%	50.59%	44.42%	48.78%	34.13%

1/2vect mad doesn't show performance improvement because modification is not in the main loop body. Improvement can be observed too if data is really short (code outside the loop will also account for a large proportion).

Performance is not checked strictly on Intel platform because we don't have appropriate one. But we checked it on Intel Sapphire Rapids and force it to use AVX2 routine:
encode:

	10+1	10+2	10+3	10+4	10+5	10+6
Intel Sapphire Rapids	15.10%	16.17%	21.28%	19.19%	19.54%	20.89%

update:

	10+1	10+2	10+3	10+4	10+5	10+6
Intel Sapphire Rapids	0.27%	0.64%	9.26%	-0.36%	-0.74%	-1.93%

Commit 2: erasure_code: add special dispatch case

We observed that the most advanced instruction set may not perform best on certain platform (e.g. Hygon 1/2/3). So we add a special dispatch case, in order to identify the platform and use a customized routine.
Here we identify Hygon 1/2/3 platform and use ec_encode_data_update_avx implemention on them, even if they have the ability to execute AVX2 instruction. Identification is done with Vendor ID & Family+Model so it won't affect any other platform.

Performance is checked on Hygon 7291. Data lengh = 4K, k+p = 10+1/2/3/4/5/6. Data in table below means bandwidth opt/baseline - 100%.
update:

	10+1	10+2	10+3	10+4	10+5	10+6
Hygon 7291 (upstream as baseline)	5.05%	0.61%	38.91%	56.10%	46.16%	55.74%
Hygon 7291 (commit 1 as baseline)	4.61%	0.12%	1.1%	8.57%	3.08%	7.38%

Note that we put 2 group of data here, using upstream & commit 1 as baseline respectively. It proves that using AVX implement on Hygon 1/2/3 platform do run faster than AVX2.

Elton-Yang · 2025-10-10T03:05:39Z

Hi @pablodelara , could you please kindly take a look at this PR? just wanted to gently check in on this. Is there anything we can do to help move this forward? Thanks for your time!

pablodelara · 2025-10-13T10:16:25Z

erasure_code/gf_2vect_dot_prod_avx2.asm

+	vbroadcasti128	xgft1_hi, [tmp+16]		;            hi | hi
+	add	vec_i, PS
+	vbroadcasti128	xgft2_lo, [tmp +vec*(32/PS)]	;Load array: lo | lo
+	vbroadcasti128	xgft2_hi, [tmp5+vec*(32/PS)]	;            hi | hi


You don't need to use tmp5. You can use tmp+16 directly. Same for other files

You are right. Sorry about that. This is a legacy issue because we encountered a problem that the displacement is already occupied in an old version. I have already made the change. Please let me know if you have any other further review comment.

pablodelara · 2025-10-15T09:29:32Z

include/multibinary.asm

+		mov	eax, 1
+		cpuid
+		and	eax, FLAG_CPUID1_EAX_STEP_MASK
+		mov	ecx, FLAG_CPUID1_EAX_HYGON1


Is there any difference in CPU flags for Hygon 1, 2 or 3, so we could use it instead of directly the CPU model?

Here we want to distinguish between "Hygon 1/2/3" and "other newer Hygon platforms", especially Hygon 4. The reason why "Hygon 1/2/3" and "Hygon 4" behave differently is uarch related. As far as we know, there is no flag can reveal the feature that cause the performance difference. Althought we may able to distinguish Hygon 1/2/3 according to some UNRELATED flags, that would be kind of inappropriate in logic and unreliable.
There is a macro named FLAG_CPUID1_EAX_AVOTON for Intel Avoton. My original intention was to follow this approach.

I see... can you add a comment explaining the logic here? What's the interface used for Hygon 1/2/3 and 4.

Done. Is that clear?

pablodelara · 2025-10-15T11:34:17Z

erasure_code/gf_6vect_mad_avx2.asm

-	XLDR	xd3, [dest3+tmp]	;Get next dest vector
-	XLDR	xd4, [dest4+tmp]	;Get next dest vector
-	XLDR	xd5, [dest5+tmp]	;Get next dest vector
+	XLDR	x0, [src+tmp6]		;Get next source vector


Try to minimize code changes. In this case, tmp can stay here.

Prior to this commit, 'tmp' is reloaded with 'len' in line 301. Here we want to keep the address in 'tmp' (original line 192: mul_array + vec_i) and continue to use it later (new line 347, 360, etc.), because the loading pattern of xgft changed. That is to say, we stop reusing 'tmp' in this commit.
Or we could keep the usage of 'tmp' and reload 'mul_array + vec_i' into 'tmp6' to minimize the change. That might be slightly redundant in logic, because we loose the address we need in 'tmp'. Is that seems more reasonable for you?

Sure, since you are adding the use of more GP registers, it should be possible to reduce code changes, please.

I have kept the usage of tmp here and tried to minimize the change. Other MAD codes are also changed as well. Thanks.

pablodelara · 2025-10-15T11:34:40Z

erasure_code/gf_6vect_dot_prod_avx2.asm

 	vpxor	xgft3_hi, xgft3_lo	;GF add high and low partials
 	vpxor	xp6, xgft3_hi		;xp6 += partial

+


Remove extra blank lines

Sure. I will remove it together with other comments later.

pablodelara

A few more comments from me, thanks.

pablodelara · 2025-10-22T18:31:19Z

erasure_code/gf_3vect_mad_avx2.asm

 %define tmp    r11
 %define tmp.w  r11d
 %define tmp.b  r11b
+ %define tmp2   r13


If you use r10, you don't need to worry about preserving an extra GP register.

That's ture. I also fixed the same issue in gf_4vect_mad_avx2. Thanks.

pablodelara · 2025-10-22T18:36:54Z

erasure_code/gf_6vect_mad_avx2.asm

-	XLDR	xd3, [dest3+tmp]	;Get next dest vector
-	XLDR	xd4, [dest4+tmp]	;Get next dest vector
-	XLDR	xd5, [dest5+tmp]	;Get next dest vector
+	XLDR	x0, [src+tmp6]		;Get next source vector


Sure, since you are adding the use of more GP registers, it should be possible to reduce code changes, please.

pablodelara · 2025-10-22T18:40:15Z

include/multibinary.asm

+		mov	eax, 1
+		cpuid
+		and	eax, FLAG_CPUID1_EAX_STEP_MASK
+		mov	ecx, FLAG_CPUID1_EAX_HYGON1


I see... can you add a comment explaining the logic here? What's the interface used for Hygon 1/2/3 and 4.

pablodelara

Apologies for the late response, some comments from my part.

pablodelara · 2025-11-05T12:38:37Z

erasure_code/ec_multibinary.asm

 mbin_dispatch_init5 gf_vect_mul, gf_vect_mul_base, gf_vect_mul_sse, gf_vect_mul_avx, gf_vect_mul_avx
 mbin_dispatch_init8 ec_encode_data, ec_encode_data_base, ec_encode_data_sse, ec_encode_data_avx, ec_encode_data_avx2, ec_encode_data_avx512, ec_encode_data_avx2_gfni, ec_encode_data_avx512_gfni
-mbin_dispatch_init8 ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni
+mbin_dispatch_init8_spclcase ec_encode_data_update, ec_encode_data_update_base, ec_encode_data_update_sse, ec_encode_data_update_avx, ec_encode_data_update_avx2, ec_encode_data_update_avx512, ec_encode_data_update_avx2_gfni, ec_encode_data_update_avx512_gfni


Can you rename this to "mbin_dispatch_init8_hygon"?

I named it in this way because I thought this could be more general. If any similar problems happen on any other platform (like Avoton?), we can put them all together in this dispatch macro as an entire special case, and use it to dispatch those functions affected.
But I can rename it as you said. Maybe I overthink about it.

pablodelara · 2025-11-05T12:42:47Z

erasure_code/gf_3vect_mad_avx2.asm

-	vperm2i128 xgft3_lo, xgft3_lo, xgft3_lo, 0x00 ; swapped to lo | lo
+	vbroadcasti128	xgft2_lo, [tmp2+vec]	; Load array: lo | lo
+	vbroadcasti128	xtmph2,   [tmp2+vec+16]	;             hi | hi
+	vbroadcasti128	xgft3_lo, [tmp2+2*vec]	; Load array: Lo | lo


Nitpick... lowercase: "lo"

pablodelara · 2025-11-05T12:43:45Z

erasure_code/gf_3vect_mad_avx2.asm

+ %define tmp.w  r11d
+ %define tmp.b  r11b
+ %define tmp2   r10
+ %define tmp2.w r10d


tmp2.w and tmp2.b not needed, correct?

Done. I also removed return.w and pos.w. They are not in use as well.

pablodelara · 2025-11-05T12:45:34Z

erasure_code/gf_4vect_mad_avx2.asm

@@ -66,6 +71,8 @@
 	movdqa	[rsp+16*9],xmm15
 	save_reg	r12,  10*16 + 0*8


Actually, could you fix the "movdqa" here? They should be AVX encoded ("vmovdqa").

Never mind, it's now fixed in mainline

pablodelara · 2025-11-05T12:47:05Z

erasure_code/gf_5vect_mad_avx2.asm

 %define tmp.b  r11b
 %define tmp2   r10
+ %define tmp3   r12		; must be saved and restored
+ %define tmp3.w   r12d


Same: not needed .w and .b

pablodelara · 2025-11-05T12:48:31Z

erasure_code/gf_4vect_dot_prod_avx2.asm

+
+	vbroadcasti128	xgft1_lo, [tmp]			;Load array: lo | lo
+	vbroadcasti128	xgft1_hi, [tmp+16]		;            hi | hi
+	vbroadcasti128	xgft2_lo, [tmp+vec*(32/PS)]	;Load array: lo | lo


Make sure comments are consistent (looking at "vec*3") below.

pablodelara · 2025-11-05T12:49:15Z

erasure_code/gf_vect_dot_prod_avx2.asm

-	vperm2i128 xgft_lo, xgft_lo, xgft_lo, 0x00 ; swapped to lo | lo
-
-	XLDR	x0, [ptr+pos]		;Get next source vector
+	vbroadcasti128	xgft_lo, [tmp]		;Load array: hi | hi


Wrong comment

pablodelara · 2025-11-06T15:45:24Z

@MaodiMa can you rebase on top of main? I have made changes that are conflicting this PR.

To generate the side-by-side pattern of two 128-bit xgfts within a YMM reg, loading them with VBROADCASTI128 from mem directly can be faster than loading then swapping them with VMOVDQU + VPERM2i128. Remove some out-of-date macro as well. Signed-off-by: Maodi Ma <mamaodi@hygon.cn>

Using highest-level instruction set may not reveal the best performance on certain platform. E.g. using AVX impl for ec updating instead of AVX2 impl can be faster on Hygon 1/2/3 platform. This commit identifies Hygon platform and use a special dispatch case for ec_encode_data_update to choose certain instruction set impl. Signed-off-by: Maodi Ma <mamaodi@hygon.cn>

pablodelara · 2025-11-17T16:45:11Z

This is now merged, thanks!

pablodelara reviewed Oct 13, 2025

View reviewed changes

MaodiMa force-pushed the master branch from 5dc6711 to 6542d97 Compare October 14, 2025 08:42

pablodelara reviewed Oct 15, 2025

View reviewed changes

pablodelara reviewed Oct 22, 2025

View reviewed changes

MaodiMa force-pushed the master branch from 6542d97 to ce9d31b Compare October 24, 2025 10:24

pablodelara reviewed Nov 5, 2025

View reviewed changes

MaodiMa force-pushed the master branch from ce9d31b to e5ed271 Compare November 7, 2025 12:16

MaodiMa force-pushed the master branch from e5ed271 to 4af31c6 Compare November 10, 2025 02:49

pablodelara closed this Nov 17, 2025

		vpxor xgft3_hi, xgft3_lo ;GF add high and low partials
		vpxor xp6, xgft3_hi ;xp6 += partial

		@@ -66,6 +71,8 @@
		movdqa [rsp+16*9],xmm15
		save_reg r12, 1016 + 08

Conversation

MaodiMa commented Sep 28, 2025

Uh oh!

Elton-Yang commented Oct 10, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaodiMa Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaodiMa Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablodelara left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablodelara left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablodelara commented Nov 6, 2025

MaodiMa Oct 14, 2025 •

edited

Loading

MaodiMa Oct 16, 2025 •

edited

Loading