NEON: properly implement _high intrinsics #1030
easyaspi314 wants to merge 2 commits into simd-everywhere:master
    simde_int16x8_private r_;
    simde_int16x8_private a_ = simde_int16x8_to_private(a);
    simde_int8x16_private b_ = simde_int8x16_to_private(b);

    SIMDE_VECTORIZE
    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
      r_.values[i] = a_.values[i] + b_.values[i + ((sizeof(b_.values) / sizeof(b_.values[0])) / 2)];
    }

    return simde_int16x8_from_private(r_);
Hmm... So you think there is no architecture/compiler combo that would produce better code from this vectorized loop than the fallback of simde_vaddw_s8(a, simde_vget_high_s8(b))?
I am mostly going for ease of implementation in this PR.
If the compiler is reasonably intelligent, it should be able to detect the redundant assignment/shuffle and eliminate it. However, I haven't tested the codegen.
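To make the two options under discussion concrete, here is a minimal scalar sketch of a dedicated loop versus reuse of the non-high function. The `sketch_*` types and functions are hypothetical stand-ins, not SIMDe's real definitions; the only assumption taken from the diff is that the private types expose a `values` array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SIMDe's private vector types (assumption:
 * the real ones also wrap a plain `values` array). */
typedef struct { int8_t  values[8];  } sketch_i8x8;
typedef struct { int8_t  values[16]; } sketch_i8x16;
typedef struct { int16_t values[8];  } sketch_i16x8;

/* Upper eight lanes of a 16-lane vector, in the spirit of vget_high_s8. */
static sketch_i8x8 sketch_get_high_s8(sketch_i8x16 b) {
  sketch_i8x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = b.values[i + 8];
  return r;
}

/* Widening add, in the spirit of vaddw_s8: r[i] = a[i] + b[i]. */
static sketch_i16x8 sketch_addw_s8(sketch_i16x8 a, sketch_i8x8 b) {
  sketch_i16x8 r;
  for (size_t i = 0; i < 8; i++)
    r.values[i] = (int16_t)(a.values[i] + b.values[i]);
  return r;
}

/* Variant A: dedicated loop that indexes the upper half directly. */
static sketch_i16x8 sketch_addw_high_loop(sketch_i16x8 a, sketch_i8x16 b) {
  sketch_i16x8 r;
  for (size_t i = 0; i < 8; i++)
    r.values[i] = (int16_t)(a.values[i] + b.values[i + 8]);
  return r;
}

/* Variant B: reuse the non-high function on the extracted upper half. */
static sketch_i16x8 sketch_addw_high_reuse(sketch_i16x8 a, sketch_i8x16 b) {
  return sketch_addw_s8(a, sketch_get_high_s8(b));
}

/* Checks that both variants agree lane by lane on small inputs. */
static void sketch_demo(void) {
  sketch_i16x8 a;
  sketch_i8x16 b;
  for (size_t i = 0; i < 8; i++)  a.values[i] = (int16_t)(100 * (int)i);
  for (size_t i = 0; i < 16; i++) b.values[i] = (int8_t)((int)i - 8);
  sketch_i16x8 x = sketch_addw_high_loop(a, b);
  sketch_i16x8 y = sketch_addw_high_reuse(a, b);
  for (size_t i = 0; i < 8; i++) assert(x.values[i] == y.values[i]);
}
```

Variant B contains an extra copy in the source, which is the "redundant assignment/shuffle" an optimizer would need to elide for the two variants to compile to the same code.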
GCC and Clang both generate identical code on a downscaled version, eliding the copy.
MSVC x86 emits a few extra instructions on /arch:IA32 whether I use a copy loop or memcpy, but it isn't terrible. https://godbolt.org/z/Y3v4vjz46
Here is /arch:SSE2: https://godbolt.org/z/nWTKMfh7K
However, 99% of the time MSVC will use SSE2 by default — /arch:IA32 is opt-in.
GCC and Clang are the ones where scalar counts, and they emit identical code.
Long story short, 99% free code reuse.
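For reference, the "copy loop or memcpy" choice for extracting the upper half could look roughly like the following. This is only an illustrative sketch with hypothetical `half_*` names and types; it is not the code behind the Godbolt links above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vector stand-ins, assumed to wrap a plain array. */
typedef struct { uint8_t values[8];  } half_u8x8;
typedef struct { uint8_t values[16]; } half_u8x16;

/* Extract the upper eight lanes with an explicit copy loop. */
static half_u8x8 half_get_high_loop(half_u8x16 v) {
  half_u8x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = v.values[i + 8];
  return r;
}

/* Extract the upper eight lanes with a single memcpy. */
static half_u8x8 half_get_high_memcpy(half_u8x16 v) {
  half_u8x8 r;
  memcpy(r.values, &v.values[8], sizeof(r.values));
  return r;
}

/* Checks that both extraction styles return the same bytes. */
static void half_demo(void) {
  half_u8x16 v;
  for (size_t i = 0; i < 16; i++) v.values[i] = (uint8_t)(i * 3);
  half_u8x8 x = half_get_high_loop(v);
  half_u8x8 y = half_get_high_memcpy(v);
  assert(memcmp(x.values, y.values, sizeof(x.values)) == 0);
}
```

Both functions compute the same result; which one a given compiler handles better is an empirical question that only inspecting the generated code can answer.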
Hold up, the story changes with uint16_t... GCC vomits.
With which version does GCC vomit when compiling the uint16_t functions: the vectorized or the downscaled version?
It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.
> It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.
So you're seeing better code from this PR for GCC?
No. Rather, vaddw_u16 itself has mediocre codegen, and reusing it passes those codegen issues on to vaddw_high_u16. This is because GCC vectorizes it internally, which is better when SIMD is available.
Okay. Is this PR ready, or do you want to make other changes?
High intrinsics merely have an implicit vget_high or vcombine. There is no need to complicate them further.
@easyaspi314 hey-o, does this PR need more work or should I rebase and merge?
High intrinsics merely have an implicit vget_high or vcombine, as a helper for most of the widening or narrowing instructions, since 64-bit ARM can't address the upper halves of registers directly anymore. There is no need to complicate them further.
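The vcombine side of that statement applies to the narrowing forms. As a rough scalar sketch, a narrowing high operation in the spirit of vmovn_high_s16(r, a) can be written as vcombine_s8(r, vmovn_s16(a)). The `hi_*` names and types below are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector stand-ins, assumed to wrap a plain array. */
typedef struct { int8_t  values[8];  } hi_i8x8;
typedef struct { int8_t  values[16]; } hi_i8x16;
typedef struct { int16_t values[8];  } hi_i16x8;

/* Narrow each 16-bit lane to its low 8 bits, in the spirit of vmovn_s16.
 * The conversion of an out-of-range value to int8_t is
 * implementation-defined in C; common compilers wrap modulo 256. */
static hi_i8x8 hi_movn_s16(hi_i16x8 a) {
  hi_i8x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = (int8_t)(a.values[i] & 0xFF);
  return r;
}

/* Join two 8-lane halves into one 16-lane vector, like vcombine_s8. */
static hi_i8x16 hi_combine_s8(hi_i8x8 lo, hi_i8x8 hi) {
  hi_i8x16 r;
  for (size_t i = 0; i < 8; i++) {
    r.values[i]     = lo.values[i];
    r.values[i + 8] = hi.values[i];
  }
  return r;
}

/* The high form is just the plain narrow plus an implicit combine. */
static hi_i8x16 hi_movn_high_s16(hi_i8x8 r, hi_i16x8 a) {
  return hi_combine_s8(r, hi_movn_s16(a));
}

/* Checks that the low half is passed through and the high half is narrowed. */
static void hi_demo(void) {
  hi_i8x8 lo;
  hi_i16x8 a;
  for (size_t i = 0; i < 8; i++) {
    lo.values[i] = (int8_t)i;
    a.values[i]  = (int16_t)(256 + 10 + (int)i); /* low byte is 10 + i */
  }
  hi_i8x16 out = hi_movn_high_s16(lo, a);
  for (size_t i = 0; i < 8; i++) {
    assert(out.values[i] == (int8_t)i);
    assert(out.values[i + 8] == (int8_t)(10 + (int)i));
  }
}
```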