Add AVX2 implementation of first_max_element#2806
Conversation
248542a to
28cb0a9
Compare
28cb0a9 to
db76e3c
Compare
|
Resulting assembly: |
|
It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too? |
I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit. As far as I understand, this is how the algorithm LLVM uses works: max:
; rdi is a pointer to the first element, and it is statically known that there are 8 elements
; load the first 4 elements of the slice into xmm0
vmovdqu xmm0, xmmword ptr [rdi]
; do a packed max the last 4 elements of the slice (rdi + 16)
; basically, xmm0 now looks like this (where x represents the original 8-element array):
; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
;
; at this point, we can forget about the entire array and just worry about finding the max of
; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
; shuffle: xmm1 = xmm0[2,3,2,3]
vpshufd xmm1, xmm0, 238
; packed max of xmm0 and xmm1, store the result in xmm0
; here is what the registers look like before this step:
; xmm0 = [x[0], x[1], x[2], x[3]]
; xmm1 = [x[2], x[3], x[2], x[3]]
; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
; and as you can see, the max will be either the first or second element of xmm0
vpmaxsd xmm0, xmm0, xmm1
; shuffle : xmm1 = xmm0[1,1,1,1]
; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
; (other elements omitted since we don't care about them any more)
vpshufd xmm1, xmm0, 85
; packed max of xmm0 and xmm1, store the result in xmm0
vpmaxsd xmm0, xmm0, xmm1
; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]
; so move the first element of xmm0, which is the final max, into eax
vmovd eax, xmm0
ret |
|
Ah yeah sorry, yeah the first max comparison can only be 4 vs 4. |
db76e3c to
47ad5a4
Compare
This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.
47ad5a4 to
cff1af9
Compare
|
I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195 |
|
@KyleSiefring I think you're right, I didn't notice this. If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here: Lines 1773 to 1785 in 2ec4e67 However, I think there could still be at least some value for making |
|
Could you please split the scalar Your simd code would be good in the not-nasm situation but maybe it would need further discussion. |
|
Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet? |
|
We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms. |
This also now requires BMI1 and BMI2 for AVX2 in
CpuFeatureLevel(for tzcnt).