Add AVX2 implementation of `first_max_element` by redzic · Pull Request #2806 · xiph/rav1e

redzic · 2021-09-15T19:58:14Z

This also now requires BMI1 and BMI2 for AVX2 in CpuFeatureLevel (for tzcnt).

tdaede · 2021-09-15T21:23:15Z

Resulting assembly:

example::first_max_element_avx2:
        vmovdqu xmm0, xmmword ptr [rdi]
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
        vpshufd xmm1, xmm0, 238
        vpmaxsd xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpmaxsd xmm0, xmm0, xmm1
        vmovd   edx, xmm0
        vpbroadcastd    ymm0, xmm0
        vpcmpeqd        ymm0, ymm0, ymmword ptr [rdi]
        vmovmskps       eax, ymm0
        tzcnt   eax, eax
        vzeroupper
        ret

tdaede · 2021-09-15T21:29:28Z

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

coveralls · 2021-09-15T21:32:41Z

Coverage decreased (-0.1%) to 83.273% when pulling cff1af9 on redzic:find-first-max-avx2 into 2ec4e67 on xiph:master.

redzic · 2021-09-15T22:04:45Z

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit.

As far as I understand, this is how the algorithm LLVM uses works:

max:
        ; rdi is a pointer to the first element, and it is statically known that there are 8 elements

        ; load the first 4 elements of the slice into xmm0
        vmovdqu xmm0, xmmword ptr [rdi]

        ; do a packed max the last 4 elements of the slice (rdi + 16)
        ; basically, xmm0 now looks like this (where x represents the original 8-element array):
        ; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
        ;
        ; at this point, we can forget about the entire array and just worry about finding the max of
        ; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]

        ; shuffle: xmm1 = xmm0[2,3,2,3]
        vpshufd xmm1, xmm0, 238

        ; packed max of xmm0 and xmm1, store the result in xmm0
        ; here is what the registers look like before this step:
        ; xmm0 = [x[0], x[1], x[2], x[3]]
        ; xmm1 = [x[2], x[3], x[2], x[3]]
        ; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
        ; and as you can see, the max will be either the first or second element of xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; shuffle : xmm1 = xmm0[1,1,1,1]
        ; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
        ; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
        ; (other elements omitted since we don't care about them any more)
        vpshufd xmm1, xmm0, 85

        ; packed max of xmm0 and xmm1, store the result in xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]

        ; so move the first element of xmm0, which is the final max, into eax
        vmovd   eax, xmm0
        ret

tdaede · 2021-09-15T22:15:35Z

Ah yeah sorry, yeah the first max comparison can only be 4 vs 4.

src/cdef.rs

This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.

KyleSiefring · 2021-09-16T02:00:14Z

I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195

redzic · 2021-09-16T03:52:14Z

@KyleSiefring I think you're right, I didn't notice this.

If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here:

rav1e/src/x86/cdef_avx2.asm

Lines 1773 to 1785 in 2ec4e67

    
               pmaxsd         xm2, xm0, xm1 
        
               pshufd         xm3, xm2, q1032 
        
               pmaxsd         xm2, xm3 
        
               pshufd         xm3, xm2, q2301 
        
               pmaxsd         xm2, xm3 ; best cost 
        
               ; find the idx using minpos 
        
               ; make everything other than the best cost negative via subtraction 
        
               ; find the min of unsigned 16-bit ints to sort out the negative values 
        
               psubd          xm4, xm1, xm2 
        
               psubd          xm3, xm0, xm2 
        
               packssdw       xm3, xm4 
        
               phminposuw     xm3, xm3

However, I think there could still be at least some value for making first_max_element in the rust code take a &[i32; 8], as it allows the compiler to eliminate some bounds checks and generate less code.

lu-zero · 2021-09-16T09:46:04Z

Could you please split the scalar &[i32; 8] part from the patch?

Your simd code would be good in the not-nasm situation but maybe it would need further discussion.

tdaede · 2021-09-17T20:55:55Z

Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet?

redzic · 2021-09-18T00:39:53Z

We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms.

redzic · 2021-11-16T03:58:06Z

@lu-zero @tdaede Now that std::simd has reached nightly, we could probably implement a portable SIMD version of this function with it, but gate it behind nightly, similar to how aarch64 intrinsics are currently being used.

redzic force-pushed the find-first-max-avx2 branch 3 times, most recently from 248542a to 28cb0a9 Compare September 15, 2021 20:52

redzic mentioned this pull request Sep 15, 2021

cdef: enforce 8 element size for first_max_element #2807

Closed

redzic force-pushed the find-first-max-avx2 branch from 28cb0a9 to db76e3c Compare September 15, 2021 21:18

tdaede approved these changes Sep 15, 2021

View reviewed changes

lu-zero reviewed Sep 15, 2021

View reviewed changes

src/cdef.rs Outdated Show resolved Hide resolved

redzic force-pushed the find-first-max-avx2 branch from db76e3c to 47ad5a4 Compare September 15, 2021 23:00

Add AVX2 implementation of first_max_element

cff1af9

This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.

redzic force-pushed the find-first-max-avx2 branch from 47ad5a4 to cff1af9 Compare September 15, 2021 23:06

redzic marked this pull request as draft September 18, 2021 00:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AVX2 implementation of `first_max_element`#2806

Add AVX2 implementation of `first_max_element`#2806
redzic wants to merge 1 commit intoxiph:masterfrom
redzic:find-first-max-avx2

redzic commented Sep 15, 2021

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

coveralls commented Sep 15, 2021 •

edited

Loading

Uh oh!

redzic commented Sep 15, 2021 •

edited

Loading

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

Uh oh!

KyleSiefring commented Sep 16, 2021

Uh oh!

redzic commented Sep 16, 2021

Uh oh!

lu-zero commented Sep 16, 2021

Uh oh!

tdaede commented Sep 17, 2021

Uh oh!

redzic commented Sep 18, 2021

Uh oh!

redzic commented Nov 16, 2021 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

redzic commented Sep 15, 2021

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

coveralls commented Sep 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

redzic commented Sep 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdaede commented Sep 15, 2021

Uh oh!

Uh oh!

KyleSiefring commented Sep 16, 2021

Uh oh!

redzic commented Sep 16, 2021

Uh oh!

lu-zero commented Sep 16, 2021

Uh oh!

tdaede commented Sep 17, 2021

Uh oh!

redzic commented Sep 18, 2021

Uh oh!

redzic commented Nov 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

coveralls commented Sep 15, 2021 •

edited

Loading

redzic commented Sep 15, 2021 •

edited

Loading

redzic commented Nov 16, 2021 •

edited

Loading