This looks feasible. Have we tried to get some actual performance results with HIP examples? If not, I can try that out. My concerns are that (hopefully I am wrong):
cc @lialan @qedawkins @Max191 @Muzammiluddin-Syed-ECE @kuhar
We have two parallel tracks of codegen improvement work going on: using LDS DMA in more cases, and using something like the XOR swizzle for bank conflict avoidance. Currently, the XOR swizzle is being adopted for scaled matmul, but it would be good (it would save us LDS space and reduce bank conflicts) if it were also adopted in cases where we're using gather_to_lds to DMA in operands.
Now, this approach poses some problems. The general idea of an LDS swizzle, as implemented today, is a late-codegen transformation of the program: wherever you would read/write address X, you read/write s(X) instead, where s is some pure function of the address chosen to de-conflict everyone.
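To make the role of s concrete, here is a minimal Python sketch of one XOR swizzle on a toy LDS model. The bank count, the tile shape, and the helper names (`swizzle`, `bank`, `max_conflict`) are assumptions for illustration, not what the compiler actually emits.

```python
from collections import Counter

NUM_BANKS = 32        # assumed bank count, one element per bank slot
ROWS, COLS = 32, 32   # assumed row-major tile; COLS is a power of two


def swizzle(addr):
    """Toy s(X): XOR the column index with the row index."""
    row, col = divmod(addr, COLS)
    return row * COLS + (col ^ (row % COLS))


def bank(addr):
    """Bank that an element address maps to in this toy model."""
    return addr % NUM_BANKS


def max_conflict(addrs):
    """Largest number of accesses that land in the same bank."""
    return max(Counter(bank(a) for a in addrs).values())


# Every lane reads the same column of a different row: the worst case
# for a row-major layout whose row stride is a multiple of NUM_BANKS.
column_read = [row * COLS + 5 for row in range(ROWS)]

unswizzled = max_conflict(column_read)
swizzled = max_conflict([swizzle(a) for a in column_read])
print(unswizzled, swizzled)
```

In this model the unswizzled column read puts all 32 accesses in one bank, while the swizzled read spreads them across 32 distinct banks. This particular s is also a permutation of the tile and its own inverse.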
However, our DMA machinery can't apply such swizzles, because the values read from global memory are always written out to LDS in a consecutive stripe.
That said (ignoring the entire second can of worms that is transposing reads), for some swizzles there's a way around this. Namely, we can push the swizzle onto the address computation. If our unswizzled pattern was going to put data at LDS address X, we now need that data at s^-1(X), so that when each lane applies the swizzle, it gets the data it's meant to.
This only really works with things like the XOR swizzle that are invertible and still look like a gather afterwards (so it doesn't work with padding), but it seems like a piece of infrastructure we should consider building out.
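A rough sketch of that idea in Python, under stated assumptions: `toy_gather_to_lds` is a hypothetical model that only captures the "consecutive stripe" constraint and is not the real op, and the tile shape, lane count, and swizzle are made up for illustration. Since the LDS destination can't be permuted, each lane instead gathers the global element that belongs at its fixed LDS slot. For this XOR swizzle s is its own inverse, so s and s^-1 coincide.

```python
ROWS, COLS = 32, 32   # assumed row-major tile; COLS is a power of two
LANES = 64            # assumed number of lanes per DMA issue


def swizzle(addr):
    """Toy s(X): XOR the column index with the row index."""
    row, col = divmod(addr, COLS)
    return row * COLS + (col ^ (row % COLS))


def swizzle_inverse(addr):
    """s^-1(X); this XOR swizzle is an involution, so it equals s."""
    return swizzle(addr)


def toy_gather_to_lds(global_mem, src_addrs, lds, lds_base):
    """Hypothetical DMA model: lane i gathers from src_addrs[i], but the
    values always land consecutively starting at lds_base."""
    for lane, src in enumerate(src_addrs):
        lds[lds_base + lane] = global_mem[src]


# global_tile[X] is the value a reader expects at logical LDS address X.
global_tile = [1000 + i for i in range(ROWS * COLS)]
lds = [None] * (ROWS * COLS)

# Push the swizzle onto the source addresses: physical slot P receives
# the element whose logical address is s^-1(P).
for lds_base in range(0, ROWS * COLS, LANES):
    src = [swizzle_inverse(lds_base + lane) for lane in range(LANES)]
    toy_gather_to_lds(global_tile, src, lds, lds_base)


def swizzled_read(lds_mem, logical_addr):
    """What the late-codegen swizzle turns a read of X into."""
    return lds_mem[swizzle(logical_addr)]


readback = [swizzled_read(lds, x) for x in range(ROWS * COLS)]
print(readback == global_tile)
```

The source addresses are still just a per-lane gather, which is why this shape of swizzle fits. A padding scheme would instead need to leave holes in LDS, which a consecutive stripe can't do in this model.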