LDS DMA and bank-conflict avoidance swizzles #23289

krzysz00 · 2026-01-27T00:15:16Z

krzysz00
Jan 27, 2026
Collaborator

cc @lialan @qedawkins @Max191 @Muzammiluddin-Syed-ECE @kuhar

We have two parallel tracks of codegen improvement work going on: using LDS DMA in more cases and using something like the XOR swizzle for bank conflict avoidance. Currently, the XOR swizzle is being adopted for scaled matmul, but it would be good - in that it'll save us LDS space and reduce bank conflicts - if it were adopted in cases where we're using gather_to_lds to DMA in operations.

Now, this approach poses some problems. The general idea of an LDS swizzle, as implemented today, is that you do a late-codegen transformation of the program where, if you would read/write address X, you instead read/write s(X), where s is some pure function of the address chosen to de-conflict everyone.
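To make the "pure function of the address" idea concrete, here is a minimal toy model. The bank count, row length, and the particular XOR function are all assumptions made for illustration, not the swizzle the compiler actually emits.

```python
NUM_BANKS = 32   # assumed bank count for this toy model
ROW_WORDS = 32   # assumed tile row length, in 4-byte words


def swizzle(addr: int) -> int:
    """Toy s(X): XOR the column index with the low bits of the row index."""
    row, col = divmod(addr, ROW_WORDS)
    return row * ROW_WORDS + (col ^ (row % ROW_WORDS))


def max_conflict(addrs) -> int:
    """Largest number of addresses in `addrs` that map to the same bank."""
    counts = {}
    for a in addrs:
        bank = a % NUM_BANKS
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())


# 32 lanes read one column of the tile: lane i reads element (row=i, col=5).
column_read = [lane * ROW_WORDS + 5 for lane in range(32)]

unswizzled_conflict = max_conflict(column_read)
swizzled_conflict = max_conflict([swizzle(a) for a in column_read])
print(unswizzled_conflict, swizzled_conflict)
```

In this model every lane of the unswizzled column read hits the same bank, while the swizzled addresses spread across all banks.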

However, our DMA machinery isn't able to apply such swizzles, as the values read from global memory are always written out to LDS in a consecutive stripe.

However (ignoring the entire second can of worms that is transposing reads), for some swizzles, there's a way around this. Namely, we can push the swizzle onto the address computation. If our unswizzled pattern was going to put data at LDS address X, we now need that data at s^-1(X) so that when each lane does the swizzle, it'll get the data it's meant to.
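A small round-trip sketch of this idea, under assumed names and sizes (`ROW_ELEMS`, `NUM_ROWS`, and the XOR function are hypothetical). The DMA is modeled as writing LDS slots strictly in order, so the only knob is which source element each slot fetches. In this sketch, physical slot P fetches the element whose logical address is s^-1(P), and a reader that wants logical address X reads s(X). For this XOR swizzle s is its own inverse, so the two directions coincide.

```python
ROW_ELEMS = 8   # hypothetical tile row length, in DMA-sized elements
NUM_ROWS = 8
TILE = ROW_ELEMS * NUM_ROWS


def s(x: int) -> int:
    """Toy XOR swizzle: XOR the column with the row index; the row is unchanged."""
    row, col = divmod(x, ROW_ELEMS)
    return row * ROW_ELEMS + (col ^ (row % ROW_ELEMS))


def s_inv(x: int) -> int:
    """Inverse of s. The row bits are untouched, so applying s twice is the identity."""
    return s(x)


# Source tile in global memory; logical LDS address X should hold global_tile[X].
global_tile = [100 + i for i in range(TILE)]

# Model of the DMA: slots are written consecutively, and the swizzle is pushed
# into the source address that each slot reads from.
lds = [global_tile[s_inv(p)] for p in range(TILE)]

# Readers apply the usual swizzle to the logical address they want.
read_back = [lds[s(x)] for x in range(TILE)]
print(read_back == global_tile)
```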

This only really works with things like the XOR swizzle that are invertible and still look like a gather afterwards (so it doesn't work with padding) but it seems like a piece of infrastructure we should consider building out.
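One way to read this restriction, as a sketch with made-up sizes: a consecutive DMA stripe can only realize layouts that permute the tile's own address range. The XOR swizzle below does that, while a padded layout (one hypothetical pad element per row) leaves holes in the destination and spills past the end of the range.

```python
ROW_ELEMS = 8   # hypothetical row length
NUM_ROWS = 8
TILE = ROW_ELEMS * NUM_ROWS


def xor_swizzle(x: int) -> int:
    """Toy XOR swizzle: permutes columns within each row."""
    row, col = divmod(x, ROW_ELEMS)
    return row * ROW_ELEMS + (col ^ (row % ROW_ELEMS))


def padded(x: int, pad: int = 1) -> int:
    """Toy padded layout: each row is followed by `pad` unused elements."""
    row, col = divmod(x, ROW_ELEMS)
    return row * (ROW_ELEMS + pad) + col


def is_permutation(layout, n: int) -> bool:
    """True if `layout` maps 0..n-1 onto exactly 0..n-1."""
    return sorted(layout(x) for x in range(n)) == list(range(n))


xor_ok = is_permutation(xor_swizzle, TILE)
pad_ok = is_permutation(padded, TILE)
print(xor_ok, pad_ok)
```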

lialan · 2026-01-27T00:26:39Z

lialan
Jan 27, 2026
Collaborator

This looks feasible.

Have we tried to get some actual performance results with HIP examples? If not, I can try that out.

My concerns are that (hopefully I am wrong):

1. We should try not to use per-element addressing, which wastes some bandwidth. I tried LDS loads on MI300, and single-element direct LDS loads were slower (by 10%) for GEMM ops, so we definitely want to avoid that.
2. Swizzled loads from the source for different threads. Would subgroup instructions with an irregular source access pattern (but without bank conflicts) perform well? If we haven't tried it out, we should do experiments first.

5 replies

krzysz00 Jan 27, 2026
Collaborator Author

This is about gfx950 in particular. (Also, speaking of which, we probably shouldn't be enabling DMA if we can't guarantee b128 loads on gfx950.)

I don't really understand your second question.

lialan Jan 27, 2026
Collaborator

For the second question, I meant: so far we have been using the load-to-LDS instruction for copying. Have we actually tried to use it as a gather instruction? Would that be slower?

First question: I will ensure we only enable it when 128-bit loads are available.

krzysz00 Jan 27, 2026
Collaborator Author

For the second question there - I don't know about the performance impact of the gather mode. I will say that, as proposed, the idea is that we'll still be gathering the same data on each wave. It's just that we'll permute the order in which that data is stored so that, when reading it, we won't have bank conflicts. We'll create that permutation by manipulating the address being read from (probably in late codegen, by analogy to current LDS swizzle handling), since that's the only knob we have.

Max191 Jan 27, 2026
Collaborator

I think what Alan may be asking is whether we have any signal that this will be a good tradeoff. Pushing the swizzle to the global accesses means we will be giving up coalesced access patterns in some cases, so we would be trading bank conflicts for non-coalesced loads. Doing some HIP experiments is often a low-cost way of getting a proof of concept.

I'm guessing this could be useful in cases where loads are already not coalesced, though, anyway (e.g., if we wanted to do this for convolution).

kuhar Jan 27, 2026
Maintainer

IIUC, the gather mode is only performant if the serialized access is subgroup-contiguous. So we can still swizzle but within the width of a (few) cache lines.
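A toy model of the locality point, not a performance measurement. The cache-line size, the 16-byte (b128) per-lane load, the wave size, and both swizzle functions are assumptions for illustration. It counts how many distinct cache lines each group of consecutive lanes touches when the swizzle is pushed into the source address.

```python
CACHE_LINE_BYTES = 128   # assumed cache-line size
CHUNK_BYTES = 16         # one b128 load per lane
CHUNKS_PER_LINE = CACHE_LINE_BYTES // CHUNK_BYTES
WAVE_SIZE = 64           # assumed subgroup size


def identity(chunk: int) -> int:
    return chunk


def narrow_swizzle(chunk: int) -> int:
    """Permutes chunks only within one cache-line-sized row."""
    row, col = divmod(chunk, CHUNKS_PER_LINE)
    return row * CHUNKS_PER_LINE + (col ^ (row % CHUNKS_PER_LINE))


def wide_swizzle(chunk: int) -> int:
    """Hypothetical swizzle that moves chunks across rows (cache lines)."""
    row, col = divmod(chunk, CHUNKS_PER_LINE)
    return (row ^ col) * CHUNKS_PER_LINE + col


def worst_lines_per_group(source_chunk_of_lane) -> int:
    """Max distinct cache lines touched by any group of CHUNKS_PER_LINE consecutive lanes."""
    worst = 0
    for start in range(0, WAVE_SIZE, CHUNKS_PER_LINE):
        lines = {
            source_chunk_of_lane(lane) * CHUNK_BYTES // CACHE_LINE_BYTES
            for lane in range(start, start + CHUNKS_PER_LINE)
        }
        worst = max(worst, len(lines))
    return worst


results = {f.__name__: worst_lines_per_group(f)
           for f in (identity, narrow_swizzle, wide_swizzle)}
print(results)
```

In this model the narrow swizzle leaves each group of lanes on a single cache line, only in a different order, while the wide one scatters each group across all eight lines. Whether the hardware treats these cases differently is exactly what the HIP experiments would need to confirm.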
