[CK Tile][Critical][Performance] Slow CK Tile GEMM compared to universal_gemm in the old CK

**[Migrated from original issue] ROCm/composable_kernel#1727**

*Original issue author: @zjing14*

I tried the CK Tile GEMM with the V3 pipeline (https://github.com/ROCm/composable_kernel/blob/develop/example/ck_tile/03_gemm/universal_gemm.cpp) on compute-bound cases (e.g., M = 4096, N = 4096, and K = 4096), but got much worse performance than the old CK example (https://github.com/ROCm/composable_kernel/blob/develop/example/01_gemm/gemm_xdl_bf16_v3.cpp) with the same 256x256x64 tile size.

CK Tile V3 (359.483 TFlops)
```
./bin/tile_example_universal_gemm -m=4096 -n=4096 -k=4096 -v=0
Launching kernel with args: grid: {16, 16, 1}, blocks: {256, 1, 1}
Run Gemm kernel with M =4096 N =4096 K =4096 StrideA =4096 StrideB =4096 StrideC =4096 : 0.382324 ms, 359.483 TFlops, 263.293 GB/s,
```
vs. Old CK GEMM V3 (615.46 TFlops)
```
./bin/example_gemm_xdl_bf16_v3 0 2 1 4096 4096 4096 4096 4096 4096 1
a_m_k: dim 2, lengths {4096, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
RotatingMemWrapper: { size_a: 33554432, size_b: 33554432, rotating_count: 4}
Perf: 0.223311 ms, 615.46 TFlops, 450.776 GB/s, DeviceGemmXdlUniversal<Default, RCR> BlkSize: 256, BlkTile: 256x256x64, WaveTile: 32x32, WaveMap: 4x4, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
```
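As a sanity check on the two logs above, here is a minimal sketch that recomputes the reported throughput from the measured times. It assumes the examples count `2 * M * N * K` floating-point operations and 2-byte (bf16) elements for A, B, and C; those assumptions are not stated in the logs, but they reproduce the printed TFlops and GB/s values. The function name `gemm_metrics` is hypothetical.

```python
# Assumption: FLOPs = 2*M*N*K and every element of A, B, C is 2 bytes (bf16).
def gemm_metrics(m, n, k, time_ms, bytes_per_elem=2):
    """Return (TFlops, GB/s) for one GEMM run given its time in milliseconds."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9

M = N = K = 4096

# Times copied from the two logs above.
ck_tile_tflops, ck_tile_gbs = gemm_metrics(M, N, K, 0.382324)
old_ck_tflops, old_ck_gbs = gemm_metrics(M, N, K, 0.223311)

# How much longer the CK Tile kernel takes than the old CK kernel.
slowdown = 0.382324 / 0.223311

# Workgroup count for a 256x256 output tile, matching grid {16, 16, 1}.
workgroups = (M // 256) * (N // 256)

print(f"CK Tile: {ck_tile_tflops:.2f} TFlops, {ck_tile_gbs:.2f} GB/s")
print(f"Old CK:  {old_ck_tflops:.2f} TFlops, {old_ck_gbs:.2f} GB/s")
print(f"Slowdown: {slowdown:.2f}x over {workgroups} workgroups")
```

Under these assumptions both runs do the same amount of work on the same 16x16 grid of workgroups, so the gap comes entirely from kernel time: the CK Tile kernel takes roughly 1.7x as long as the old CK kernel.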


[CK Tile][Critical][Performance] Slow CK Tile GEMM compared to universal_gemm in the old CK #4244
