- Motivation
In the DeepSeek-V3.2 paper, the figure "Inference costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp on H800 clusters" shows DSA overtaking MLA in the prefilling stage once the sequence length exceeds roughly 12K tokens.
However, in my profiling on H100 (PCIe/SXM), I cannot reproduce this crossover point.
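For context, a rough back-of-the-envelope FLOPs model (my own estimate, not from the paper; it assumes top-k = 2048 selected tokens per query, ignores FP8-vs-BF16 throughput differences, and ignores kernel efficiency) suggests why a crossover in this range is plausible:

```python
# Back-of-the-envelope prefill attention FLOPs per query token.
# The causal factor is ignored since it scales both paths equally.
# Dimensions mirror the configs listed under "Key Parameters" below.
H_MLA, D_QK_MLA, D_V_MLA = 128, 192, 192   # dense MLA path (FlashAttention-3)
H_IDX, D_IDX = 64, 128                     # indexer (fp8_mqa_logits)
H_DSA, D_QK_DSA, D_V_DSA = 128, 576, 512   # sparse main-attention path
TOPK = 2048                                # assumption: selected tokens per query

def mla_flops(seq_len: int) -> float:
    # Dense attention: every query scores and aggregates all seq_len keys.
    return 2 * seq_len * H_MLA * (D_QK_MLA + D_V_MLA)

def dsa_flops(seq_len: int) -> float:
    # Indexer scores all keys cheaply; main attention touches only TOPK keys.
    indexer = 2 * seq_len * H_IDX * D_IDX
    sparse = 2 * min(seq_len, TOPK) * H_DSA * (D_QK_DSA + D_V_DSA)
    return indexer + sparse

for L in (4096, 8192, 12288, 16384, 32768):
    print(f"L={L:6d}  MLA={mla_flops(L):.3e}  DSA={dsa_flops(L):.3e}")
```

In raw FLOPs this model crosses over near L ≈ 7K; cheaper FP8 indexer arithmetic and kernel efficiency on H800 plausibly shift the measured point toward ~12K, and the different compute/bandwidth balance of H100 could move it further, which is what I am trying to pin down.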
- Micro-benchmark Implementation
DSA path: deep_gemm.fp8_mqa_logits -> mock_topk_index -> flash_mla_sparse_fwd.
MLA path: standard FlashAttention-3.
Key Parameters:
DSA Config: qk_dim=576, v_dim=512, n_heads_q=128, n_heads_k=1.
Index Config: qk_dim=128, n_heads_q=64, n_heads_k=1.
MLA Config: qk_dim=192, v_dim=192, n_heads_q=128, n_heads_kv=128.
Precision: torch.bfloat16 (main attention) / e4m3 (indexer).
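To make these configs concrete, here is a minimal sketch of the tensor shapes they imply and of how the mock top-k index is built in pure PyTorch. The sequence lengths and the top-k value of 2048 are placeholders of mine, and the random logits stand in for the real deep_gemm.fp8_mqa_logits output; the full script linked below drives the actual kernels:

```python
import torch

s_q = s_kv = 8192   # placeholder sequence lengths
topk = 2048         # placeholder top-k (selected tokens per query)

# DSA main-attention inputs (qk_dim=576, v_dim=512, 128 q-heads, 1 kv-head):
q = torch.randn(s_q, 128, 576, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(s_kv, 1, 576, dtype=torch.bfloat16, device="cuda")

# The indexer (qk_dim=128, 64 heads) reduces to one logit per (query, key)
# pair; random logits stand in for deep_gemm.fp8_mqa_logits output here.
logits = torch.randn(s_q, s_kv, dtype=torch.float32, device="cuda")

# Mock top-k index: per query, the indices of its top-k highest-scoring keys.
mock_topk_index = logits.topk(topk, dim=-1).indices.to(torch.int32)  # [s_q, topk]
```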
Benchmark script: https://github.com/ZavierXing/FlashMLA/blob/bench/benchmark/bench_dsa.py
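The measurement methodology boils down to CUDA-event timing with warmup around each path. A generic sketch of that harness (the `bench` helper and the `dsa_path`/`mla_path` callables are stand-ins of mine, not APIs from the script or the libraries):

```python
import torch

def bench(run_path, *args, warmup: int = 10, iters: int = 50) -> float:
    """Return the mean latency of run_path(*args) in milliseconds."""
    for _ in range(warmup):
        run_path(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_path(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Usage (hypothetical wrappers around the kernels named above):
#   dsa_ms = bench(lambda: dsa_path(q, kv, mock_topk_index))
#   mla_ms = bench(lambda: mla_path(q_mla, k_mla, v_mla))
```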
- Environment
GPU: NVIDIA H100 NVL
CUDA Version: 12.6
PyTorch Version: 2.6.0
tilelang Version: 0.1.7