- Motivation
In the DeepSeek-V3.2 paper, the figure "Inference costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp on H800 clusters" shows DSA overtaking MLA in the prefilling stage once the sequence length exceeds roughly 12K tokens.
However, in my profiling on H100 (PCIe/SXM), I cannot reproduce this crossover point.
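For context, a rough back-of-the-envelope FLOPs model (my own estimate, not from the paper; it assumes top-k = 2048 selected tokens per query, ignores FP8-vs-BF16 throughput differences, and ignores kernel efficiency) suggests why a crossover in this range is plausible:

```python
# Back-of-the-envelope prefill attention FLOPs per query token.
# The causal factor is ignored since it scales both paths equally.
# Dimensions mirror the configs listed under "Key Parameters" below.
H_MLA, D_QK_MLA, D_V_MLA = 128, 192, 192   # dense MLA path (FlashAttention-3)
H_IDX, D_IDX = 64, 128                     # indexer (fp8_mqa_logits)
H_DSA, D_QK_DSA, D_V_DSA = 128, 576, 512   # sparse main-attention path
TOPK = 2048                                # assumption: selected tokens per query

def mla_flops(seq_len: int) -> float:
    # Dense attention: every query scores and aggregates all seq_len keys.
    return 2 * seq_len * H_MLA * (D_QK_MLA + D_V_MLA)

def dsa_flops(seq_len: int) -> float:
    # Indexer scores all keys cheaply; main attention touches only TOPK keys.
    indexer = 2 * seq_len * H_IDX * D_IDX
    sparse = 2 * min(seq_len, TOPK) * H_DSA * (D_QK_DSA + D_V_DSA)
    return indexer + sparse

for L in (4096, 8192, 12288, 16384, 32768):
    print(f"L={L:6d}  MLA={mla_flops(L):.3e}  DSA={dsa_flops(L):.3e}")
```

In raw FLOPs this model crosses over near L ≈ 7K; cheaper FP8 indexer arithmetic and kernel efficiency on H800 plausibly shift the measured point toward ~12K, and the different compute/bandwidth balance of H100 could move it further, which is what I am trying to pin down.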
- Micro-benchmark Implementation
DSA path: deep_gemm.fp8_mqa_logits -> mock_topk_index -> flash_mla_sparse_fwd.
MLA path: standard FlashAttention-3.
Key Parameters:
DSA Config: qk_dim=576, v_dim=512, n_heads_q=128, n_heads_k=1.
Index Config: qk_dim=128, n_heads_q=64, n_heads_k=1.
MLA Config: qk_dim=192, v_dim=192, n_heads_q=128, n_heads_kv=128.
Precision: torch.bfloat16 (main attention) / e4m3 (indexer).
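To make these configs concrete, here is a minimal sketch of the tensor shapes they imply and of how the mock top-k index is built in pure PyTorch. The sequence lengths and the top-k value of 2048 are placeholders of mine, and the random logits stand in for the real deep_gemm.fp8_mqa_logits output; the full script linked below drives the actual kernels:

```python
import torch

s_q = s_kv = 8192   # placeholder sequence lengths
topk = 2048         # placeholder top-k (selected tokens per query)

# DSA main-attention inputs (qk_dim=576, v_dim=512, 128 q-heads, 1 kv-head):
q = torch.randn(s_q, 128, 576, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(s_kv, 1, 576, dtype=torch.bfloat16, device="cuda")

# The indexer (qk_dim=128, 64 heads) reduces to one logit per (query, key)
# pair; random logits stand in for deep_gemm.fp8_mqa_logits output here.
logits = torch.randn(s_q, s_kv, dtype=torch.float32, device="cuda")

# Mock top-k index: per query, the indices of its top-k highest-scoring keys.
mock_topk_index = logits.topk(topk, dim=-1).indices.to(torch.int32)  # [s_q, topk]
```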
Benchmark script: https://github.com/ZavierXing/FlashMLA/blob/bench/benchmark/bench_dsa.py
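The measurement methodology boils down to CUDA-event timing with warmup around each path. A generic sketch of that harness (the `bench` helper and the `dsa_path`/`mla_path` callables are stand-ins of mine, not APIs from the script or the libraries):

```python
import torch

def bench(run_path, *args, warmup: int = 10, iters: int = 50) -> float:
    """Return the mean latency of run_path(*args) in milliseconds."""
    for _ in range(warmup):
        run_path(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_path(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Usage (hypothetical wrappers around the kernels named above):
#   dsa_ms = bench(lambda: dsa_path(q, kv, mock_topk_index))
#   mla_ms = bench(lambda: mla_path(q_mla, k_mla, v_mla))
```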
- Environment
GPU: NVIDIA H100 NVL
CUDA Version: 12.6
PyTorch Version: 2.6.0
tilelang Version: 0.1.7