
[Question] DSA VS MLA Prefill Benchmark On H100 #149

@ZavierXing
  1. Motivation

In the DeepSeek-V3.2 paper, the figure "Inference costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp on H800 clusters" shows that DSA outperforms MLA in the prefilling stage once the sequence length exceeds roughly 12K tokens.

However, in my profiling on H100 (PCIe/SXM), I cannot reproduce this crossover point.

[benchmark result plots attached]
  2. Micro-benchmark Implementation

  • DSA path: deep_gemm.fp8_mqa_logits -> mock_topk_index -> flash_mla_sparse_fwd (top-k selection sketched below)
  • MLA path: standard FlashAttention-3 forward
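
For clarity, here is a minimal sketch of what the mock_topk_index step does. This is an illustration rather than the code from bench_dsa.py; the `[seq_q, seq_k]` logits layout, the causal masking, and the `topk` argument are assumptions:

```python
import torch

# Illustration only (not the bench_dsa.py implementation): build a mock
# top-k index from the indexer logits so the sparse attention kernel has a
# [seq_q, k] table of key positions to gather.
def mock_topk_index(logits: torch.Tensor, topk: int) -> torch.Tensor:
    """logits: [seq_q, seq_k] index scores (e.g. the fp8_mqa_logits output).
    Returns int32 indices of the top-k keys per query, shape [seq_q, k]."""
    seq_q, seq_k = logits.shape
    # Keep the selection causal: a query may only attend to keys at or
    # before its own position.
    future = torch.arange(seq_k, device=logits.device)[None, :] > \
             torch.arange(seq_q, device=logits.device)[:, None]
    logits = logits.masked_fill(future, float("-inf"))
    # Clamp k to the key length so topk never over-requests.
    k = min(topk, seq_k)
    return logits.topk(k, dim=-1).indices.to(torch.int32)
```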

Key Parameters:
  • DSA config: qk_dim=576, v_dim=512, n_heads_q=128, n_heads_k=1
  • Index config: qk_dim=128, n_heads_q=64, n_heads_k=1
  • MLA config: qk_dim=192, v_dim=192, n_heads_q=128, n_heads_kv=128
  • Precision: torch.bfloat16 (main attention) / FP8 e4m3 (indexer)

Benchmark script: https://github.com/ZavierXing/FlashMLA/blob/bench/benchmark/bench_dsa.py
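
For reference, the per-sequence-length timings are collected with a CUDA-event loop along the following lines. This is only a rough sketch; `run_dsa_prefill` and `run_mla_prefill` are hypothetical stand-ins for the DSA pipeline and the FlashAttention-3 call in the script above:

```python
import torch

def bench(fn, warmup: int = 10, iters: int = 50) -> float:
    """Time a CUDA workload with events; returns the mean latency in ms."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Hypothetical usage: sweep sequence lengths around the expected ~12K crossover.
# for seq_len in (4096, 8192, 16384, 32768, 65536):
#     t_dsa = bench(lambda: run_dsa_prefill(seq_len))   # fp8_mqa_logits -> top-k -> sparse MLA
#     t_mla = bench(lambda: run_mla_prefill(seq_len))   # dense FlashAttention-3
#     print(f"seq_len={seq_len:>6d}  DSA {t_dsa:.2f} ms  MLA {t_mla:.2f} ms")
```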

Environment:
  • GPU: NVIDIA H100 NVL
  • CUDA Version: 12.6
  • PyTorch Version: 2.6.0
  • tilelang Version: 0.1.7
