Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Updated Jun 11, 2025 - C++
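To make the decode-stage setting concrete, the sketch below shows roughly what a single-query attention kernel looks like: at each decoding step every head contributes one query vector that attends over the entire KV cache, with GQA/MQA sharing KV heads across query heads, accumulated via an online softmax. This is a minimal illustrative kernel, not the library's implementation; the tensor layouts, a head dimension of 128, and the one-thread-per-channel mapping are assumptions made for clarity.

```cuda
// Minimal decode-stage attention kernel (single query token per head), sketched
// under assumed layouts; this is NOT the Decoding Attention library's code.
// Grid: one block per (batch, query head). Block: one thread per head_dim channel.
#include <cuda_runtime.h>
#include <math.h>

constexpr int HEAD_DIM = 128;        // assumed head dimension
constexpr int THREADS  = HEAD_DIM;

__global__ void decode_attention_kernel(
    const float* __restrict__ q,        // [batch, num_q_heads, HEAD_DIM]
    const float* __restrict__ k_cache,  // [batch, seq_len, num_kv_heads, HEAD_DIM]
    const float* __restrict__ v_cache,  // [batch, seq_len, num_kv_heads, HEAD_DIM]
    float* __restrict__ out,            // [batch, num_q_heads, HEAD_DIM]
    int seq_len, int num_q_heads, int num_kv_heads)
{
    const int b    = blockIdx.x;
    const int h    = blockIdx.y;                           // query head
    const int kv_h = h / (num_q_heads / num_kv_heads);     // GQA/MQA: map query head -> KV head
    const int d    = threadIdx.x;                          // channel owned by this thread

    __shared__ float q_s[HEAD_DIM];
    __shared__ float red[THREADS];
    __shared__ float score;                                // broadcast q·k_t for the current position

    q_s[d] = q[(b * num_q_heads + h) * HEAD_DIM + d];
    __syncthreads();

    const float scale = rsqrtf((float)HEAD_DIM);
    float m = -INFINITY, l = 0.0f, acc = 0.0f;             // online-softmax running state

    for (int t = 0; t < seq_len; ++t) {
        const float* k = k_cache + ((size_t)(b * seq_len + t) * num_kv_heads + kv_h) * HEAD_DIM;
        const float* v = v_cache + ((size_t)(b * seq_len + t) * num_kv_heads + kv_h) * HEAD_DIM;

        // Block-wide reduction of per-channel products -> dot(q, k_t).
        red[d] = q_s[d] * k[d];
        __syncthreads();
        for (int s = THREADS / 2; s > 0; s >>= 1) {
            if (d < s) red[d] += red[d + s];
            __syncthreads();
        }
        if (d == 0) score = red[0] * scale;
        __syncthreads();

        // Online softmax update: numerically stable, no second pass over the cache.
        const float m_new = fmaxf(m, score);
        const float corr  = expf(m - m_new);
        const float p     = expf(score - m_new);
        l   = l * corr + p;
        acc = acc * corr + p * v[d];
        m   = m_new;
        __syncthreads();                                   // score/red are reused next iteration
    }

    out[(b * num_q_heads + h) * HEAD_DIM + d] = acc / l;
}

// Launch sketch: dim3 grid(batch, num_q_heads);
// decode_attention_kernel<<<grid, THREADS>>>(q, k, v, out, seq_len, num_q_heads, num_kv_heads);
```

Because each decoded token must stream the whole KV cache through the kernel, decode-stage attention is typically memory-bandwidth bound, which is why libraries under this topic concentrate on cache layout, GQA/MQA head sharing, and MLA's compressed latent cache rather than raw compute.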
🚀 Accelerate attention mechanisms with FlashMLA, a library of optimized sparse and dense attention kernels for DeepSeek models, improving performance and efficiency.