Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Updated Jun 11, 2025 - C++
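To make the decode-stage setting concrete, the sketch below shows roughly what a single-query attention kernel looks like: at each decoding step every head contributes one query vector that attends over the entire KV cache, with GQA/MQA sharing KV heads across query heads, accumulated via an online softmax. This is a minimal illustrative kernel, not the library's implementation; the tensor layouts, a head dimension of 128, and the one-thread-per-channel mapping are assumptions made for clarity.

```cuda
// Minimal decode-stage attention kernel (single query token per head), sketched
// under assumed layouts; this is NOT the Decoding Attention library's code.
// Grid: one block per (batch, query head). Block: one thread per head_dim channel.
#include <cuda_runtime.h>
#include <math.h>

constexpr int HEAD_DIM = 128;        // assumed head dimension
constexpr int THREADS  = HEAD_DIM;

__global__ void decode_attention_kernel(
    const float* __restrict__ q,        // [batch, num_q_heads, HEAD_DIM]
    const float* __restrict__ k_cache,  // [batch, seq_len, num_kv_heads, HEAD_DIM]
    const float* __restrict__ v_cache,  // [batch, seq_len, num_kv_heads, HEAD_DIM]
    float* __restrict__ out,            // [batch, num_q_heads, HEAD_DIM]
    int seq_len, int num_q_heads, int num_kv_heads)
{
    const int b    = blockIdx.x;
    const int h    = blockIdx.y;                           // query head
    const int kv_h = h / (num_q_heads / num_kv_heads);     // GQA/MQA: map query head -> KV head
    const int d    = threadIdx.x;                          // channel owned by this thread

    __shared__ float q_s[HEAD_DIM];
    __shared__ float red[THREADS];
    __shared__ float score;                                // broadcast q·k_t for the current position

    q_s[d] = q[(b * num_q_heads + h) * HEAD_DIM + d];
    __syncthreads();

    const float scale = rsqrtf((float)HEAD_DIM);
    float m = -INFINITY, l = 0.0f, acc = 0.0f;             // online-softmax running state

    for (int t = 0; t < seq_len; ++t) {
        const float* k = k_cache + ((size_t)(b * seq_len + t) * num_kv_heads + kv_h) * HEAD_DIM;
        const float* v = v_cache + ((size_t)(b * seq_len + t) * num_kv_heads + kv_h) * HEAD_DIM;

        // Block-wide reduction of per-channel products -> dot(q, k_t).
        red[d] = q_s[d] * k[d];
        __syncthreads();
        for (int s = THREADS / 2; s > 0; s >>= 1) {
            if (d < s) red[d] += red[d + s];
            __syncthreads();
        }
        if (d == 0) score = red[0] * scale;
        __syncthreads();

        // Online softmax update: numerically stable, no second pass over the cache.
        const float m_new = fmaxf(m, score);
        const float corr  = expf(m - m_new);
        const float p     = expf(score - m_new);
        l   = l * corr + p;
        acc = acc * corr + p * v[d];
        m   = m_new;
        __syncthreads();                                   // score/red are reused next iteration
    }

    out[(b * num_q_heads + h) * HEAD_DIM + d] = acc / l;
}

// Launch sketch: dim3 grid(batch, num_q_heads);
// decode_attention_kernel<<<grid, THREADS>>>(q, k, v, out, seq_len, num_q_heads, num_kv_heads);
```

Because each decoded token must stream the whole KV cache through the kernel, decode-stage attention is typically memory-bandwidth bound, which is why libraries under this topic concentrate on cache layout, GQA/MQA head sharing, and MLA's compressed latent cache rather than raw compute.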
🚀 Accelerate attention mechanisms with FlashMLA, a library of optimized sparse and dense attention kernels for DeepSeek models, improving performance and efficiency.