v0.2.20: Fused NN Kernels + Flash Attention 3 SM120

@m96-chan released this 26 Jan 16:06 · 224e6bb

Highlights

Fused NN Kernels

High-performance fused kernels delivering 2-14x speedups (a reference sketch of the fused math follows the table below):

  • SwiGLU: silu(gate) * up (used in the FFN of Qwen, LLaMA 3, and Mistral)
  • RMSNorm+Residual: rmsnorm(x + residual) * gamma
  • GeGLU: gelu(gate) * up
| Kernel           | Batch | Speedup |
|------------------|-------|---------|
| SwiGLU           | 32    | 14.25x  |
| RMSNorm+Residual | 128   | 12.37x  |
| GeGLU            | 32    | 13.10x  |
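
To make the fused formulas concrete, here is a minimal elementwise reference for the SwiGLU case (the GeGLU variant only swaps silu for gelu). The kernel name and launch shape are illustrative, not the library's actual API; the point of fusion is that the activation and multiply happen in one pass, without materializing silu(gate) in global memory first.

```cuda
// Reference for the fused SwiGLU op: out = silu(gate) * up.
// Names and launch configuration are illustrative only.
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float silu(float x) {
    return x / (1.0f + expf(-x));  // silu(x) = x * sigmoid(x)
}

__global__ void swiglu_ref(const float* gate, const float* up, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // single pass: activation and multiply fused, no intermediate buffer
        out[i] = silu(gate[i]) * up[i];
    }
}

// launch example: swiglu_ref<<<(n + 255) / 256, 256>>>(gate, up, out, n);
```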

Flash Attention 3 SM120

  • TMA (Tensor Memory Accelerator) for efficient global memory access
  • Producer/consumer warp architecture for overlapped compute and memory ops
  • Tunable configurations for different sequence lengths (a rough sketch follows below)
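
As a rough illustration of the tunable-configuration point: a kernel of this kind typically selects tile sizes, pipeline depth, and the producer/consumer warp split based on sequence length. The struct fields, thresholds, and values below are hypothetical, not the library's actual configuration interface.

```cuda
// Hypothetical tile-configuration selection keyed on sequence length.
// Field names, thresholds, and values are illustrative, not the real API.
struct FlashAttnConfig {
    int block_m;         // query rows per CTA tile
    int block_n;         // key/value columns per inner iteration
    int num_stages;      // software-pipeline depth for TMA loads
    int producer_warps;  // warps issuing TMA copies into shared memory
    int consumer_warps;  // warps running the MMA compute
};

__host__ __device__ inline FlashAttnConfig pick_config(int seq_len) {
    if (seq_len <= 512)  return {64, 64, 2, 1, 3};    // short: small tiles, shallow pipeline
    if (seq_len <= 4096) return {128, 64, 3, 1, 3};   // medium: deeper pipeline
    return {128, 128, 4, 2, 6};                       // long: larger tiles, more consumer warps
}
```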

FP8 Block-Scale MMA

  • Native PTX mma.sync for FP8 with per-block scaling (a numerical reference follows below)
  • Enables future W8A8 quantized inference paths
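
For the scaling math only, a plain reference (no PTX, no FP8 types) might look like the sketch below: each block along K has its own scale for A and for B, applied once to that block's partial sum. The block size of 128, the layouts, and the names are assumptions for illustration; the actual kernel performs the inner products with mma.sync on FP8 operands.

```cuda
// Numerical reference for block-scaled matmul. Quantized values are shown
// already decoded to float for clarity; K is assumed to be a multiple of KBLOCK.
#include <cuda_runtime.h>

constexpr int KBLOCK = 128;  // assumed scaling granularity along K

__global__ void block_scaled_gemm_ref(
    const float* a_q,      // [M, K] quantized A values
    const float* b_q,      // [K, N] quantized B values
    const float* scale_a,  // [M, K/KBLOCK] per-block scales for A rows
    const float* scale_b,  // [K/KBLOCK, N] per-block scales for B columns
    float* c, int M, int N, int K)
{
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;

    float acc = 0.0f;
    for (int kb = 0; kb < K; kb += KBLOCK) {
        float partial = 0.0f;
        for (int k = kb; k < kb + KBLOCK && k < K; ++k) {
            partial += a_q[m * K + k] * b_q[k * N + n];  // raw quantized product
        }
        // apply the per-block scales once per K block
        acc += partial * scale_a[m * (K / KBLOCK) + kb / KBLOCK]
                       * scale_b[(kb / KBLOCK) * N + n];
    }
    c[m * N + n] = acc;
}
```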

Other Improvements

  • Native Conv1d CUDA kernel
  • LLaMA 4 native CUDA kernels
  • Llama Guard 3 content safety classifier example

Full Changelog: v0.2.19...v0.2.20