## Highlights

### Fused NN Kernels
High-performance fused kernels with 2-14x speedup:
- SwiGLU: `silu(gate) * up` (used in Qwen, LLaMA3, Mistral FFN; sketched below)
- RMSNorm+Residual: `rmsnorm(x + residual) * gamma`
- GeGLU: `gelu(gate) * up`
| Kernel | Batch Size | Speedup |
|---|---|---|
| SwiGLU | 32 | 14.25x |
| RMSNorm+Residual | 128 | 12.37x |
| GeGLU | 32 | 13.10x |
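For reference, a minimal sketch of what the SwiGLU fusion looks like (the kernel name and launch configuration are illustrative, not this library's API): the gain comes from reading `gate` and `up` once and writing `out` once, instead of launching separate activation and multiply kernels with an intermediate tensor in between.

```cuda
#include <cuda_runtime.h>

// Fused SwiGLU sketch: out[i] = silu(gate[i]) * up[i] in a single pass,
// avoiding the intermediate tensor a separate silu-then-multiply would write.
__global__ void swiglu_fused(const float* __restrict__ gate,
                             const float* __restrict__ up,
                             float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = gate[i];
        float silu = g / (1.0f + __expf(-g));  // silu(x) = x * sigmoid(x)
        out[i] = silu * up[i];
    }
}

// Launch, one thread per element:
// swiglu_fused<<<(n + 255) / 256, 256>>>(gate, up, out, n);
```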
### Flash Attention 3 SM120
- TMA (Tensor Memory Accelerator) for efficient global memory access
- Producer/consumer warp architecture for overlapped compute and memory ops
- Tunable configurations for different sequence lengths
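TMA itself needs tensor-map descriptors and Hopper/Blackwell-class hardware, so as a simplified stand-in, the sketch below (hypothetical kernel, using libcu++'s `cuda::pipeline` with `cp.async`-backed `cuda::memcpy_async`) shows the same overlap idea: the copy for the next tile is in flight while threads compute on the current one. The real FA3 kernel additionally dedicates separate warps to the producer and consumer roles.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Two-stage shared-memory pipeline: while the block computes on one tile,
// the async copy for the next tile is already in flight.
// Launch with kStages * tile_elems * sizeof(float) dynamic shared memory.
__global__ void pipelined_tile_sum(const float* __restrict__ in, float* out,
                                   int num_tiles, int tile_elems) {
    constexpr int kStages = 2;
    extern __shared__ float smem[];  // kStages * tile_elems floats
    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    float acc = 0.0f;
    for (int compute = 0, fetch = 0; compute < num_tiles; ++compute) {
        // Producer side: keep up to kStages async tile copies in flight.
        for (; fetch < num_tiles && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem + (fetch % kStages) * tile_elems,
                               in + (size_t)fetch * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the current tile, then accumulate it.
        pipe.consumer_wait();
        const float* tile = smem + (compute % kStages) * tile_elems;
        for (int i = block.thread_rank(); i < tile_elems; i += block.size())
            acc += tile[i];
        pipe.consumer_release();
    }
    atomicAdd(out, acc);  // combine per-thread partial sums
}
```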
### FP8 Block-Scale MMA
- Native PTX `mma.sync` for FP8 with per-block scaling
- Enables future W8A8 quantized inference paths
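As a rough sketch of the primitive involved (the helper name and fragment handling are illustrative; the instruction shape follows the PTX ISA's `m16n8k32` FP8 layout, available on sm_89 and newer): with per-block scaling, each 32-deep K slice is multiplied with a zero accumulator, and the result is scaled by that block's factors before being folded into the running sum.

```cuda
#include <cstdint>

// One warp-wide e4m3 x e4m3 -> f32 MMA tile with per-block scales applied
// afterwards. Fragment packing follows the PTX m16n8k32 layout: each thread
// holds 4 x .b32 of A, 2 x .b32 of B, and 4 floats of the accumulator.
// Compile with -arch=sm_89 or newer.
__device__ void mma_fp8_block_scaled(const uint32_t a[4], const uint32_t b[2],
                                     float acc[4],
                                     float scale_a, float scale_b) {
    // Compute the unscaled partial product with a zero C fragment, then
    // scale and accumulate. Scales live per K-block, so they are applied
    // between MMAs rather than baked into the packed FP8 data.
    float d0, d1, d2, d3;
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(0.f), "f"(0.f), "f"(0.f), "f"(0.f));
    float s = scale_a * scale_b;
    acc[0] += d0 * s; acc[1] += d1 * s; acc[2] += d2 * s; acc[3] += d3 * s;
}
```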
### Other Improvements
- Native Conv1d CUDA kernel
- LLaMA 4 native CUDA kernels
- Llama Guard 3 content safety classifier example
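For orientation only, a direct Conv1d reduces to the sketch below (single channel, unit stride, valid padding; hypothetical names, far simpler than a production kernel with batching and multi-channel support):

```cuda
// Naive direct 1-D convolution: one output element per thread.
// y has n - k + 1 elements for an input of length n and filter of length k.
__global__ void conv1d_naive(const float* __restrict__ x,  // input  [n]
                             const float* __restrict__ w,  // filter [k]
                             float* __restrict__ y,        // output [n - k + 1]
                             int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n - k) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)
            acc += x[i + j] * w[j];
        y[i] = acc;
    }
}
```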
**Full Changelog**: v0.2.19...v0.2.20