## Highlights

### Fused NN Kernels
High-performance fused kernels with 2-14x speedup:
- SwiGLU: `silu(gate) * up` (used in Qwen, LLaMA3, Mistral FFN; sketched below)
- RMSNorm+Residual: `rmsnorm(x + residual) * gamma`
- GeGLU: `gelu(gate) * up`
| Kernel | Batch Size | Speedup |
|---|---|---|
| SwiGLU | 32 | 14.25x |
| RMSNorm+Residual | 128 | 12.37x |
| GeGLU | 32 | 13.10x |
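For reference, a minimal sketch of what the SwiGLU fusion looks like (the kernel name and launch configuration are illustrative, not this library's API): the gain comes from reading `gate` and `up` once and writing `out` once, instead of launching separate activation and multiply kernels with an intermediate tensor in between.

```cuda
#include <cuda_runtime.h>

// Fused SwiGLU sketch: out[i] = silu(gate[i]) * up[i] in a single pass,
// avoiding the intermediate tensor a separate silu-then-multiply would write.
__global__ void swiglu_fused(const float* __restrict__ gate,
                             const float* __restrict__ up,
                             float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = gate[i];
        float silu = g / (1.0f + __expf(-g));  // silu(x) = x * sigmoid(x)
        out[i] = silu * up[i];
    }
}

// Launch, one thread per element:
// swiglu_fused<<<(n + 255) / 256, 256>>>(gate, up, out, n);
```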
### Flash Attention 3 SM120
- TMA (Tensor Memory Accelerator) for efficient global memory access
- Producer/consumer warp architecture for overlapped compute and memory ops
- Tunable configurations for different sequence lengths
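TMA itself needs tensor-map descriptors and Hopper/Blackwell-class hardware, so as a simplified stand-in, the sketch below (hypothetical kernel, using libcu++'s `cuda::pipeline` with `cp.async`-backed `cuda::memcpy_async`) shows the same overlap idea: the copy for the next tile is in flight while threads compute on the current one. The real FA3 kernel additionally dedicates separate warps to the producer and consumer roles.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Two-stage shared-memory pipeline: while the block computes on one tile,
// the async copy for the next tile is already in flight.
// Launch with kStages * tile_elems * sizeof(float) dynamic shared memory.
__global__ void pipelined_tile_sum(const float* __restrict__ in, float* out,
                                   int num_tiles, int tile_elems) {
    constexpr int kStages = 2;
    extern __shared__ float smem[];  // kStages * tile_elems floats
    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    float acc = 0.0f;
    for (int compute = 0, fetch = 0; compute < num_tiles; ++compute) {
        // Producer side: keep up to kStages async tile copies in flight.
        for (; fetch < num_tiles && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem + (fetch % kStages) * tile_elems,
                               in + (size_t)fetch * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the current tile, then accumulate it.
        pipe.consumer_wait();
        const float* tile = smem + (compute % kStages) * tile_elems;
        for (int i = block.thread_rank(); i < tile_elems; i += block.size())
            acc += tile[i];
        pipe.consumer_release();
    }
    atomicAdd(out, acc);  // combine per-thread partial sums
}
```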
### FP8 Block-Scale MMA
- Native PTX `mma.sync` for FP8 with per-block scaling
- Enables future W8A8 quantized inference paths
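As a rough sketch of the primitive involved (the helper name and fragment handling are illustrative; the instruction shape follows the PTX ISA's `m16n8k32` FP8 layout, available on sm_89 and newer): with per-block scaling, each 32-deep K slice is multiplied with a zero accumulator, and the result is scaled by that block's factors before being folded into the running sum.

```cuda
#include <cstdint>

// One warp-wide e4m3 x e4m3 -> f32 MMA tile with per-block scales applied
// afterwards. Fragment packing follows the PTX m16n8k32 layout: each thread
// holds 4 x .b32 of A, 2 x .b32 of B, and 4 floats of the accumulator.
// Compile with -arch=sm_89 or newer.
__device__ void mma_fp8_block_scaled(const uint32_t a[4], const uint32_t b[2],
                                     float acc[4],
                                     float scale_a, float scale_b) {
    // Compute the unscaled partial product with a zero C fragment, then
    // scale and accumulate. Scales live per K-block, so they are applied
    // between MMAs rather than baked into the packed FP8 data.
    float d0, d1, d2, d3;
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(0.f), "f"(0.f), "f"(0.f), "f"(0.f));
    float s = scale_a * scale_b;
    acc[0] += d0 * s; acc[1] += d1 * s; acc[2] += d2 * s; acc[3] += d3 * s;
}
```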
### Other Improvements
- Native Conv1d CUDA kernel
- LLaMA 4 native CUDA kernels
- Llama Guard 3 content safety classifier example
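For orientation only, a direct Conv1d reduces to the sketch below (single channel, unit stride, valid padding; hypothetical names, far simpler than a production kernel with batching and multi-channel support):

```cuda
// Naive direct 1-D convolution: one output element per thread.
// y has n - k + 1 elements for an input of length n and filter of length k.
__global__ void conv1d_naive(const float* __restrict__ x,  // input  [n]
                             const float* __restrict__ w,  // filter [k]
                             float* __restrict__ y,        // output [n - k + 1]
                             int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n - k) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)
            acc += x[i + j] * w[j];
        y[i] = acc;
    }
}
```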
**Full Changelog**: v0.2.19...v0.2.20