v0.3.0: Fast Triton Kernels
This release adds Triton kernels for all optimi optimizers and sets them as the default. optimi's vertically fused Triton kernels are faster than PyTorch's vertically and horizontally fused CUDA kernels and are nearly as fast as compiled optimizers.
Full Changelog: v0.2.1...v0.3.0