Releases: warner-benjamin/optimi

v0.3.3

26 Dec 22:35

Highlights

  • Add to_low_precision
  • StableAdamW Triton optimizers can load from non-Triton checkpoints

Full Changelog: v0.3.2...v0.3.3

v0.3.2

29 Jul 06:02

Full Changelog: v0.3.1...v0.3.2

v0.3.1

25 Jul 21:17

  • Deprecate foreach optimizers
  • Fix StableAdamW Triton gradient release

Full Changelog: v0.3.0...v0.3.1

v0.3.0: Fast Triton Kernels

15 Jul 20:36

This release adds Triton kernels for all optimi optimizers and sets them as the default. optimi's vertically fused Triton kernels are faster than PyTorch's vertically and horizontally fused CUDA kernels and are nearly as fast as compiled optimizers.
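
Because the Triton kernels are the default, no code changes should be required. A minimal sketch, assuming automatic Triton kernel selection on a supported GPU as described above; the constructor arguments shown are the standard optimi ones:

```python
# Sketch: after v0.3.0, constructing an optimi optimizer as before should use
# the fused Triton kernels by default on a supported GPU (assumption based on
# this release note; no new arguments are shown because none are required).
import torch
from torch import nn
from optimi import StableAdamW

model = nn.Linear(256, 256, device="cuda")
opt = StableAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x = torch.randn(32, 256, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()       # should dispatch to the vertically fused Triton kernel path
opt.zero_grad()
```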

Full Changelog: v0.2.1...v0.3.0

v0.2.1: param_groups_weight_decay

06 Jun 00:25

Adds param_groups_weight_decay, which excludes bias and normalization layers from weight decay. param_groups_weight_decay is lightly modified from PyTorch Image Models.
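
A hedged usage sketch; it assumes param_groups_weight_decay is importable from the optimi package and that weight_decay is the keyword argument name:

```python
# Sketch: build parameter groups that exclude bias and normalization layers
# from weight decay, then hand them to an optimi optimizer. The weight_decay
# keyword is an assumption about the exact signature.
import torch
from torch import nn
from optimi import AdamW, param_groups_weight_decay

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 10))

# Two groups are returned: decayed parameters, and non-decayed parameters
# (biases and normalization weights) with weight decay set to zero.
params = param_groups_weight_decay(model, weight_decay=1e-2)
opt = AdamW(params, lr=1e-3)
```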

Full Changelog: v0.2.0...v0.2.1

v0.2.0: Gradient Release & Optimizer Accumulation

11 Mar 04:13

  • Add Gradient Release
  • Add Optimizer Accumulation (both are sketched below)
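
A hedged sketch of both features, based on my reading of the optimi documentation; the gradient_release keyword, the prepare_for_gradient_release / remove_gradient_release helpers, and the optimizer_accumulation attribute are assumptions about the exact API:

```python
# Sketch: gradient release steps the optimizer parameter by parameter during
# the backward pass and frees gradients immediately; optimizer accumulation
# approximates gradient accumulation by accumulating updates directly into
# the optimizer states. API names here are assumptions, not confirmed.
import torch
from torch import nn
import torch.nn.functional as F
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 10)).cuda()
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)

# register the hooks that run the optimizer step during backward
prepare_for_gradient_release(model, opt)

accumulation_steps = 4
data = [(torch.randn(8, 128, device="cuda"),
         torch.randint(0, 10, (8,), device="cuda")) for _ in range(8)]

for step, (x, y) in enumerate(data):
    # while True, updates are accumulated into the optimizer states instead of
    # being fully applied; the "real" update happens when the flag is False
    opt.optimizer_accumulation = (step + 1) % accumulation_steps != 0

    loss = F.cross_entropy(model(x), y)
    # parameters update and gradients are freed during backward, so explicit
    # opt.step() and opt.zero_grad() calls are assumed to be unnecessary here
    loss.backward()

# remove the gradient release hooks when finished training
remove_gradient_release(model)
```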

Full Changelog: v0.1.2...v0.2.0

v0.1.2

18 Dec 16:05

Add RAdam and Ranger optimizers.
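
Both should be drop-in additions using the usual optimi constructor; a minimal sketch, with default hyperparameters assumed:

```python
# Sketch: the new optimizers follow the same constructor pattern as the other
# optimi optimizers.
from torch import nn
from optimi import RAdam, Ranger

model = nn.Linear(32, 32)
opt = RAdam(model.parameters(), lr=1e-3)   # rectified Adam
opt = Ranger(model.parameters(), lr=1e-3)  # RAdam combined with Lookahead
```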

Full Changelog: v0.1.1...v0.1.2

v0.1.1: Initial Release

19 Nov 01:56

optimī

Fast, Modern, and Low Precision PyTorch Optimizers

optimi enables accurate low precision training via Kahan summation, supports fully decoupled weight decay, and features fast implementations of modern optimizers.

Low Precision Training with Kahan Summation

optimi optimizers can match the performance of mixed precision when training in BFloat16 by using Kahan summation.

Training in BFloat16 with Kahan summation can reduce non-activation training memory usage by 37.5 to 45.5 percent when using an Adam optimizer. BFloat16 training increases single GPU training speed by ~10 percent at the same batch size.
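
A minimal sketch of BFloat16 training with an optimi optimizer, assuming kahan_sum is the explicit toggle and that it defaults to on for low precision parameters:

```python
# Sketch: train a BFloat16 model with Kahan summation. The kahan_sum keyword
# and its low-precision default are assumptions based on the docs.
import torch
from torch import nn
from optimi import AdamW

model = nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)

# kahan_sum should default to True for low precision parameters; it is shown
# explicitly here for clarity
opt = AdamW(model.parameters(), lr=1e-3, kahan_sum=True)

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```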

Fully Decoupled Weight Decay

In addition to supporting PyTorch-style decoupled weight decay, optimi optimizers also support fully decoupled weight decay.

Fully decoupled weight decay decouples weight decay from the learning rate, more accurately following Decoupled Weight Decay Regularization. This can help simplify hyperparameter tuning as the optimal weight decay is no longer tied to the learning rate.
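
A hedged sketch, assuming decouple_lr is the flag that enables fully decoupled weight decay; because the decay is no longer multiplied by the learning rate, a much smaller value plays the same role:

```python
# Sketch comparing PyTorch-style decoupled weight decay with fully decoupled
# weight decay. The decouple_lr keyword is an assumption about the exact API.
from torch import nn
from optimi import AdamW

model = nn.Linear(64, 64)

# PyTorch-style decoupled weight decay: the applied decay is lr * weight_decay
opt_decoupled = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# fully decoupled weight decay: the decay is applied independently of the
# learning rate, so a smaller value is used
opt_fully_decoupled = AdamW(
    model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True
)
```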

Foreach Implementations

All optimi optimizers have fast foreach implementations, which can significantly outperform the for-loop versions. optimi reuses the gradient buffer for temporary variables to reduce foreach memory usage.
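
A minimal sketch, assuming foreach is the keyword that selects the horizontally fused implementation:

```python
# Sketch: enable the foreach implementation explicitly. The foreach keyword is
# an assumption about the exact API; it may also be selected automatically
# when left unset.
from torch import nn
from optimi import AdamW

model = nn.Linear(64, 64)
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)
```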

Documentation

https://optimi.benjaminwarner.dev

Install

optimi is available to install from PyPI.

pip install torch-optimi

Optimizers

optimi implements the following optimizers: