Releases: warner-benjamin/optimi

v0.3.3

26 Dec 22:35

Highlights

  • Add to_low_precision
  • StableAdamW Triton optimizers can load from non-Triton checkpoints

Full Changelog: v0.3.2...v0.3.3

v0.3.2

29 Jul 06:02

Full Changelog: v0.3.1...v0.3.2

v0.3.1

25 Jul 21:17

  • Deprecate foreach optimizers
  • Fix StableAdamW Triton gradient release

Full Changelog: v0.3.0...v0.3.1

v0.3.0: Fast Triton Kernels

15 Jul 20:36

This release adds Triton kernels for all optimi optimizers and sets them as the default. optimi's vertically fused Triton kernels are faster than PyTorch's vertically and horizontally fused CUDA kernels and are nearly as fast as compiled optimizers.
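
Because the Triton kernels are the default, no code changes should be required. A minimal sketch, assuming automatic Triton kernel selection on a supported GPU as described above; the constructor arguments shown are the standard optimi ones:

```python
# Sketch: after v0.3.0, constructing an optimi optimizer as before should use
# the fused Triton kernels by default on a supported GPU (assumption based on
# this release note; no new arguments are shown because none are required).
import torch
from torch import nn
from optimi import StableAdamW

model = nn.Linear(256, 256, device="cuda")
opt = StableAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x = torch.randn(32, 256, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()       # should dispatch to the vertically fused Triton kernel path
opt.zero_grad()
```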

Full Changelog: v0.2.1...v0.3.0

v0.2.1: param_groups_weight_decay

06 Jun 00:25

Adds param_groups_weight_decay, which excludes bias and normalization layers from weight decay. param_groups_weight_decay is lightly modified from PyTorch Image Models.
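
A hedged usage sketch; it assumes param_groups_weight_decay is importable from the optimi package and that weight_decay is the keyword argument name:

```python
# Sketch: build parameter groups that exclude bias and normalization layers
# from weight decay, then hand them to an optimi optimizer. The weight_decay
# keyword is an assumption about the exact signature.
import torch
from torch import nn
from optimi import AdamW, param_groups_weight_decay

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 10))

# Two groups are returned: decayed parameters, and non-decayed parameters
# (biases and normalization weights) with weight decay set to zero.
params = param_groups_weight_decay(model, weight_decay=1e-2)
opt = AdamW(params, lr=1e-3)
```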

Full Changelog: v0.2.0...v0.2.1

v0.2.0: Gradient Release & Optimizer Accumulation

11 Mar 04:13

  • Add Gradient Release
  • Add Optimizer Accumulation (both are sketched below)
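
A hedged sketch of both features, based on my reading of the optimi documentation; the gradient_release keyword, the prepare_for_gradient_release / remove_gradient_release helpers, and the optimizer_accumulation attribute are assumptions about the exact API:

```python
# Sketch: gradient release steps the optimizer parameter by parameter during
# the backward pass and frees gradients immediately; optimizer accumulation
# approximates gradient accumulation by accumulating updates directly into
# the optimizer states. API names here are assumptions, not confirmed.
import torch
from torch import nn
import torch.nn.functional as F
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 10)).cuda()
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)

# register the hooks that run the optimizer step during backward
prepare_for_gradient_release(model, opt)

accumulation_steps = 4
data = [(torch.randn(8, 128, device="cuda"),
         torch.randint(0, 10, (8,), device="cuda")) for _ in range(8)]

for step, (x, y) in enumerate(data):
    # while True, updates are accumulated into the optimizer states instead of
    # being fully applied; the "real" update happens when the flag is False
    opt.optimizer_accumulation = (step + 1) % accumulation_steps != 0

    loss = F.cross_entropy(model(x), y)
    # parameters update and gradients are freed during backward, so explicit
    # opt.step() and opt.zero_grad() calls are assumed to be unnecessary here
    loss.backward()

# remove the gradient release hooks when finished training
remove_gradient_release(model)
```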

Full Changelog: v0.1.2...v0.2.0

v0.1.2

18 Dec 16:05

Add RAdam and Ranger optimizers.
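
Both should be drop-in additions using the usual optimi constructor; a minimal sketch, with default hyperparameters assumed:

```python
# Sketch: the new optimizers follow the same constructor pattern as the other
# optimi optimizers.
from torch import nn
from optimi import RAdam, Ranger

model = nn.Linear(32, 32)
opt = RAdam(model.parameters(), lr=1e-3)   # rectified Adam
opt = Ranger(model.parameters(), lr=1e-3)  # RAdam combined with Lookahead
```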

Full Changelog: v0.1.1...v0.1.2

v0.1.1: Initial Release

19 Nov 01:56

optimī

Fast, Modern, and Low Precision PyTorch Optimizers

optimi enables accurate low precision training via Kahan summation, supports fully decoupled weight decay, and features fast implementations of modern optimizers.

Low Precision Training with Kahan Summation

optimi optimizers can match the performance of mixed precision when training in BFloat16 by using Kahan summation.

Training in BFloat16 with Kahan summation can reduce non-activation training memory usage by 37.5 to 45.5 percent when using an Adam optimizer. BFloat16 training increases single GPU training speed by ~10 percent at the same batch size.
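
A minimal sketch of BFloat16 training with an optimi optimizer, assuming kahan_sum is the explicit toggle and that it defaults to on for low precision parameters:

```python
# Sketch: train a BFloat16 model with Kahan summation. The kahan_sum keyword
# and its low-precision default are assumptions based on the docs.
import torch
from torch import nn
from optimi import AdamW

model = nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)

# kahan_sum should default to True for low precision parameters; it is shown
# explicitly here for clarity
opt = AdamW(model.parameters(), lr=1e-3, kahan_sum=True)

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```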

Fully Decoupled Weight Decay

In addition to supporting PyTorch-style decoupled weight decay, optimi optimizers also support fully decoupled weight decay.

Fully decoupled weight decay decouples weight decay from the learning rate, more accurately following Decoupled Weight Decay Regularization. This can help simplify hyperparameter tuning as the optimal weight decay is no longer tied to the learning rate.
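
A hedged sketch, assuming decouple_lr is the flag that enables fully decoupled weight decay; because the decay is no longer multiplied by the learning rate, a much smaller value plays the same role:

```python
# Sketch comparing PyTorch-style decoupled weight decay with fully decoupled
# weight decay. The decouple_lr keyword is an assumption about the exact API.
from torch import nn
from optimi import AdamW

model = nn.Linear(64, 64)

# PyTorch-style decoupled weight decay: the applied decay is lr * weight_decay
opt_decoupled = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# fully decoupled weight decay: the decay is applied independently of the
# learning rate, so a smaller value is used
opt_fully_decoupled = AdamW(
    model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True
)
```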

Foreach Implementations

All optimi optimizers have fast foreach implementations, which can significantly outperform the for-loop versions. optimi reuses the gradient buffer for temporary variables to reduce foreach memory usage.
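
A minimal sketch, assuming foreach is the keyword that selects the horizontally fused implementation:

```python
# Sketch: enable the foreach implementation explicitly. The foreach keyword is
# an assumption about the exact API; it may also be selected automatically
# when left unset.
from torch import nn
from optimi import AdamW

model = nn.Linear(64, 64)
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)
```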

Documentation

https://optimi.benjaminwarner.dev

Install

optimi is available to install from PyPI.

pip install torch-optimi

Optimizers

optimi implements the following optimizers: