Code accompanying the paper *Generalized Gradient Norm Clipping & Non-Euclidean $(L_0, L_1)$-Smoothness*.
This paper is a follow-up to *Training Deep Learning Models with Norm-Constrained LMOs* and is based on the Scion codebase.
- `clippedscion.py`: Contains the `UnconstrainedClippedScion` and `ClippedScion` reference implementations along with various norm choices.
  - Algorithm 3 corresponds to `UnconstrainedClippedScion`.
  - Algorithm 4 (Variant 2) corresponds to `ClippedScion`. For simplicity, we control $\min\{\rho, \sum_{l=1}^D \langle d^k_l, v^k_l \rangle\}$ in practice (see the sketch after this list).
- `examples/`: Example usage containing airbench, nanoGPT, and DeiT experiments with and without weight sharing.
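As a rough illustration of the quantity that the clipping threshold caps, here is a hedged sketch (the function name and variable names are ours; see Algorithms 3 & 4 in the paper for the exact update rule):

```python
import torch

# Hedged sketch of the quantity thresholded in Algorithms 3 & 4. `ds` are the
# layerwise momentum estimates d^k_l and `vs` the corresponding LMO outputs
# v^k_l; sign conventions follow the paper, so the pairing is assumed
# nonnegative. This illustrates the controlled quantity, not the exact update.
def clipped_pairing(ds: list[torch.Tensor], vs: list[torch.Tensor], rho: float) -> float:
    pairing = sum((d * v).sum().item() for d, v in zip(ds, vs))  # sum_l <d^k_l, v^k_l>
    return min(rho, pairing)  # the controlled quantity: min{rho, sum_l <d_l, v_l>}
```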
The `ClippedScion` optimizer comes with the following hyperparameters:
- `momentum`: This parameter equals `1 - usual_momentum` of, e.g., the PyTorch implementation of SGD with momentum. A good default is 0.1. Higher values (e.g., 0.5) seem to work better for short training runs with low noise, as also supported by theory.
- `scale`: Controls the per-layer constraint radius factor. The layerwise radius can be tuned on a small proxy model, similarly to the input and output scaling factors of µP.
- `lr`: The learning rate, which can similarly be tuned on a small proxy model (corresponds to γ in the paper).
- `unconstrained`: When set to `False`, the constrained variant of ClippedScion is used, which guarantees that the iterates stay bounded.
- `rho`: The clipping threshold; it controls $\sum_{l=1}^D \langle d^k_l, v^k_l \rangle$ in Algorithms 3 & 4.
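Putting these together, a minimal construction might look like the following. This is a sketch only: the parameter grouping is illustrative, the module name is assumed to match `clippedscion.py`, and passing `unconstrained` to the constructor is our reading of the hyperparameter list above (complete configurations are shown further below):

```python
from clippedscion import ClippedScion  # assumed import path, matching clippedscion.py

# Illustrative single parameter group; real configurations are given below.
optim_groups = [{
    'params': model.parameters(),
    'norm': 'Auto',        # picks the layerwise norm based on the parameter shape
    'norm_kwargs': {},
    'scale': 8.0,          # per-layer constraint radius factor
}]
optimizer = ClippedScion(
    optim_groups,
    lr=2**-4,              # gamma in the paper
    momentum=0.1,          # = 1 - usual_momentum
    unconstrained=False,   # constrained variant: iterates stay bounded (assumed kwarg)
    rho=1600,              # clipping threshold
)
```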
Architectural changes:
- Scale activation functions (ReLU, GELU) by √2 to maintain the input variance.
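For instance, a √2-scaled GELU could be written as follows (a minimal sketch; the module name is ours):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledGELU(nn.Module):
    """GELU scaled by sqrt(2) so the activation roughly preserves input variance."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return math.sqrt(2) * F.gelu(x)
```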
For runnable examples, see `examples/`.
Below are some pseudocode configurations for different architectures and domains (see Appendix C for exact parameter choices):
- nanoGPT with weight sharing (see `examples/modded-nanogpt`):

  ```python
  radius = 50.0
  threshold = 600

  optim_groups = [{
      'params': model.transformer.h.parameters(),
      'norm': 'Spectral',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': model.lm_head.parameters(),
      'norm': 'Sign',
      'norm_kwargs': {},
      'scale': radius * 60.0,
  }]
  optimizer = UnconstrainedClippedScion(optim_groups, lr=2**-12, momentum=0.1, rho=threshold)
  ```
- CNN (see `examples/airbench` for further details):

  ```python
  radius = 8.0
  threshold = 1600

  optim_groups = [{
      'params': remaining_parameters,
      'norm': 'Auto',  # picks the layerwise norm based on the parameter shape
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': output_layer,
      'norm': 'Sign',
      'norm_kwargs': {'normalized': True},
      'scale': radius * 16,
  }]
  optimizer = UnconstrainedClippedScion(optim_groups, lr=2**-4, momentum=0.5, rho=threshold)
  ```
- DeiT:

  ```python
  radius = 25
  threshold = 8000

  optim_groups = [{
      'params': other_params,
      'norm': 'Auto',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': head_weights,
      'norm': 'Sign',
      'norm_kwargs': {},
      'scale': radius * 20,
  }, {
      'params': [pos_embed_param, cls_token_param],
      'norm': 'BiasRMS',
      'norm_kwargs': {},
      'scale': radius,
  }]
  optimizer = UnconstrainedClippedScion(optim_groups, lr=8e-5, momentum=0.1, rho=threshold)
  ```
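Once constructed, the optimizer is used like any other PyTorch optimizer (a hedged sketch, assuming the implementation follows the standard `torch.optim.Optimizer` interface; `dataloader`, `model`, and `loss_fn` are placeholders):

```python
# Standard PyTorch training step; only the optimizer construction above is
# specific to ClippedScion.
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```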
If you find this work useful, please cite it as follows:
```bibtex
@article{pethick2025generalized,
  title={Generalized Gradient Norm Clipping \& Non-Euclidean $(L_0, L_1)$-Smoothness},
  author={Pethick, Thomas and Xie, Wanyun and Erdogan, Mete and Antonakopoulos, Kimon and Silveti-Falls, Tony and Cevher, Volkan},
  journal={arXiv preprint arXiv:2506.01913},
  year={2025}
}
```