Releases · NVIDIA/Model-Optimizer
ModelOpt 0.33.1 Release
Bug Fixes
- Fix a Qwen3 MoE model export issue.
ModelOpt 0.33.0 Release
Backward Breaking Changes
- PyTorch dependencies for `modelopt.torch` features are no longer optional, and `pip install nvidia-modelopt` is now the same as `pip install nvidia-modelopt[torch]`.
New Features
- Upgrade TensorRT-LLM dependency to 0.20.
- Add new CNN QAT example to demonstrate how to use ModelOpt for QAT (a minimal sketch follows this list).
- Add support for ONNX models with custom TensorRT ops in Autocast.
- Add quantization aware distillation (QAD) support in the `llm_qat` example.
- Add support for BF16 in ONNX quantization.
- Add per node calibration support in ONNX quantization.
- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires `transformers>=4.52.0`.
- Support quantization of FSDP2-wrapped models and add FSDP2 support in the `llm_qat` example.
- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
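The QAT flow referenced above can be summarized roughly as follows. This is a minimal sketch, assuming the `mtq.quantize` API and the `INT8_DEFAULT_CFG` config as documented; the tiny CNN, random calibration data, and placeholder loss are not the shipped example.

```python
# Hedged QAT sketch (assumptions: mtq.quantize / INT8_DEFAULT_CFG per ModelOpt docs;
# the toy CNN, random data, and loss are placeholders).
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 32 * 32, 10)
)
calib_batches = [torch.randn(2, 3, 32, 32) for _ in range(4)]

def forward_loop(m):
    # Calibration pass: run representative data through the model to collect ranges.
    for x in calib_batches:
        m(x)

# Insert fake-quantization ops and calibrate them.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# QAT: keep training with the quantizers in place so the weights adapt to quantization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for x in calib_batches:
    loss = model(x).square().mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```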
ModelOpt 0.31.0 Release
Backward Breaking Changes
- NeMo and Megatron-LM distributed checkpoints (`torch-dist`) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a `torch` checkpoint, and resume with 0.31 to convert to the new format. The following changes only apply to storing and resuming distributed checkpoints:
  - `quantizer_state` of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` is now stored in `extra_state` of :class:`QuantModule <modelopt.torch.quantization.nn.module.QuantModule>`, where it used to be stored in the sharded `modelopt_state`.
  - The dtype and shape of `amax` and `pre_quant_scale` stored in the distributed checkpoint are now restored. Some dtypes and shapes were previously changed to make all decoder layers have a homogeneous structure in the checkpoint.
  - Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
- The `auto_quantize` API now accepts a list of quantization config dicts as the list of quantization choices (a hedged usage sketch follows this list).
  - This API previously accepted a list of quantization format name strings, so it was limited to pre-defined quantization formats unless worked around with hacks.
  - With this change, users can now easily use their own custom quantization formats for auto_quantize.
  - In addition, `quantization_formats` now excludes `None` (indicating "do not quantize") as a valid format, because auto_quantize internally always adds "do not quantize" as an option anyway.
- Model export config is refactored. The quant config in `hf_quant_config.json` is converted and saved to `config.json`. `hf_quant_config.json` will be deprecated soon.
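A rough usage sketch of the new auto_quantize interface follows. The keyword names (`constraints`, `quantization_formats`, `data_loader`, `forward_step`, `loss_func`) and the tuple return value are assumptions based on a reading of the 0.31 docs, and the toy model and data are placeholders.

```python
# Hedged auto_quantize sketch: quantization choices are now config dicts, so any
# user-defined config can be searched over, not just pre-defined format names.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_loader = [torch.randn(4, 64) for _ in range(8)]

# Any config dict works as a choice; a custom user-defined dict could be added here.
quant_choices = [mtq.FP8_DEFAULT_CFG, mtq.INT8_DEFAULT_CFG]

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 6.0},
    quantization_formats=quant_choices,
    data_loader=calib_loader,
    forward_step=lambda m, batch: m(batch),
    loss_func=lambda output, batch: output.square().mean(),
)
```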
Deprecations
- Deprecate Python 3.9 support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.19.
- Add new model support in the `llm_ptq` example: Qwen3 MoE.
- ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for CPU-offloaded Huggingface models (see the sketch after this list).
- Add AutoCast tool to convert ONNX models to FP16 or BF16.
- Add `--low_memory_mode` flag in the `llm_ptq` example to initialize HF models with compressed weights and reduce the peak memory usage of PTQ and quantized checkpoint export.
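For the CPU-offloaded Huggingface quantization mentioned above, a minimal sketch might look like the following. The model id and the two calibration prompts are placeholders, and the `mtq.INT4_AWQ_CFG` config name follows the ModelOpt docs as understood here.

```python
# Hedged sketch: INT4-AWQ PTQ of a Huggingface model loaded with accelerate-style
# offloading (device_map="auto"); model id and calibration texts are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-llm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # may offload layers to CPU

def forward_loop(m):
    # Calibration pass over a few representative prompts.
    for text in ["Hello world.", "ModelOpt PTQ calibration sample."]:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```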
ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the `mtq.compress <modelopt.torch.quantization.compress>` API to accelerate evaluation of quantized models (a hedged sketch follows this list).
- Use the shape of PyTorch parameters and buffers of `TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate>` or the custom calibration algorithm doc for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization. `--calibration_shapes` is reserved for the input shapes used for the calibration process, while `--override_shapes` is used to override the input shapes of the model with static shapes.
- Add support for UNet ONNX quantization.
- Enable `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable Redundant Cast elimination pass by default in `moq.quantize <modelopt.onnx.quantization.quantize>`.
- Add new attribute `parallel_state` to `DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8, NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
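A rough sketch of the PTQ-then-compress flow referenced above follows; it assumes an FP8-capable GPU and that `mtq.compress` operates in place, with a toy model and random calibration data standing in for a real workload.

```python
# Hedged sketch: fake-quant PTQ followed by mtq.compress to store real low-precision
# weights for faster evaluation (assumes a GPU that supports FP8; toy model/data).
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 16)).cuda()
calib = [torch.randn(8, 128, device="cuda") for _ in range(4)]

def forward_loop(m):
    # Calibration pass over the toy data.
    for x in calib:
        m(x)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert + calibrate quantizers
mtq.compress(model)  # assumed to compress weights in place after PTQ

with torch.no_grad():
    _ = model(calib[0])  # evaluation now runs with compressed weights
```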
ModelOpt 0.27.1 Release
Add experimental quantization support for Llama4, QwQ and Qwen MoE models.
ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs. Please use the `mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume for expert-parallel (EP), `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API `mtq.compress <modelopt.torch.quantization.compress>` for compressing model weights after quantization.
- Add option to simplify the ONNX model before quantization is performed.
- (Experimental) Improve support for ONNX models with custom TensorRT ops:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
- Add support for
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
ModelOpt 0.25.0 Release
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate `humaneval` benchmark in `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate `fp8_naive` quantization format in `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support fast Hadamard transform in the `TensorQuantizer` class (`modelopt.torch.quantization.nn.modules.TensorQuantizer`). It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the `fast_hadamard_transform` package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Disabled saving the ModelOpt state in unified HF export APIs by default, i.e., added a `save_modelopt_state` flag in the `export_hf_checkpoint` API that defaults to False (a hedged sketch follows this list).
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- The `modelopt.deploy.llm.LLM` class now supports using the `tensorrt_llm._torch.LLM` backend for quantized HuggingFace checkpoints.
- Add NVFP4 PTQ example for DeepSeek-R1.
- Add end-to-end AutoDeploy example for AutoQuant LLM models.
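For the unified HF export change above, a hedged sketch is shown below. `export_hf_checkpoint` and its `export_dir` / `save_modelopt_state` arguments are taken from a reading of the docs, and `model` is assumed to be a model already quantized with `mtq.quantize`.

```python
# Hedged export sketch: save_modelopt_state now defaults to False and must be
# re-enabled explicitly if the ModelOpt state should be embedded in the checkpoint.
from modelopt.torch.export import export_hf_checkpoint

# `model` is assumed to be a Huggingface model already quantized with mtq.quantize(...).
export_hf_checkpoint(model, export_dir="./quantized-ckpt", save_modelopt_state=True)
```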
ModelOpt 0.23.2 Release
Fix export for NVIDIA NeMo models.
ModelOpt 0.23.1 Release
Bug Fixes
- Set `torch.load(..., weights_only=False)` where the Model Optimizer state is restored, since torch 2.6 changed the default value to `True` (illustrated below).
- Other minor fixes.
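The fix above boils down to passing `weights_only=False` explicitly when restoring the pickled ModelOpt state, since PyTorch 2.6 flipped the default; the checkpoint path below is a placeholder.

```python
# PyTorch 2.6 changed torch.load's default to weights_only=True, which rejects
# pickled Python objects such as the ModelOpt state; the restore path now opts out.
import torch

state = torch.load("modelopt_checkpoint.pth", weights_only=False)  # placeholder path
```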
ModelOpt 0.23.0 - First OSS Release!
Backward Breaking Changes
- Nvidia TensorRT Model Optimizer has changed its LICENSE from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and Cuda 11.x support.
- ONNX Runtime dependency upgraded to 1.20 which no longer supports Python 3.9.
- In the Huggingface examples, `trust_remote_code` is set to false by default, and users are required to explicitly turn it on with the `--trust_remote_code` flag (see the example after this list).
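With the flag now off by default in the examples, loading a model that ships custom modeling code requires enabling it explicitly, roughly as below (the model id is a placeholder).

```python
# Models with custom remote code must now be loaded with trust_remote_code enabled,
# mirroring the examples' --trust_remote_code flag (model id is a placeholder).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/custom-model",
    trust_remote_code=True,
)
```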
New Features
- Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples.
- Allow exporting an lm_head-quantized TensorRT-LLM checkpoint. Quantizing lm_head could benefit smaller models at a potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
- The unified Llama 3.1 FP8 Huggingface checkpoints can be deployed on SGLang.