Releases · NVIDIA/Model-Optimizer
ModelOpt 0.33.1 Release
Bug Fixes
- Fix a Qwen3 MoE model export issue.
ModelOpt 0.33.0 Release
Backward Breaking Changes
- PyTorch dependencies for `modelopt.torch` features are no longer optional, and `pip install nvidia-modelopt` is now the same as `pip install nvidia-modelopt[torch]`.
New Features
- Upgrade TensorRT-LLM dependency to 0.20.
- Add new CNN QAT example to demonstrate how to use ModelOpt for QAT (a minimal sketch follows this list).
- Add support for ONNX models with custom TensorRT ops in Autocast.
- Add quantization aware distillation (QAD) support in the `llm_qat` example.
- Add support for BF16 in ONNX quantization.
- Add per node calibration support in ONNX quantization.
- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires `transformers>=4.52.0`.
- Support quantization of FSDP2-wrapped models and add FSDP2 support in the `llm_qat` example.
- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
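The QAT flow referenced above can be summarized roughly as follows. This is a minimal sketch, assuming the `mtq.quantize` API and the `INT8_DEFAULT_CFG` config as documented; the tiny CNN, random calibration data, and placeholder loss are not the shipped example.

```python
# Hedged QAT sketch (assumptions: mtq.quantize / INT8_DEFAULT_CFG per ModelOpt docs;
# the toy CNN, random data, and loss are placeholders).
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 32 * 32, 10)
)
calib_batches = [torch.randn(2, 3, 32, 32) for _ in range(4)]

def forward_loop(m):
    # Calibration pass: run representative data through the model to collect ranges.
    for x in calib_batches:
        m(x)

# Insert fake-quantization ops and calibrate them.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# QAT: keep training with the quantizers in place so the weights adapt to quantization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for x in calib_batches:
    loss = model(x).square().mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```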
ModelOpt 0.31.0 Release
Backward Breaking Changes
- NeMo and Megatron-LM distributed checkpoints (`torch-dist`) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a `torch` checkpoint, and resume with 0.31 to convert to the new format. The following changes only apply to storing and resuming distributed checkpoints:
  - `quantizer_state` of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` is now stored in `extra_state` of :class:`QuantModule <modelopt.torch.quantization.nn.module.QuantModule>`, where it used to be stored in the sharded `modelopt_state`.
  - The dtype and shape of `amax` and `pre_quant_scale` stored in the distributed checkpoint are now restored. Some dtypes and shapes were previously changed to make all decoder layers have a homogeneous structure in the checkpoint.
  - Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
- The `auto_quantize` API now accepts a list of quantization config dicts as the list of quantization choices (a hedged usage sketch follows this list).
  - This API previously accepted a list of quantization format name strings, so it was limited to pre-defined quantization formats unless worked around with hacks.
  - With this change, users can now easily use their own custom quantization formats for auto_quantize.
  - In addition, `quantization_formats` now excludes `None` (indicating "do not quantize") as a valid format, because auto_quantize internally always adds "do not quantize" as an option anyway.
- Model export config is refactored. The quant config in `hf_quant_config.json` is converted and saved to `config.json`. `hf_quant_config.json` will be deprecated soon.
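A rough usage sketch of the new auto_quantize interface follows. The keyword names (`constraints`, `quantization_formats`, `data_loader`, `forward_step`, `loss_func`) and the tuple return value are assumptions based on a reading of the 0.31 docs, and the toy model and data are placeholders.

```python
# Hedged auto_quantize sketch: quantization choices are now config dicts, so any
# user-defined config can be searched over, not just pre-defined format names.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_loader = [torch.randn(4, 64) for _ in range(8)]

# Any config dict works as a choice; a custom user-defined dict could be added here.
quant_choices = [mtq.FP8_DEFAULT_CFG, mtq.INT8_DEFAULT_CFG]

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 6.0},
    quantization_formats=quant_choices,
    data_loader=calib_loader,
    forward_step=lambda m, batch: m(batch),
    loss_func=lambda output, batch: output.square().mean(),
)
```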
Deprecations
- Deprecate Python 3.9 support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.19.
- Add new model support in the `llm_ptq` example: Qwen3 MoE.
- ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for CPU-offloaded Huggingface models (see the sketch after this list).
- Add AutoCast tool to convert ONNX models to FP16 or BF16.
- Add `--low_memory_mode` flag in the `llm_ptq` example to initialize HF models with compressed weights and reduce the peak memory usage of PTQ and quantized checkpoint export.
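For the CPU-offloaded Huggingface quantization mentioned above, a minimal sketch might look like the following. The model id and the two calibration prompts are placeholders, and the `mtq.INT4_AWQ_CFG` config name follows the ModelOpt docs as understood here.

```python
# Hedged sketch: INT4-AWQ PTQ of a Huggingface model loaded with accelerate-style
# offloading (device_map="auto"); model id and calibration texts are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-llm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # may offload layers to CPU

def forward_loop(m):
    # Calibration pass over a few representative prompts.
    for text in ["Hello world.", "ModelOpt PTQ calibration sample."]:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```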
ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the `mtq.compress <modelopt.torch.quantization.compress>` API to accelerate evaluation of quantized models (a hedged sketch follows this list).
- Use the shape of PyTorch parameters and buffers of `TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate>` or the custom calibration algorithm doc for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization. `--calibration_shapes` is reserved for the input shapes used for the calibration process, while `--override_shapes` is used to override the input shapes of the model with static shapes.
- Add support for UNet ONNX quantization.
- Enable `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable Redundant Cast elimination pass by default in `moq.quantize <modelopt.onnx.quantization.quantize>`.
- Add new attribute `parallel_state` to `DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8, NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
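A rough sketch of the PTQ-then-compress flow referenced above follows; it assumes an FP8-capable GPU and that `mtq.compress` operates in place, with a toy model and random calibration data standing in for a real workload.

```python
# Hedged sketch: fake-quant PTQ followed by mtq.compress to store real low-precision
# weights for faster evaluation (assumes a GPU that supports FP8; toy model/data).
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 16)).cuda()
calib = [torch.randn(8, 128, device="cuda") for _ in range(4)]

def forward_loop(m):
    # Calibration pass over the toy data.
    for x in calib:
        m(x)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert + calibrate quantizers
mtq.compress(model)  # assumed to compress weights in place after PTQ

with torch.no_grad():
    _ = model(calib[0])  # evaluation now runs with compressed weights
```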
ModelOpt 0.27.1 Release
Add experimental quantization support for Llama4, QwQ and Qwen MoE models.
ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs. Please use the `mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume for expert-parallel (EP), `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API `mtq.compress <modelopt.torch.quantization.compress>` for compressing model weights after quantization.
- Add option to simplify the ONNX model before quantization is performed.
- (Experimental) Improve support for ONNX models with custom TensorRT ops:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
- Add support for
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
ModelOpt 0.25.0 Release
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate `humaneval` benchmark in `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate `fp8_naive` quantization format in `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support fast Hadamard transform in the `TensorQuantizer` class (`modelopt.torch.quantization.nn.modules.TensorQuantizer`). It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the `fast_hadamard_transform` package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Disabled saving the ModelOpt state in unified HF export APIs by default, i.e., added a `save_modelopt_state` flag in the `export_hf_checkpoint` API that defaults to False (a hedged sketch follows this list).
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- The `modelopt.deploy.llm.LLM` class now supports using the `tensorrt_llm._torch.LLM` backend for quantized HuggingFace checkpoints.
- Add NVFP4 PTQ example for DeepSeek-R1.
- Add end-to-end AutoDeploy example for AutoQuant LLM models.
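For the unified HF export change above, a hedged sketch is shown below. `export_hf_checkpoint` and its `export_dir` / `save_modelopt_state` arguments are taken from a reading of the docs, and `model` is assumed to be a model already quantized with `mtq.quantize`.

```python
# Hedged export sketch: save_modelopt_state now defaults to False and must be
# re-enabled explicitly if the ModelOpt state should be embedded in the checkpoint.
from modelopt.torch.export import export_hf_checkpoint

# `model` is assumed to be a Huggingface model already quantized with mtq.quantize(...).
export_hf_checkpoint(model, export_dir="./quantized-ckpt", save_modelopt_state=True)
```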
ModelOpt 0.23.2 Release
Fix export for NVIDIA NeMo models.
ModelOpt 0.23.1 Release
Bug Fixes
- Set `torch.load(..., weights_only=False)` where the Model Optimizer state is restored, since torch 2.6 changed the default value to `True` (illustrated below).
- Other minor fixes.
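The fix above boils down to passing `weights_only=False` explicitly when restoring the pickled ModelOpt state, since PyTorch 2.6 flipped the default; the checkpoint path below is a placeholder.

```python
# PyTorch 2.6 changed torch.load's default to weights_only=True, which rejects
# pickled Python objects such as the ModelOpt state; the restore path now opts out.
import torch

state = torch.load("modelopt_checkpoint.pth", weights_only=False)  # placeholder path
```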
ModelOpt 0.23.0 - First OSS Release!
Backward Breaking Changes
- Nvidia TensorRT Model Optimizer has changed its LICENSE from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and Cuda 11.x support.
- ONNX Runtime dependency upgraded to 1.20 which no longer supports Python 3.9.
- In the Huggingface examples, `trust_remote_code` is set to false by default, and users are required to explicitly turn it on with the `--trust_remote_code` flag (see the example after this list).
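With the flag now off by default in the examples, loading a model that ships custom modeling code requires enabling it explicitly, roughly as below (the model id is a placeholder).

```python
# Models with custom remote code must now be loaded with trust_remote_code enabled,
# mirroring the examples' --trust_remote_code flag (model id is a placeholder).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/custom-model",
    trust_remote_code=True,
)
```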
New Features
- Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples.
- Allow exporting an lm_head-quantized TensorRT-LLM checkpoint. Quantizing lm_head could benefit smaller models at a potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
- The unified Llama 3.1 FP8 Huggingface checkpoints can be deployed on SGLang.