
ModelOpt 0.41.0 Release

kevalmorabia97 released this 20 Jan 17:10 · d39cf45

Bug Fixes

  • Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).

New Features

  • Add support for Transformer Engine quantization for Megatron Core models.
  • Add support for Qwen3-Next model quantization.
  • Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
  • Add support for KV cache quantization in the vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
  • Add support for subgraphs in ONNX autocast.
  • Add support for parallel draft heads in Eagle speculative decoding.
  • Add support for registering custom emulated quantization backends. See register_quant_backend for details and tests/unit/torch/quantization/test_custom_backend.py for an example (an illustrative sketch follows this list).
  • Add examples/llm_qad for QAD training with Megatron-LM.
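
A minimal sketch of the custom-backend feature is shown below. The import path, the register_quant_backend signature, and the backend/function names here are assumptions for illustration only; the authoritative usage is in tests/unit/torch/quantization/test_custom_backend.py.

```python
# Hypothetical sketch only: the registration signature and backend name below
# are assumptions; see tests/unit/torch/quantization/test_custom_backend.py
# for the actual API.
import torch
import modelopt.torch.quantization as mtq


def emulated_int8_quant(x: torch.Tensor, amax: torch.Tensor) -> torch.Tensor:
    """Toy emulated INT8 fake-quant: scale, round, clamp, then de-scale."""
    scale = amax / 127.0
    return (x / scale).round().clamp(-128, 127) * scale


# Assumed registration call: map a backend name to the emulated quant function
# so quantizer configs can reference the backend by name.
mtq.register_quant_backend("emulated_int8", emulated_int8_quant)
```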

Deprecations

  • Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). Use ModelOpt 0.40.0 or earlier if you still need to prune it.

Backward Breaking Changes

  • Remove torchprofile as a default dependency of ModelOpt, as it is used only for FLOPs-based FastNAS pruning of computer vision models. It can be installed separately if needed.