Bug Fixes
Fix Megatron KV cache quantization checkpoint restore for QAT/QAD (device placement, amax synchronization across DP/TP, and flash_decode compatibility).
New Features
Add support for Transformer Engine quantization for Megatron Core models (see the first sketch after this list).
Add support for Qwen3-Next model quantization.
Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow (see the second sketch after this list).
Add support for KV cache quantization in the vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
Add support for subgraphs in ONNX autocast.
Add support for parallel draft heads in Eagle speculative decoding.
Add support for registering a custom emulated quantization backend (see the third sketch after this list). See register_quant_backend for more details and tests/unit/torch/quantization/test_custom_backend.py for an example.
Add examples/llm_qad for QAD training with Megatron-LM.
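The Transformer Engine support plugs into the standard ModelOpt PTQ flow. Below is a minimal sketch, assuming a Megatron Core model built with Transformer Engine layers; model and calib_dataloader are placeholders for your own Megatron-LM setup, and FP8_DEFAULT_CFG is just one of the built-in quantization configs.

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Calibration: run a few batches through the model so the quantizers
    # can collect their ranges (amax). calib_dataloader is assumed to
    # come from your own Megatron data pipeline.
    for batch in calib_dataloader:
        model(**batch)

# model: a Megatron Core model whose layers are Transformer Engine modules.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```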
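For the ONNX workflow, here is a sketch of quantizing a model whose custom ops live in a dynamically linked TensorRT plugin. The file paths are hypothetical, and the trt_plugins argument name is an assumption based on this release note; check the ONNX quantization docs for the exact parameter.

```python
from modelopt.onnx.quantization import quantize

# Quantize an ONNX model containing custom ops implemented by a
# dynamically linked TensorRT plugin. All paths are placeholders.
quantize(
    onnx_path="model_with_custom_ops.onnx",
    output_path="model_with_custom_ops.quant.onnx",
    trt_plugins=["./libmy_custom_plugin.so"],
)
```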
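A minimal sketch of the custom emulated quantization backend hook. The import path and the callback signature below are assumptions for illustration only; tests/unit/torch/quantization/test_custom_backend.py shows the actual contract.

```python
import torch
from modelopt.torch.quantization import register_quant_backend  # assumed import path

def int8_emulated_quant(inputs: torch.Tensor, quantizer) -> torch.Tensor:
    # Hypothetical emulated (fake-quant) INT8 kernel: quantize-dequantize
    # with a per-tensor scale derived from the tensor's absolute maximum.
    amax = inputs.abs().amax()
    scale = amax / 127.0
    return torch.clamp(torch.round(inputs / scale), -128, 127) * scale

# Hypothetical registration of the backend under a custom name.
register_quant_backend("int8_emulated", int8_emulated_quant)
```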
Deprecations
Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). Use ModelOpt 0.40.0 or earlier if you still need to prune this dimension; a sketch of pruning the remaining dimensions follows below.
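For reference, a sketch of Minitron pruning on the dimensions that remain supported. The export_config values are arbitrary illustrative targets, and the exact mtp.prune signature should be checked against the pruning docs; model and forward_loop are placeholders from your Megatron-LM setup.

```python
import modelopt.torch.prune as mtp

# num_query_groups is no longer an accepted export_config key; use
# ModelOpt 0.40.0 or earlier if you need to prune it.
export_config = {
    "hidden_size": 3072,        # arbitrary illustrative targets
    "ffn_hidden_size": 9216,
    "num_attention_heads": 24,
}

pruned_model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,  # unused by mcore_minitron
    config={"forward_loop": forward_loop},
)
```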
Backward Breaking Changes
Remove torchprofile as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning of computer vision models. Install it separately (pip install torchprofile) if needed.