Describe the bug
Following the example at https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/gpt-oss, we are trying to apply QAT after SFT training.
When using QATSFTTrainer with a GPT-OSS MoE (Mixture of Experts) model, training fails with RecursionError: maximum recursion depth exceeded during the evaluation phase.
The issue occurs in the _quantized_bmm() function, where torch._bmm is intercepted and redirected back to _quantized_bmm, causing infinite recursion (954+ repeated calls).
Impact: Blocker - Cannot perform QAT training on MoE architecture models.
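For context, here is a minimal standalone sketch of the failure mode we believe we are seeing (this is our own toy reproduction, not ModelOpt's actual code): if the name the wrapper calls ends up pointing back at the wrapper itself, every call re-enters the wrapper until the recursion limit is hit.

import torch

_saved_bmm = torch.bmm  # intended backup of the original kernel

def _toy_quantized_bmm(batch1, batch2):
    # (input quantization would happen here)
    return torch.bmm(batch1, batch2)  # resolves the *patched* name, not _saved_bmm

torch.bmm = _toy_quantized_bmm  # torch.bmm -> _toy_quantized_bmm -> torch.bmm -> ...

try:
    torch.bmm(torch.randn(2, 3, 4), torch.randn(2, 4, 5))
except RecursionError as exc:
    print("reproduced:", exc)
finally:
    torch.bmm = _saved_bmm  # restore the original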
Stack Trace (abbreviated):
RecursionError: maximum recursion depth exceeded
File "modelopt/torch/quantization/plugins/huggingface.py", line 626, in _quantized_bmm
return torch._bmm(batch1, batch2)
[Previous line repeated 954 more times]
File "modelopt/torch/quantization/plugins/huggingface.py", line 624, in _quantized_bmm
batch1 = self.down_proj_input_quantizer(batch1) if self._down_proj_mul else batch1
File "modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 275, in pre_quant_scale
if not hasattr(self, "_pre_quant_scale") or not self._enable_pre_quant_scale:
RecursionError: maximum recursion depth exceeded
Steps/Code to reproduce bug
- Load a GPT-OSS MoE model using transformers
- Apply QAT using QATSFTTrainer
- Start training - the error occurs during the first evaluation
Minimal reproduction code:
import torch
from modelopt.torch.quantization.plugins import QATSFTTrainer, QuantizationArguments
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, ScriptArguments, ModelConfig, TrlParser
# Load GPT-OSS MoE model
model = AutoModelForCausalLM.from_pretrained(
"path/to/gpt-oss-moe-model",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gpt-oss-moe-model")
# dataset: any SFT dataset with "train" and "test" splits (loading omitted here)
# Setup QAT trainer
trainer = QATSFTTrainer(
model=model,
args=training_args, # SFTConfig with eval_strategy="steps"
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
quant_args=QuantizationArguments(), # default MXFP4 config
)
# This triggers the error during evaluation
trainer.train()
Model architecture details:
- The GPT-OSS model uses MoE with a GptOssExperts layer
- The experts layer uses torch.bmm() for batched matrix multiplication:
  # transformers/models/gpt_oss/modeling_gpt_oss.py:130
  gate_up = torch.bmm(hidden_states, self.gate_up_proj) + self.gate_up_proj_bias[..., None, :]
Expected behavior
QAT training should complete successfully without recursion errors. The _quantized_bmm function should properly quantize torch.bmm operations without creating circular calls.
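For illustration only (not a proposed patch to huggingface.py), a wrapper that captures the original function in a closure cannot re-enter itself; this is the kind of behavior we would expect:

import torch

def make_quantized_bmm(original_bmm, input_quantizer=None):
    # Keep a reference to the real kernel in the closure so the wrapper never
    # looks torch.bmm up again through the (patched) module attribute.
    def quantized_bmm(batch1, batch2):
        if input_quantizer is not None:
            batch1 = input_quantizer(batch1)
            batch2 = input_quantizer(batch2)
        return original_bmm(batch1, batch2)
    return quantized_bmm

torch.bmm = make_quantized_bmm(torch.bmm)  # safe: the closure holds the original
out = torch.bmm(torch.randn(2, 3, 4), torch.randn(2, 4, 5))
print(out.shape)  # torch.Size([2, 3, 5])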
Who can help?
System information
- Container used: Custom Ray runtime container with conda environment
- OS: Linux (RHEL-based, kernel 5.15.x)
- CPU architecture: x86_64
- GPU name: NVIDIA H100
- GPU memory size: 80GB
- Number of GPUs: 8 (multi-GPU training with DeepSpeed ZeRO-3)
- Library versions:
  - Python: 3.12
  - ModelOpt version: latest (installed via nvidia-modelopt[all])
  - CUDA: 12.8
  - PyTorch: 2.8.0
  - Transformers: 4.57.6
  - TRL: 0.27.2
  - Accelerate: 1.12.0
  - DeepSpeed: latest
  - TensorRT-LLM: N/A
  - ONNXRuntime: N/A
  - TensorRT: N/A
- Additional details:
  - Training framework: Ray + DeepSpeed + Accelerate
  - The error occurs at ~28% progress (step 8/29), during the first evaluation
  - All 8 ranks fail with the same RecursionError
  - The training phase before evaluation completes successfully
Questions
- Does nvidia-modelopt officially support MoE (Mixture of Experts) architectures? If so, are there specific configuration requirements?
- The _quantized_bmm function in huggingface.py creates a circular call when torch._bmm is intercepted. Is this a known issue with MoE models?
- What is the recommended way to exclude specific layers (e.g., MoE experts using torch.bmm) from quantization while still applying QAT to the rest of the model? (See the sketch after this list for what we would try.)
- Are there alternative quantization configurations that avoid quantizing torch.bmm operations?
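Regarding question 3, this is the pattern we would try, sketched under our own assumptions (please correct us): we believe ModelOpt quantization configs map wildcard module patterns to quantizer settings and that {"enable": False} skips matching modules, as the default configs do for "*lm_head*"; the "*experts*" pattern and the substitution of the MXFP4 config are unverified.

import copy
import modelopt.torch.quantization as mtq

# Start from a base config; we assume the MXFP4 config that
# QuantizationArguments resolves to can be substituted here.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

# Assumption: a wildcard pattern mapped to {"enable": False} disables
# quantization for matching modules (e.g. the GptOssExperts layers).
quant_cfg["quant_cfg"]["*experts*"] = {"enable": False}

# `model` is the GPT-OSS model loaded as in the reproduction snippet above.
model = mtq.quantize(model, quant_cfg, forward_loop=None)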
Workarounds Attempted
- Using manual mtq.quantize() instead of QATSFTTrainer - Not yet tested (see the sketch after this list)
- Excluding experts layers from quantization - Not yet tested:
  from modelopt.torch.quantization import disable_quantization
  for name, module in model.named_modules():
      if "experts" in name:
          disable_quantization(module)
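For workaround 1, this is the flow we plan to test, sketched under the assumption that a model quantized up front with mtq.quantize() can then be fine-tuned with a plain TRL SFTTrainer (not yet confirmed against the ModelOpt docs):

import modelopt.torch.quantization as mtq
from trl import SFTTrainer

# `model`, `tokenizer`, `training_args`, `dataset`, and `quant_cfg`
# are defined as in the earlier snippets.
model = mtq.quantize(model, quant_cfg, forward_loop=None)  # insert fake-quant modules first

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)
trainer.train()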
Related Files
- Error location: modelopt/torch/quantization/plugins/huggingface.py:624-626
- Model forward: transformers/models/gpt_oss/modeling_gpt_oss.py:130