Error: GPT-OSS-20B QAT failed with "RecursionError: maximum recursion depth exceeded" #857

@tjdhg456

Description


Describe the bug

Following the example code at https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/gpt-oss, we are trying to run QAT after SFT training.

When using QATSFTTrainer with a GPT-OSS MoE (Mixture of Experts) model, training fails with a RecursionError: maximum recursion depth exceeded during the evaluation phase.

The issue occurs in the _quantized_bmm() function, where torch._bmm is intercepted and redirected back to _quantized_bmm, causing infinite recursion (954+ repeated calls).

Impact: Blocker - Cannot perform QAT training on MoE architecture models.
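To illustrate the failure pattern (a minimal, self-contained sketch, not the actual ModelOpt implementation): if the wrapper installed over torch.bmm ends up calling the patched symbol again instead of a reference to the original function saved before patching, every call re-enters the wrapper until Python's recursion limit is hit.

import torch

_original_bmm = torch.bmm  # must be captured BEFORE the patch is installed

def _patched_bmm(batch1, batch2):
    # A real implementation would quantize batch1/batch2 here.
    # Calling torch.bmm at this point would re-enter _patched_bmm and recurse;
    # calling the saved _original_bmm terminates normally.
    return _original_bmm(batch1, batch2)

torch.bmm = _patched_bmm

a = torch.randn(2, 3, 4)
b = torch.randn(2, 4, 5)
print(torch.bmm(a, b).shape)  # torch.Size([2, 3, 5])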

Stack Trace (abbreviated):

RecursionError: maximum recursion depth exceeded

File "modelopt/torch/quantization/plugins/huggingface.py", line 626, in _quantized_bmm
    return torch._bmm(batch1, batch2)
  [Previous line repeated 954 more times]

File "modelopt/torch/quantization/plugins/huggingface.py", line 624, in _quantized_bmm
    batch1 = self.down_proj_input_quantizer(batch1) if self._down_proj_mul else batch1

File "modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 275, in pre_quant_scale
    if not hasattr(self, "_pre_quant_scale") or not self._enable_pre_quant_scale:

RecursionError: maximum recursion depth exceeded

Steps/Code to reproduce bug

  1. Load a GPT-OSS MoE model using transformers
  2. Apply QAT using QATSFTTrainer
  3. Start training - error occurs during first evaluation

Minimal reproduction code:

import torch

from modelopt.torch.quantization.plugins import QATSFTTrainer, QuantizationArguments
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, ScriptArguments, ModelConfig, TrlParser

# Load GPT-OSS MoE model
model = AutoModelForCausalLM.from_pretrained(
    "path/to/gpt-oss-moe-model",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gpt-oss-moe-model")

# Setup QAT trainer
trainer = QATSFTTrainer(
    model=model,
    args=training_args,  # SFTConfig with eval_strategy="steps"
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    quant_args=QuantizationArguments(),  # default MXFP4 config
)

# This triggers the error during evaluation
trainer.train()

Model architecture details:

  • The GPT-OSS model uses MoE with GptOssExperts layer
  • The experts layer uses torch.bmm() for batched matrix multiplication:
    # transformers/models/gpt_oss/modeling_gpt_oss.py:130
    gate_up = torch.bmm(hidden_states, self.gate_up_proj) + self.gate_up_proj_bias[..., None, :]

Expected behavior

QAT training should complete successfully without recursion errors. The _quantized_bmm function should properly quantize torch.bmm operations without creating circular calls.

Who can help?


System information

  • Container used: Custom Ray runtime container with conda environment

  • OS: Linux (RHEL-based, kernel 5.15.x)

  • CPU architecture: x86_64

  • GPU name: NVIDIA H100

  • GPU memory size: 80GB

  • Number of GPUs: 8 (multi-GPU training with DeepSpeed ZeRO-3)

  • Library versions:

    • Python: 3.12
    • ModelOpt version: latest (installed via nvidia-modelopt[all])
    • CUDA: 12.8
    • PyTorch: 2.8.0
    • Transformers: 4.57.6
    • TRL: 0.27.2
    • Accelerate: 1.12.0
    • DeepSpeed: latest
    • TensorRT-LLM: N/A
    • ONNXRuntime: N/A
    • TensorRT: N/A
  • Additional details:

    • Training framework: Ray + DeepSpeed + Accelerate
    • The error occurs at ~28% progress (step 8/29) during the first evaluation
    • All 8 ranks fail with the same RecursionError
    • Training phase before evaluation completes successfully

Questions

  1. Does nvidia-modelopt officially support MoE (Mixture of Experts) architectures? If so, are there specific configuration requirements?

  2. The _quantized_bmm function in huggingface.py creates a circular call when torch._bmm is intercepted. Is this a known issue with MoE models?

  3. What is the recommended way to exclude specific layers (e.g., MoE experts using torch.bmm) from quantization while still applying QAT to the rest of the model? A rough sketch of what we have in mind follows this list.

  4. Are there alternative quantization configurations that avoid quantizing torch.bmm operations?
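To make question 3 concrete, this is the kind of exclusion we have in mind. It is only a rough sketch, assuming the common ModelOpt pattern of disabling quantizers for modules matched by a wildcard key in the quantization config; the config constant MXFP4_DEFAULT_CFG below is an assumption and may be named differently in the installed version.

import copy
import modelopt.torch.quantization as mtq

# Start from the config the trainer would otherwise use (MXFP4 in our setup).
quant_cfg = copy.deepcopy(mtq.MXFP4_DEFAULT_CFG)  # assumed constant name

# Disable every quantizer under the MoE experts so torch.bmm inside
# GptOssExperts is never intercepted, while the rest of the model keeps QAT.
quant_cfg["quant_cfg"]["*experts*"] = {"enable": False}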


Workarounds Attempted

  1. Using manual mtq.quantize() instead of QATSFTTrainer - Not yet tested (a sketch of what we plan to try follows this list)
  2. Excluding experts layers from quantization - Not yet tested:
    from modelopt.torch.quantization import disable_quantization
    for name, module in model.named_modules():
        if "experts" in name:
            disable_quantization(module)
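For workaround 1, the plan is roughly the following, based on the general ModelOpt QAT flow (quantize with a short calibration pass, then fine-tune the quantized model with a plain trl SFTTrainer). This is an untested sketch: quant_cfg is the config with the "*experts*" exclusion from the sketch above, and calib_batches, training_args, dataset, and tokenizer are prepared the same way as in the reproduction code.

import modelopt.torch.quantization as mtq
from trl import SFTTrainer

def calibrate(m):
    # Feed a handful of tokenized batches through the model so the
    # quantizers can collect calibration statistics.
    for batch in calib_batches:
        m(**batch)

# Quantize explicitly instead of letting QATSFTTrainer do it.
model = mtq.quantize(model, quant_cfg, forward_loop=calibrate)

# Then run ordinary SFT on the quantized model (QAT).
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)
trainer.train()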

Related Files

  • Error location: modelopt/torch/quantization/plugins/huggingface.py:624-626
  • Model forward: transformers/models/gpt_oss/modeling_gpt_oss.py:130
