Describe the bug
Following the example at https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/gpt-oss, we are trying to apply QAT after SFT training.
When using QATSFTTrainer with a GPT-OSS MoE (Mixture of Experts) model, training fails with RecursionError: maximum recursion depth exceeded during the evaluation phase.
The issue occurs in the _quantized_bmm() function, where torch._bmm is intercepted and redirected back to _quantized_bmm, causing infinite recursion (954+ repeated calls).
Impact: Blocker - Cannot perform QAT training on MoE architecture models.
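For context, here is a minimal standalone sketch of the failure mode we believe we are seeing (this is our own toy reproduction, not ModelOpt's actual code): if the name the wrapper calls ends up pointing back at the wrapper itself, every call re-enters the wrapper until the recursion limit is hit.

import torch

_saved_bmm = torch.bmm  # intended backup of the original kernel

def _toy_quantized_bmm(batch1, batch2):
    # (input quantization would happen here)
    return torch.bmm(batch1, batch2)  # resolves the *patched* name, not _saved_bmm

torch.bmm = _toy_quantized_bmm  # torch.bmm -> _toy_quantized_bmm -> torch.bmm -> ...

try:
    torch.bmm(torch.randn(2, 3, 4), torch.randn(2, 4, 5))
except RecursionError as exc:
    print("reproduced:", exc)
finally:
    torch.bmm = _saved_bmm  # restore the original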
Stack Trace (abbreviated):
RecursionError: maximum recursion depth exceeded
File "modelopt/torch/quantization/plugins/huggingface.py", line 626, in _quantized_bmm
return torch._bmm(batch1, batch2)
[Previous line repeated 954 more times]
File "modelopt/torch/quantization/plugins/huggingface.py", line 624, in _quantized_bmm
batch1 = self.down_proj_input_quantizer(batch1) if self._down_proj_mul else batch1
File "modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 275, in pre_quant_scale
if not hasattr(self, "_pre_quant_scale") or not self._enable_pre_quant_scale:
RecursionError: maximum recursion depth exceeded
Steps/Code to reproduce bug
- Load a GPT-OSS MoE model using transformers
- Apply QAT using QATSFTTrainer
- Start training - the error occurs during the first evaluation
Minimal reproduction code:
import torch
from modelopt.torch.quantization.plugins import QATSFTTrainer, QuantizationArguments
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, ScriptArguments, ModelConfig, TrlParser
# Load GPT-OSS MoE model
model = AutoModelForCausalLM.from_pretrained(
"path/to/gpt-oss-moe-model",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gpt-oss-moe-model")
# dataset: any SFT dataset with "train" and "test" splits (loading omitted here)
# Setup QAT trainer
trainer = QATSFTTrainer(
model=model,
args=training_args, # SFTConfig with eval_strategy="steps"
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
quant_args=QuantizationArguments(), # default MXFP4 config
)
# This triggers the error during evaluation
trainer.train()
Model architecture details:
- The GPT-OSS model uses MoE with a GptOssExperts layer
- The experts layer uses torch.bmm() for batched matrix multiplication:
  # transformers/models/gpt_oss/modeling_gpt_oss.py:130
  gate_up = torch.bmm(hidden_states, self.gate_up_proj) + self.gate_up_proj_bias[..., None, :]
Expected behavior
QAT training should complete successfully without recursion errors. The _quantized_bmm function should properly quantize torch.bmm operations without creating circular calls.
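For illustration only (not a proposed patch to huggingface.py), a wrapper that captures the original function in a closure cannot re-enter itself; this is the kind of behavior we would expect:

import torch

def make_quantized_bmm(original_bmm, input_quantizer=None):
    # Keep a reference to the real kernel in the closure so the wrapper never
    # looks torch.bmm up again through the (patched) module attribute.
    def quantized_bmm(batch1, batch2):
        if input_quantizer is not None:
            batch1 = input_quantizer(batch1)
            batch2 = input_quantizer(batch2)
        return original_bmm(batch1, batch2)
    return quantized_bmm

torch.bmm = make_quantized_bmm(torch.bmm)  # safe: the closure holds the original
out = torch.bmm(torch.randn(2, 3, 4), torch.randn(2, 4, 5))
print(out.shape)  # torch.Size([2, 3, 5])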
Who can help?
System information
- Container used: Custom Ray runtime container with conda environment
- OS: Linux (RHEL-based, kernel 5.15.x)
- CPU architecture: x86_64
- GPU name: NVIDIA H100
- GPU memory size: 80GB
- Number of GPUs: 8 (multi-GPU training with DeepSpeed ZeRO-3)
- Library versions:
  - Python: 3.12
  - ModelOpt version: latest (installed via nvidia-modelopt[all])
  - CUDA: 12.8
  - PyTorch: 2.8.0
  - Transformers: 4.57.6
  - TRL: 0.27.2
  - Accelerate: 1.12.0
  - DeepSpeed: latest
  - TensorRT-LLM: N/A
  - ONNXRuntime: N/A
  - TensorRT: N/A
- Additional details:
  - Training framework: Ray + DeepSpeed + Accelerate
  - The error occurs at ~28% progress (step 8/29), during the first evaluation
  - All 8 ranks fail with the same RecursionError
  - The training phase before evaluation completes successfully
Questions
- Does nvidia-modelopt officially support MoE (Mixture of Experts) architectures? If so, are there specific configuration requirements?
- The _quantized_bmm function in huggingface.py creates a circular call when torch._bmm is intercepted. Is this a known issue with MoE models?
- What is the recommended way to exclude specific layers (e.g., MoE experts using torch.bmm) from quantization while still applying QAT to the rest of the model? (See the sketch after this list for what we would try.)
- Are there alternative quantization configurations that avoid quantizing torch.bmm operations?
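Regarding question 3, this is the pattern we would try, sketched under our own assumptions (please correct us): we believe ModelOpt quantization configs map wildcard module patterns to quantizer settings and that {"enable": False} skips matching modules, as the default configs do for "*lm_head*"; the "*experts*" pattern and the substitution of the MXFP4 config are unverified.

import copy
import modelopt.torch.quantization as mtq

# Start from a base config; we assume the MXFP4 config that
# QuantizationArguments resolves to can be substituted here.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

# Assumption: a wildcard pattern mapped to {"enable": False} disables
# quantization for matching modules (e.g. the GptOssExperts layers).
quant_cfg["quant_cfg"]["*experts*"] = {"enable": False}

# `model` is the GPT-OSS model loaded as in the reproduction snippet above.
model = mtq.quantize(model, quant_cfg, forward_loop=None)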
Workarounds Attempted
- Using manual mtq.quantize() instead of QATSFTTrainer - Not yet tested (see the sketch after this list)
- Excluding experts layers from quantization - Not yet tested:
  from modelopt.torch.quantization import disable_quantization
  for name, module in model.named_modules():
      if "experts" in name:
          disable_quantization(module)
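For workaround 1, this is the flow we plan to test, sketched under the assumption that a model quantized up front with mtq.quantize() can then be fine-tuned with a plain TRL SFTTrainer (not yet confirmed against the ModelOpt docs):

import modelopt.torch.quantization as mtq
from trl import SFTTrainer

# `model`, `tokenizer`, `training_args`, `dataset`, and `quant_cfg`
# are defined as in the earlier snippets.
model = mtq.quantize(model, quant_cfg, forward_loop=None)  # insert fake-quant modules first

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)
trainer.train()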
Related Files
- Error location: modelopt/torch/quantization/plugins/huggingface.py:624-626
- Model forward: transformers/models/gpt_oss/modeling_gpt_oss.py:130