NVFP4 seems to fail when batch size is greater than 1 (#3783)

Error:

```text
NotImplementedError: NVFP4Tensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.expand', overload='default')>, types=(<class 'torchao.prototype.mx_formats.nvfp4_tensor.NVFP4Tensor'>,), arg_types=(<class 'torchao.prototype.mx_formats.nvfp4_tensor.NVFP4Tensor'>, <class 'list'>), kwarg_types={}
```

Code:

```py
from diffusers import FluxPipeline
import torch

from torchao.quantization import quantize_
from torchao.prototype.mx_formats.inference_workflow import (
    NVFP4DynamicActivationNVFP4WeightConfig,
    NVFP4WeightOnlyConfig,
)

config = NVFP4WeightOnlyConfig(
    use_dynamic_per_tensor_scale=True,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

quantize_(pipe.transformer, config=config)
pipe.transformer.compile_repeated_blocks(fullgraph=True)

_ = pipe("a dog", num_images_per_prompt=4)
```

The same error occurs with `NVFP4DynamicActivationNVFP4WeightConfig`.

I am using PyTorch 2.10.0 and a nightly build of TorchAO, on a B200 GPU with CUDA 12.9.
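
In case it helps anyone hitting this before a fix lands, here is a rough workaround sketch. It assumes that batch size 1 does work (which is what the title implies, though I have not confirmed it for every config) and simply calls the pipeline once per image instead of passing `num_images_per_prompt=4`. `generate_one_at_a_time` is a hypothetical helper, not part of diffusers or TorchAO; it only relies on the pipeline being callable and returning an object with an `.images` list.

```python
def generate_one_at_a_time(pipe, prompt, num_images, **kwargs):
    """Call `pipe` once per image so each forward pass uses batch size 1.

    Hypothetical helper: assumes `pipe(...)` returns an object exposing
    an `.images` list, as diffusers pipeline outputs do.
    """
    images = []
    for _ in range(num_images):
        # Force a single image per call to sidestep the batched code path.
        out = pipe(prompt, num_images_per_prompt=1, **kwargs)
        images.extend(out.images)
    return images
```

Usage would be `images = generate_one_at_a_time(pipe, "a dog", 4)`. This is slower than true batching and does not address the missing `aten.expand` implementation for `NVFP4Tensor`; it only avoids the batched path. If you pass a seeded `generator`, the results will not necessarily match what a single batched call would produce.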
