Description
Before commit fd2ff6d92d8a9a057693cadc28c1f1098714255f, multi-GPU training was roughly 8–13× slower than single-GPU training (about 8–13 s/it vs. 1 it/s).
Profiling shows that most of the time is spent in processor.__call__ inside QwenActor. Here is a snippet of the line_profiler output:
```
Timer unit: 1e-09 s

Line #   Hits         Time   Per Hit   % Time  Line Contents
=============================================================
   703    240      8.6e+10  3.58e+08     66.0  sysuser_inputs = self.processor(
   704    120      41004.0     341.7      0.0      text=[sysuser_text],
   705    120      34960.0     291.3      0.0      images=[sysuser_img],
   706    120      31777.0     264.8      0.0      return_tensors="pt",
   707    120      28523.0     237.7      0.0      padding=True,
   708                                           )
```
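For anyone who wants to reproduce this measurement, a minimal line_profiler sketch is below. The method name `prepare_inputs` is a placeholder, not an actual identifier from this repo; point it at whichever QwenActor method wraps the `self.processor(...)` call.

```python
# Sketch only: reproduce a per-line timing table like the one above.
# `prepare_inputs` is a hypothetical method name; substitute the real one.
from line_profiler import LineProfiler

def profile_actor_step(actor, batch):
    lp = LineProfiler()
    # Profile the method that contains the self.processor(...) call.
    lp.add_function(type(actor).prepare_inputs)
    result = lp.runcall(actor.prepare_inputs, batch)
    lp.print_stats()  # prints a per-line table like the snippet above
    return result
```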
This issue only occurs with transformers v4.57.3 (not v4.51.3). The key difference is that in v4.57.3 the default image processor is the fast variant. When multiple per-GPU processes call the fast image processor simultaneously, it appears to cause extreme slowdowns compared to the single-GPU case (I'm not sure why this happens).
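As a quick way to check which variant you ended up with (not part of the original report; the model id below is just a placeholder), something like this should work:

```python
from transformers import AutoImageProcessor, AutoProcessor

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder; use your actual checkpoint

processor = AutoProcessor.from_pretrained(model_id)
# On transformers >= 4.57 this typically resolves to a *Fast image processor class;
# on v4.51.3 it is the slow one.
print(type(processor.image_processor).__name__)

# Workaround: explicitly swap in the slow image processor.
processor.image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=False)
```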
TL;DR:
If you’re seeing severe slowdowns in multi-GPU training (8–13×), check whether you’re using transformers v4.57.3+ with the fast image processor (specifically in QwenActor). Downgrading transformers or switching to the slow image processor is an immediate remedy, although I think applying a batch-processing pattern to the codebase is the better long-term fix (a rough sketch of that pattern follows the patch below). For reference, the workaround that forces the slow image processor in load_qwen_model_processor looks like this:
```diff
 @staticmethod
 def load_qwen_model_processor(
     qwen_model_id,
     min_pixel,
     max_pixel,
     padding_side,
 ):
     if padding_side is not None:
         processor = Qwen2_5_VLProcessor.from_pretrained(
             qwen_model_id,
             min_pixels=min_pixel,
             max_pixels=max_pixel,
             padding_side=padding_side,
         )
+        # Force the slow image processor to avoid the multi-process slowdown.
+        processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+            qwen_model_id,
+            use_fast=False,
+        )
     else:
         processor = Qwen2_5_VLProcessor.from_pretrained(
             qwen_model_id,
             min_pixels=min_pixel,
             max_pixels=max_pixel,
         )
+        # Force the slow image processor to avoid the multi-process slowdown.
+        processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+            qwen_model_id,
+            use_fast=False,
+        )
     return processor
```
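And a rough sketch of the batch-processing idea mentioned in the TL;DR. This is not the project's actual API; `processor`, `texts`, and `images` are assumed to be provided by the caller, one entry per sample:

```python
# Sketch only: call the processor once per batch instead of once per sample.
# Names here do not correspond to actual identifiers in this repo.

def encode_per_sample(processor, texts, images):
    # Current pattern: one processor.__call__ per sample (slow under multi-GPU).
    return [
        processor(text=[t], images=[img], return_tensors="pt", padding=True)
        for t, img in zip(texts, images)
    ]

def encode_batched(processor, texts, images):
    # Proposed pattern: a single processor.__call__ for the whole batch,
    # which amortizes the image-processing overhead across samples.
    return processor(text=texts, images=images, return_tensors="pt", padding=True)
```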