[INFO] Cause of Severe Multi-GPU Training Slowdown (Transformers v4.57.3) #27

@MilkClouds

Before commit fd2ff6d92d8a9a057693cadc28c1f1098714255f, multi-GPU training was about 8–13× slower than single-GPU (e.g., 1 it/s vs 8–13 s/it).

Profiling shows that most of the time is spent in processor.__call__ inside QwenActor. Here’s a snippet of the line-by-line profile:

Timer unit: 1e-09 s

   703       240      8.6e+10 3.58e+08     66.0                      sysuser_inputs = self.processor(
   704       120      41004.0    341.7      0.0                          text=[sysuser_text],
   705       120      34960.0    291.3      0.0                          images=[sysuser_img],
   706       120      31777.0    264.8      0.0                          return_tensors="pt",
   707       120      28523.0    237.7      0.0                          padding=True,
   708                                                               )

This issue only occurs on transformers v4.57.3 (not v4.51.3). The key difference is that in v4.57.3 the default image processor is the fast one. When multiple per-GPU processes call the fast image processor simultaneously, it appears to cause extreme slowdowns compared to the single-GPU case (I'm not sure why this happens).
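For reference, here is a minimal way to check which image processor class a loaded processor is using, and to opt back into the slow one (a sketch assuming a recent transformers release; the model id is illustrative):

```python
from transformers import AutoProcessor

# Illustrative model id; any Qwen2.5-VL checkpoint behaves the same way.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

proc_default = AutoProcessor.from_pretrained(model_id)             # fast by default in v4.57+
proc_slow = AutoProcessor.from_pretrained(model_id, use_fast=False)

# Fast image processor classes end in "...Fast" and report is_fast=True.
print(type(proc_default.image_processor).__name__, proc_default.image_processor.is_fast)
print(type(proc_slow.image_processor).__name__, proc_slow.image_processor.is_fast)
```

The `use_fast=False` kwarg is forwarded by `AutoProcessor.from_pretrained` to the image processor loader, so no manual component swap is needed when loading this way.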

TL;DR:
If you’re seeing severe (8–13×) slowdowns in multi-GPU training, check whether you’re using transformers v4.57.3+ with the fast image processor (specifically in QwenActor). Downgrading transformers or switching the image processor is an immediate remedy, though I think adopting a batch-processing pattern in the codebase is the better long-term fix.

    @staticmethod
    def load_qwen_model_processor(
        qwen_model_id,
        min_pixel,
        max_pixel,
        padding_side,
    ):
        if padding_side is not None:
            processor = Qwen2_5_VLProcessor.from_pretrained(
                qwen_model_id,
                min_pixels=min_pixel,
                max_pixels=max_pixel,
                padding_side=padding_side,
            )
+            # note: requires `from transformers import Qwen2VLImageProcessor`
+            processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+                qwen_model_id,
+                use_fast=False,
+            )
        else:
            processor = Qwen2_5_VLProcessor.from_pretrained(
                qwen_model_id,
                min_pixels=min_pixel,
                max_pixels=max_pixel,
            )
+            processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+                qwen_model_id,
+                use_fast=False,
+            )

        return processor
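For the batch-processing pattern mentioned in the TL;DR, the idea is to replace N per-sample processor calls inside the rollout loop with a single padded call over the whole batch. A minimal sketch (`encode_batch` and its argument names are illustrative, not taken from the actual codebase):

```python
def encode_batch(processor, texts, images):
    """One processor call for the whole batch instead of one call per sample.

    `processor` stands in for the Qwen2_5_VLProcessor instance; `texts` and
    `images` are parallel lists of prompt strings and PIL images.
    """
    return processor(
        text=texts,
        images=images,
        return_tensors="pt",
        padding=True,  # pad to the longest sequence in the batch
    )
```

Compared with calling the processor once per sample, this amortizes the fixed per-call overhead (which dominates the profile above) across the whole batch.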
