Description
Before commit fd2ff6d92d8a9a057693cadc28c1f1098714255f, multi-GPU training was roughly 8–13× slower than single-GPU training (about 8–13 s/it vs. 1 it/s).
Profiling shows that most of the time is spent in processor.__call__ inside QwenActor. Here is a snippet of the line_profiler output:
```
Timer unit: 1e-09 s

Line #   Hits         Time   Per Hit   % Time  Line Contents
=============================================================
   703    240      8.6e+10  3.58e+08     66.0  sysuser_inputs = self.processor(
   704    120      41004.0     341.7      0.0      text=[sysuser_text],
   705    120      34960.0     291.3      0.0      images=[sysuser_img],
   706    120      31777.0     264.8      0.0      return_tensors="pt",
   707    120      28523.0     237.7      0.0      padding=True,
   708                                           )
```
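For anyone who wants to reproduce this measurement, a minimal line_profiler sketch is below. The method name `prepare_inputs` is a placeholder, not an actual identifier from this repo; point it at whichever QwenActor method wraps the `self.processor(...)` call.

```python
# Sketch only: reproduce a per-line timing table like the one above.
# `prepare_inputs` is a hypothetical method name; substitute the real one.
from line_profiler import LineProfiler

def profile_actor_step(actor, batch):
    lp = LineProfiler()
    # Profile the method that contains the self.processor(...) call.
    lp.add_function(type(actor).prepare_inputs)
    result = lp.runcall(actor.prepare_inputs, batch)
    lp.print_stats()  # prints a per-line table like the snippet above
    return result
```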
This issue only occurs with transformers v4.57.3 (not v4.51.3). The key difference is that in v4.57.3 the default image processor is the fast variant. When multiple per-GPU processes call the fast image processor simultaneously, it appears to cause extreme slowdowns compared to the single-GPU case (I'm not sure why this happens).
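As a quick way to check which variant you ended up with (not part of the original report; the model id below is just a placeholder), something like this should work:

```python
from transformers import AutoImageProcessor, AutoProcessor

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder; use your actual checkpoint

processor = AutoProcessor.from_pretrained(model_id)
# On transformers >= 4.57 this typically resolves to a *Fast image processor class;
# on v4.51.3 it is the slow one.
print(type(processor.image_processor).__name__)

# Workaround: explicitly swap in the slow image processor.
processor.image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=False)
```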
TL;DR:
If you’re seeing severe slowdowns in multi-GPU training (8–13×), check whether you’re using transformers v4.57.3+ with the fast image processor (specifically in QwenActor). Downgrading transformers or switching to the slow image processor is an immediate remedy, although I think applying a batch-processing pattern to the codebase is the better long-term fix (a rough sketch of that pattern follows the patch below). For reference, the workaround that forces the slow image processor in load_qwen_model_processor looks like this:
```diff
 @staticmethod
 def load_qwen_model_processor(
     qwen_model_id,
     min_pixel,
     max_pixel,
     padding_side,
 ):
     if padding_side is not None:
         processor = Qwen2_5_VLProcessor.from_pretrained(
             qwen_model_id,
             min_pixels=min_pixel,
             max_pixels=max_pixel,
             padding_side=padding_side,
         )
+        # Force the slow image processor to avoid the multi-process slowdown.
+        processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+            qwen_model_id,
+            use_fast=False,
+        )
     else:
         processor = Qwen2_5_VLProcessor.from_pretrained(
             qwen_model_id,
             min_pixels=min_pixel,
             max_pixels=max_pixel,
         )
+        # Force the slow image processor to avoid the multi-process slowdown.
+        processor.image_processor = Qwen2VLImageProcessor.from_pretrained(
+            qwen_model_id,
+            use_fast=False,
+        )
     return processor
```
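And a rough sketch of the batch-processing idea mentioned in the TL;DR. This is not the project's actual API; `processor`, `texts`, and `images` are assumed to be provided by the caller, one entry per sample:

```python
# Sketch only: call the processor once per batch instead of once per sample.
# Names here do not correspond to actual identifiers in this repo.

def encode_per_sample(processor, texts, images):
    # Current pattern: one processor.__call__ per sample (slow under multi-GPU).
    return [
        processor(text=[t], images=[img], return_tensors="pt", padding=True)
        for t, img in zip(texts, images)
    ]

def encode_batched(processor, texts, images):
    # Proposed pattern: a single processor.__call__ for the whole batch,
    # which amortizes the image-processing overhead across samples.
    return processor(text=texts, images=images, return_tensors="pt", padding=True)
```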