[Multimodal] make multimodal processing robust#1516

Open
coding-famer wants to merge 3 commits into THUDM:main from coding-famer:fix/mm_process

Conversation

@coding-famer
Contributor

Modifications:

  1. Use explicit base64 encoding.
  2. Force return_tensors to None for input_ids, and set return_tensors='pt' for multimodal inputs.
  3. Lazy import qwen_vl_utils when loading processor.

from qwen_vl_utils import process_vision_info
# TODO: temporary solution, will write image utils for slime later
if _qwen_process_vision_info is None:
    raise ImportError("qwen_vl_utils is not installed. Install it with: pip install qwen-vl-utils")
Contributor

hmm... I don't get why we need to move the import to the function above...

Contributor Author

updated
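
For readers following this thread, the lazy-import idea in modification 3 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the helper name `lazy_attr` and its parameters are hypothetical, and only the error message format is taken from the diff above.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_attr(module_name: str, attr_name: str, install_hint: str):
    """Import `module_name` on first use and return one attribute from it.

    Hypothetical helper: the import is deferred until the attribute is
    actually needed, so an optional dependency does not break module load.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message, as the diff above does.
        raise ImportError(
            f"{module_name} is not installed. Install it with: {install_hint}"
        ) from exc
    return getattr(module, attr_name)


# Assumed usage inside the processor-loading path:
# process_vision_info = lazy_attr(
#     "qwen_vl_utils", "process_vision_info", "pip install qwen-vl-utils"
# )
```

Successful lookups are cached by `lru_cache`, while a failed import is not cached, so installing the package later and retrying in the same process still works.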

image.save(buffer, format="PNG")
return base64.b64encode(buffer.getvalue()).decode("utf-8")
image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
return f"data:image/png;base64,{image_base64}"
Contributor

Shall we move the f"data:image/png;base64,{image_base64}" template into sglang_rollout.py? It looks like a template that is tightly coupled to the HTTP payload.

Contributor Author

I was thinking about potential future modalities (audio, video, etc.) that may have different MIME types. Keeping the data formatting in each encode function means sglang_rollout.py doesn't need to handle a different MIME type for each modality.
(SGLang actually just matches `data:` and `,` without parsing the MIME type, but including it makes the format less confusing.)
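
To make the trade-off concrete, here is a rough sketch of per-modality data URI encoding, with a decoder that only looks for the `data:` prefix and the first `,` as described above. The function names and the raw-bytes interface are assumptions for illustration; in the PR the bytes come from `image.save(buffer, format="PNG")`, and the decoder is not SGLang's actual implementation.

```python
import base64


def encode_data_uri(payload: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw bytes in a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(payload).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def decode_data_uri(uri: str) -> bytes:
    """Recover the payload by matching only `data:` and the first comma.

    The MIME type in the header is ignored, mirroring the behaviour
    described in the comment above (an assumption, not SGLang code).
    """
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    _header, sep, encoded = uri.partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")
    return base64.b64decode(encoded)
```

With this shape, a future audio or video encoder would only pass a different `mime_type`, and the caller that builds the HTTP payload would not need to change.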

# force return_tensors to None for input_ids
"return_tensors": None,
# images have already been resized by qwen_vl_utils; update this when supporting other models
"do_resize": False,
Contributor Author

Removing this for now, since SGLang re-processes images internally and doesn't expose a do_resize option.

2 participants