[Multimodal] make multimodal processing robust#1516
[Multimodal] make multimodal processing robust#1516coding-famer wants to merge 3 commits intoTHUDM:mainfrom
Conversation
slime/utils/processing_utils.py
Outdated
| from qwen_vl_utils import process_vision_info | ||
| # TODO: temporary solution, will write image utils for slime later | ||
| if _qwen_process_vision_info is None: | ||
| raise ImportError("qwen_vl_utils is not installed. Install it with: pip install qwen-vl-utils") |
There was a problem hiding this comment.
hmm... I don't get why we need to move the import to the function above...
| image.save(buffer, format="PNG") | ||
| return base64.b64encode(buffer.getvalue()).decode("utf-8") | ||
| image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8") | ||
| return f"data:image/png;base64,{image_base64}" |
There was a problem hiding this comment.
shall we move the f"data:image/png;base64,{image_base64}" template into sglang_rollout.py? It seems like a template that is tightly connect to http payload.
There was a problem hiding this comment.
I was thinking about potential future modalities (audio, video, etc.) that may have different MIME types. Keeping the data formatting in each encode functions make sglang_rollout.py doesn't need to handle different MIME types for each modality.
(Although SGLang actually just matches data: and , without parsing the MIME type, but including it makes the format less confusing.)
slime/utils/processing_utils.py
Outdated
| # force return_tensors to None for input_ids | ||
| "return_tensors": None, | ||
| # have been resized by qwen_vl_utils, update this when supporting other models | ||
| "do_resize": False, |
There was a problem hiding this comment.
Removing this for now. Since SGLang re-processes images internally and doesn't expose a do_resize option.
Modifications:
return_tensorsto None and setreturn_tensors='pt'for multimodal inputs.qwen_vl_utilswhen loading processor.