Very impressive work! When I saw that the understanding encoder is used to compress image information into a “text prompt”, it immediately reminded me of the recent deepseek-ocr idea. Your work is truly inspiring.
I have a few questions I would like to discuss:
The core idea of the paper is that the image understanding encoder itself can serve as an effective image description tool, turning sparse textual descriptions into dense "textual prompts" that better guide image generation. However, as mentioned in Section 2.2, in the text-to-image task there is no reference image during inference, only the instruction prompt. Meanwhile, according to Equation (3), the instruction prompt is not involved in RecA training. This leads me to the following concerns:
- Does this introduce a mismatch between training and inference (see the sketch after this list)? Or does the output of the image understanding encoder implicitly encode the original instruction prompt information?
- Traditionally, T2I training maps a short instruction prompt to a detailed image. RecA instead trains with dense understanding prompts to generate detailed images, yet during inference it can still produce detailed images from short instruction prompts. Do you think this improvement comes from modifying the internal knowledge of the UMM? After all, if an instruction is short, shouldn't the generated image also be less detailed? Could this be a limitation of current benchmarks?
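To make the mismatch I am worried about concrete, here is a minimal, purely illustrative sketch of how I currently picture the two conditioning paths. All class and method names (`ToyUMM`, `understanding_encoder`, `reca_training_step`, etc.) are my own placeholders, not taken from the paper or this repo, so please correct me if the actual setup differs:

```python
# Toy sketch of my understanding -- all names are hypothetical,
# not the actual RecA or repository code.
import torch
import torch.nn as nn

class ToyUMM(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.understanding_encoder = nn.Linear(3 * 8 * 8, dim)  # image -> dense "prompt" embedding
        self.text_encoder = nn.Embedding(1000, dim)              # instruction tokens -> embedding
        self.generator = nn.Linear(dim, 3 * 8 * 8)               # condition -> image

    def reca_training_step(self, image):
        # My reading of Eq. (3): the generation branch is conditioned on the
        # dense embedding extracted from the *image itself*; the instruction
        # prompt never appears in this loss.
        dense_prompt = self.understanding_encoder(image.flatten(1))
        recon = self.generator(dense_prompt)
        return nn.functional.mse_loss(recon, image.flatten(1))

    def t2i_inference(self, instruction_tokens):
        # At inference there is no reference image, so the only condition is
        # the (typically short) instruction prompt -- hence my concern about
        # a train/inference mismatch.
        cond = self.text_encoder(instruction_tokens).mean(dim=1)
        return self.generator(cond)

model = ToyUMM()
loss = model.reca_training_step(torch.rand(2, 3, 8, 8))
image = model.t2i_inference(torch.randint(0, 1000, (2, 5)))
```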
I may have misunderstood something, as I am new to UMMs, and I would greatly appreciate your clarification. Thank you again for the excellent work!