Very impressive work! When I saw that the understanding encoder is used to compress image information into a “text prompt”, it immediately reminded me of the recent deepseek-ocr idea. Your work is truly inspiring.
I have a few questions I would like to discuss:
The core idea of the paper is that the image understanding encoder itself can serve as an effective image description tool, turning sparse textual descriptions into dense "textual prompts" that better guide image generation. However, as mentioned in Section 2.2, in the text-to-image task there is no reference image during inference, only the instruction prompt. Meanwhile, according to Equation (3), the instruction prompt is not involved in RecA training. This leads me to the following concerns:
- Does this introduce a mismatch between training and inference (see the sketch after this list)? Or does the output of the image understanding encoder implicitly encode the original instruction prompt information?
- Traditionally, T2I training maps a short instruction prompt to a detailed image. RecA instead trains with dense understanding prompts to generate detailed images, yet during inference it can still produce detailed images from short instruction prompts. Do you think this improvement comes from modifying the internal knowledge of the UMM? After all, if an instruction is short, shouldn't the generated image also be less detailed? Could this be a limitation of current benchmarks?
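To make the mismatch I am worried about concrete, here is a minimal, purely illustrative sketch of how I currently picture the two conditioning paths. All class and method names (`ToyUMM`, `understanding_encoder`, `reca_training_step`, etc.) are my own placeholders, not taken from the paper or this repo, so please correct me if the actual setup differs:

```python
# Toy sketch of my understanding -- all names are hypothetical,
# not the actual RecA or repository code.
import torch
import torch.nn as nn

class ToyUMM(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.understanding_encoder = nn.Linear(3 * 8 * 8, dim)  # image -> dense "prompt" embedding
        self.text_encoder = nn.Embedding(1000, dim)              # instruction tokens -> embedding
        self.generator = nn.Linear(dim, 3 * 8 * 8)               # condition -> image

    def reca_training_step(self, image):
        # My reading of Eq. (3): the generation branch is conditioned on the
        # dense embedding extracted from the *image itself*; the instruction
        # prompt never appears in this loss.
        dense_prompt = self.understanding_encoder(image.flatten(1))
        recon = self.generator(dense_prompt)
        return nn.functional.mse_loss(recon, image.flatten(1))

    def t2i_inference(self, instruction_tokens):
        # At inference there is no reference image, so the only condition is
        # the (typically short) instruction prompt -- hence my concern about
        # a train/inference mismatch.
        cond = self.text_encoder(instruction_tokens).mean(dim=1)
        return self.generator(cond)

model = ToyUMM()
loss = model.reca_training_step(torch.rand(2, 3, 8, 8))
image = model.t2i_inference(torch.randint(0, 1000, (2, 5)))
```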
I may have misunderstood something, as I am new to UMMs, and I would greatly appreciate your clarification. Thank you again for the excellent work!