Question about the mismatch between training and inference prompts #10

@lky-violet

Very impressive work! When I saw that the understanding encoder is used to compress image information into a "text prompt", it immediately reminded me of the recent DeepSeek-OCR idea. Your work is truly inspiring.

I have a few questions I would like to discuss:

The core idea of the paper is that the image understanding encoder itself can serve as an effective image-description tool, converting sparse textual descriptions into dense "textual prompts" that better guide image generation. However, as mentioned in Section 2.2, the text-to-image task has no reference image at inference time, only the instruction prompt, while according to Equation (3) the instruction prompt is not involved in RecA training at all. This leads me to the following concerns (a toy sketch after the list tries to make the asymmetry concrete):

  1. Does this introduce a mismatch between training and inference? Or does the output of the image understanding encoder implicitly encode the original instruction prompt's information?
  2. Traditionally, T2I training maps a short instruction prompt to a detailed image. RecA instead trains on dense understanding prompts to generate detailed images, yet at inference it can still generate detailed images from short instruction prompts. Do you think this improvement comes from modifying the internal knowledge of the UMM? After all, if an instruction is short, shouldn't the generated image also be less detailed? Could this be a limitation of current benchmarks?
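
To spell out what I mean by the asymmetry in question 1, here is a minimal, purely illustrative sketch of the two conditioning regimes as I understand them. Every name in it (`und_encoder`, `text_embed`, the tensor sizes) is a placeholder I made up; this is not the actual RecA codebase's API, and the toy modules merely stand in for the real understanding encoder and text embedder.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins, only to make the two conditioning regimes
# concrete. None of these names come from the actual RecA codebase.
und_encoder = nn.Linear(3 * 32 * 32, 64)  # image -> dense semantic embedding
text_embed = nn.Embedding(1000, 64)       # token ids -> text embeddings

def training_condition(image: torch.Tensor) -> torch.Tensor:
    """RecA training (my reading of Eq. (3)): the condition is the
    understanding encoder's dense embedding of the image itself;
    the original instruction prompt does not enter the loss."""
    with torch.no_grad():  # the understanding encoder stays frozen
        return und_encoder(image.flatten(1))  # (B, 64), image-derived

def inference_condition(prompt_ids: torch.Tensor) -> torch.Tensor:
    """T2I inference: no reference image exists, so the condition is
    only the (often short) instruction prompt's embeddings."""
    return text_embed(prompt_ids).mean(dim=1)  # (B, 64), text-derived

image = torch.randn(1, 3, 32, 32)
prompt_ids = torch.tensor([[12, 7, 301]])      # a 3-token "short" prompt
print(training_condition(image).shape)         # torch.Size([1, 64])
print(inference_condition(prompt_ids).shape)   # torch.Size([1, 64])
```

In other words, the model is trained to map the dense, image-derived condition to an image, but at inference it is handed only the sparse, text-derived one, and my question is why this gap does not hurt.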

I may have misunderstood something, since I am new to UMMs, and I would greatly appreciate your clarification. Thank you again for the excellent work!
