
Question regarding the training objective #20

@ZDDDYuan

Description

  1. I'd like to know how you obtained the proposed training objective, given that the derivation in Appendix A.1 (Equation 9) appears to be wrong. For comparison, I have written out the standard simplified objective after this list.
[Image: the derivation in Appendix A.1, Equation 9]
  2. You mentioned (in Consults in details of training LGD model in Grasp-anything++ database. #4) that t_embedding is the embedding of the input timestep, but the implementation of LGD in the released code does not perform any encoding operation on the timesteps. How can the model then condition its features on the noise level? As far as I know, the timesteps are usually embedded inside the model, as in the implementations in the iddpm and guided-diffusion repos; I have sketched that pattern after this list. Did I misunderstand something?

  3. In your paper, the illustration of the LGD implementation indicates that the text–image representation $z^{*}_{vl} \in \mathbb{R}^{d_{vl}}$ is fed into an MLP. However, I could not find the corresponding implementation in the released code, where $z^{*}_{vl}$ does not appear to be used at all; the last sketch below shows the kind of module I was expecting.
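
On the first point, for reference, this is the standard simplified denoising objective (Ho et al., 2020) that I would expect the derivation to reduce to. I am quoting it from memory, so the notation may differ from the paper's:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right]
$$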
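
On the second point, by "encoding operation on the timesteps" I mean the sinusoidal embedding plus a small projection, as used in iddpm and guided-diffusion. A minimal PyTorch sketch of that pattern (the module and parameter names here are my own, not taken from your code):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(timesteps, dim, max_period=10000):
    """Sinusoidal embedding of diffusion timesteps, as used in
    iddpm / guided-diffusion style denoisers."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    ).to(timesteps.device)
    args = timesteps[:, None].float() * freqs[None]
    emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
    if dim % 2:  # pad if dim is odd
        emb = torch.cat([emb, torch.zeros_like(emb[:, :1])], dim=-1)
    return emb

class TimeConditionedBlock(nn.Module):
    """Hypothetical block: projects the timestep embedding and adds it
    to the features so the network is conditioned on the noise level."""
    def __init__(self, feat_dim, t_dim):
        super().__init__()
        self.t_mlp = nn.Sequential(nn.SiLU(), nn.Linear(t_dim, feat_dim))

    def forward(self, h, t_emb):
        # h: (batch, feat_dim, ...) features; t_emb: (batch, t_dim)
        scale = self.t_mlp(t_emb)
        while scale.dim() < h.dim():  # broadcast over spatial/sequence dims
            scale = scale[..., None]
        return h + scale
```

Usage would be something like `h = block(h, timestep_embedding(t, t_dim))` inside the denoiser's forward pass.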
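
On the third point, this is the kind of module I expected to find for $z^{*}_{vl}$ based on the figure. It is only a guess at what the diagram implies; the names (`VLConditioning`, `d_vl`, `d_model`) are mine, not from the released code:

```python
import torch.nn as nn

class VLConditioning(nn.Module):
    """Guess at the MLP shown in the paper's figure: projects the fused
    text-image representation z_vl into the denoiser's feature space."""
    def __init__(self, d_vl, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vl, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, z_vl):
        # z_vl: (batch, d_vl) -> conditioning vector (batch, d_model)
        return self.mlp(z_vl)
```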

I would be most grateful for any clarification you could provide on these matters.
Thank you in advance for your time and consideration.
