
Question regarding the training objective #20

@ZDDDYuan

Description

  1. I'd like to know how you obtained the proposed training objective, given that the derivation in Appendix A.1 (Equation 9) appears to be wrong. For comparison, I have written out the standard simplified objective after this list.
[Image: the derivation in Appendix A.1, Equation 9]
  2. You mentioned (in Consults in details of training LGD model in Grasp-anything++ database. #4) that t_embedding is the embedding of the input timestep, but the implementation of LGD in the released code does not perform any encoding operation on the timesteps. How can the model then condition its features on the noise level? As far as I know, the timesteps are usually embedded inside the model, as in the implementations in the iddpm and guided-diffusion repos; I have sketched that pattern after this list. Did I misunderstand something?

  3. In your paper, the illustration of the LGD implementation indicates that the text–image representation $z^{*}_{vl} \in \mathbb{R}^{d_{vl}}$ is fed into an MLP. However, I could not find the corresponding implementation in the released code, where $z^{*}_{vl}$ does not appear to be used at all; the last sketch below shows the kind of module I was expecting.
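
On the first point, for reference, this is the standard simplified denoising objective (Ho et al., 2020) that I would expect the derivation to reduce to. I am quoting it from memory, so the notation may differ from the paper's:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right]
$$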
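
On the second point, by "encoding operation on the timesteps" I mean the sinusoidal embedding plus a small projection, as used in iddpm and guided-diffusion. A minimal PyTorch sketch of that pattern (the module and parameter names here are my own, not taken from your code):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(timesteps, dim, max_period=10000):
    """Sinusoidal embedding of diffusion timesteps, as used in
    iddpm / guided-diffusion style denoisers."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    ).to(timesteps.device)
    args = timesteps[:, None].float() * freqs[None]
    emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
    if dim % 2:  # pad if dim is odd
        emb = torch.cat([emb, torch.zeros_like(emb[:, :1])], dim=-1)
    return emb

class TimeConditionedBlock(nn.Module):
    """Hypothetical block: projects the timestep embedding and adds it
    to the features so the network is conditioned on the noise level."""
    def __init__(self, feat_dim, t_dim):
        super().__init__()
        self.t_mlp = nn.Sequential(nn.SiLU(), nn.Linear(t_dim, feat_dim))

    def forward(self, h, t_emb):
        # h: (batch, feat_dim, ...) features; t_emb: (batch, t_dim)
        scale = self.t_mlp(t_emb)
        while scale.dim() < h.dim():  # broadcast over spatial/sequence dims
            scale = scale[..., None]
        return h + scale
```

Usage would be something like `h = block(h, timestep_embedding(t, t_dim))` inside the denoiser's forward pass.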
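
On the third point, this is the kind of module I expected to find for $z^{*}_{vl}$ based on the figure. It is only a guess at what the diagram implies; the names (`VLConditioning`, `d_vl`, `d_model`) are mine, not from the released code:

```python
import torch.nn as nn

class VLConditioning(nn.Module):
    """Guess at the MLP shown in the paper's figure: projects the fused
    text-image representation z_vl into the denoiser's feature space."""
    def __init__(self, d_vl, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vl, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, z_vl):
        # z_vl: (batch, d_vl) -> conditioning vector (batch, d_model)
        return self.mlp(z_vl)
```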

I would be most grateful for any clarification you could provide on these matters.
Thank you in advance for your time and consideration.
