Question about the dimension of image tokens E_i

Question

edward3862 opened this issue a year ago · 1 comments

Thanks for this interesting work!

I have one minor concern about the dimension of the image tokens E_i. As presented in Sec. 3.1, the dimension of E_i is (1+L_i)×d, and the dimension of E_t is (1+L_t)×d.

In my understanding, the image tokens E_i and the text tokens E_t should have different feature dimensions under CLIP ViT-L/14: 1024 and 768, respectively. During inference, either E_i or E_t is injected into the same cross-attention layer of the diffusion UNet. How is this dimension mismatch handled?

Please point it out if I have misunderstood anything. Thank you :)

Answer 1 · 2024-01-22T17:38:24.000Z

Hi,
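Thanks for the question. In our framework, we train two separate models, one for image-conditioned generation and one for text-conditioned generation, so E_i and E_t are never fed into the same network. Moreover, rather than injecting the conditioning tokens through cross-attention, we concatenate them with the shape latents in the UNet-ViT.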
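
To illustrate why two separate models avoid the mismatch, here is a minimal NumPy sketch of concatenating conditioning tokens with shape latents along the token axis. It is not the repository's code: the helper name `concat_condition_with_latents`, the number of shape latents (256), and the assumption that the latents in each model share the width of that model's conditioning tokens are all hypothetical and chosen only for illustration. The token counts assume CLIP ViT-L/14 at 224×224 input (1 class token + 256 patch tokens) and a 77-token text context.

```python
import numpy as np


def concat_condition_with_latents(cond_tokens, shape_latents):
    """Concatenate conditioning tokens and shape latents along the token axis.

    Hypothetical helper: both inputs are (num_tokens, width) arrays and
    must share the same width, otherwise concatenation is not defined.
    """
    if cond_tokens.shape[-1] != shape_latents.shape[-1]:
        raise ValueError(
            f"width mismatch: {cond_tokens.shape[-1]} vs {shape_latents.shape[-1]}"
        )
    return np.concatenate([cond_tokens, shape_latents], axis=0)


rng = np.random.default_rng(0)

# Image-conditioned model (assumed width 1024, matching the CLIP image tokens).
image_tokens = rng.standard_normal((1 + 256, 1024))   # E_i: (1 + L_i) x d
image_model_latents = rng.standard_normal((256, 1024))  # assumed latent count

# Text-conditioned model (assumed width 768, matching the CLIP text tokens).
text_tokens = rng.standard_normal((77, 768))           # E_t: (1 + L_t) x d
text_model_latents = rng.standard_normal((256, 768))    # assumed latent count

image_seq = concat_condition_with_latents(image_tokens, image_model_latents)
text_seq = concat_condition_with_latents(text_tokens, text_model_latents)

print(image_seq.shape)
print(text_seq.shape)
```

Because each model only ever sees one kind of conditioning, its width d can be fixed to that encoder's output. Feeding text tokens to the image-conditioned model in this sketch would raise the `ValueError` above; a single shared model would instead need something like a learned projection to a common width, which is not what the answer describes.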