SHI-Labs/Versatile-Diffusion

How did you construct the paired data for text/image variation?

shaform opened this issue · 1 comment

From Figure 3 in the paper, it seems that training text/image variation requires paired data: the input on the left side is the image/caption of the horse, while the VAE latent vectors on the right side are generated from a different variant of that horse image/text. However, the laion2b-en dataset appears to contain only image-text pairs. How did you construct the variants of those images and texts during training?

We use the LAION-2B data for both image-to-text and text-to-image. For image variation, we just use the same image.
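
In other words, no separate "variant" images are needed. Below is a minimal sketch (not the repo's actual dataloader; the `Sample`/`make_training_example` names are hypothetical) of how a single LAION-style (image, caption) pair could feed all three tasks, with image variation conditioning on the image and reconstructing that same image:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Sample:
    image_path: str   # path to the LAION image
    caption: str      # its paired caption

def make_training_example(sample: Sample, task: str) -> Tuple[str, str]:
    """Return a (condition, target) pair for one training step.

    `task` is one of "text2img", "img2text", "img_variation".
    For image variation the condition and the target are the same image,
    so no extra variant images have to be constructed.
    """
    if task == "text2img":
        return sample.caption, sample.image_path      # caption -> image latent
    if task == "img2text":
        return sample.image_path, sample.caption      # image -> caption latent
    if task == "img_variation":
        return sample.image_path, sample.image_path   # image -> same image latent
    raise ValueError(f"unknown task: {task}")

if __name__ == "__main__":
    pair = Sample("laion/000001.jpg", "a brown horse standing in a field")
    for task in ("text2img", "img2text", "img_variation"):
        print(task, "->", make_training_example(pair, task))
```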