How did you construct the paired data for text/image variation?
shaform opened this issue · 1 comment
shaform commented
From Figure 3 in the paper, it appears that training text/image variation requires paired data: the input on the left side is the image/caption of the horse, while the VAE latent vectors are generated from a different variant of the horse image/text on the right side. However, the laion2b-en dataset seems to contain only image-text pairs. How did you construct the variants of those images and texts during training?
xingqian2018 commented
We use the laion2B data for both image-to-text and text-to-image. For image variation, we just use the same image on both sides.
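To make the answer concrete, here is a minimal sketch of how the (condition, target) pairs could be assembled from a LAION-style image-caption dataset. The function name and task labels are illustrative assumptions, not taken from the actual codebase:

```python
def make_pair(image, caption, task):
    """Return a (condition, target) training pair for one example.

    Hypothetical helper: `image` and `caption` stand in for an image
    tensor and its caption from a LAION-style (image, caption) pair.
    """
    if task == "text_to_image":
        # Caption conditions the model; the image is the target.
        return caption, image
    if task == "image_to_text":
        # Image conditions the model; the caption is the target.
        return image, caption
    if task == "image_variation":
        # No separate variant is needed: the same image serves as
        # both the condition and the target.
        return image, image
    raise ValueError(f"unknown task: {task}")
```

The key point is the last branch: image variation does not require a second, distinct variant of each image, so ordinary image-text pairs suffice for all three tasks.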