Generation inference with interleaved input
ys-zong commented
Hi, thanks for the nice work! I wonder whether Show-o supports inference with interleaved multimodal inputs, e.g., [text 1] [image 1] [text 2] [image 2] [text 3] -> generate a new image. If so, could you provide a code snippet for this? From what I can see, the current inference code only accepts a single image or an image-text pair. Many thanks!
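
For concreteness, here is a rough sketch of what I have in mind. Everything below is hypothetical: `text_tokenizer`, `image_tokenizer`, and the special-token ids are placeholders rather than the actual Show-o API. The idea is just to flatten the interleaved segments into a single discrete-token sequence that could then be fed to the transformer:

```python
# Hypothetical sketch, NOT the actual Show-o API: flatten an interleaved
# [text, image, text, ...] prompt into one sequence of discrete token ids.
# `text_tokenizer`, `image_tokenizer`, `soi_id`, and `eoi_id` are placeholders.
import torch

def build_interleaved_sequence(segments, text_tokenizer, image_tokenizer,
                               soi_id, eoi_id):
    """Flatten interleaved segments into one 1-D LongTensor of token ids.

    segments: list of ("text", str) or ("image", Tensor[C, H, W]) pairs.
    soi_id / eoi_id: start-/end-of-image special token ids.
    """
    ids = []
    for kind, payload in segments:
        if kind == "text":
            # Text goes through the ordinary text tokenizer.
            ids.extend(text_tokenizer.encode(payload))
        else:
            # Images are quantized to discrete codes (e.g., by a VQ tokenizer)
            # and wrapped in start/end-of-image markers.
            codes = image_tokenizer.encode(payload.unsqueeze(0)).flatten()
            ids.append(soi_id)
            ids.extend(codes.tolist())
            ids.append(eoi_id)
    return torch.tensor(ids, dtype=torch.long)
```

After building such a sequence, I would expect to append a start-of-image token and let the model sample the remaining image-token slots, similar to the existing text-to-image path. Is something along these lines feasible with the released code?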
KebinWu commented
I'm not sure whether the code supports this, but I wouldn't expect the model to perform well on such tasks, since interleaved samples were not used during training.
Sierkinhane commented
Hi, mixed-modality generation will be released in the future, but the timeline is still undetermined.