showlab/Show-o

Generation inference with interleaved input


Hi, thanks for the nice work! I wonder whether Show-o supports inference with interleaved multimodal inputs, e.g., [text 1] [image 1] [text 2] [image 2] [text 3] -> generate a new image. If so, could you provide a code snippet for this? As far as I can tell, the current inference code only accepts a single image or one image-text pair. Many thanks!
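To make the request concrete, here is a rough sketch of the kind of interleaved sequence construction I have in mind. None of these names (`build_interleaved_sequence`, `image_tokenizer.encode`, `model.generate`) come from the Show-o repo; they are placeholders for whatever the real API would look like:

```python
# Hypothetical sketch only -- these helpers are NOT Show-o's actual API.
# It just illustrates the interleaved input format being asked about.
import torch

def build_interleaved_sequence(text_tokenizer, image_tokenizer, segments):
    """Concatenate alternating text/image segments into one token stream.

    segments: list of ("text", str) or ("image", PIL.Image) tuples.
    Images are assumed to be encoded into discrete codebook indices.
    """
    ids = []
    for kind, content in segments:
        if kind == "text":
            ids.extend(text_tokenizer.encode(content))
        else:  # "image": flatten the discrete image tokens into the stream
            ids.extend(image_tokenizer.encode(content).flatten().tolist())
    return torch.tensor(ids).unsqueeze(0)  # add batch dimension

# Desired usage: condition on [text 1][image 1][text 2][image 2][text 3],
# then sample tokens for a new image (model.generate is hypothetical):
# input_ids = build_interleaved_sequence(tok, vq, [
#     ("text", "A photo of"),  ("image", img1),
#     ("text", "next to"),     ("image", img2),
#     ("text", "in one combined scene:"),
# ])
# new_image_tokens = model.generate(input_ids)
```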

I'm not sure whether the code supports this, but in any case I wouldn't expect the model to perform well on such tasks, since interleaved samples were not used during training.

Hi, mixed-modality generation will be released in the future, but the timeline is still undetermined.