jy0205/LaVIT

Image generation with the multi-modal prompt


Hi @jy0205,

Thank you for your excellent work! I would appreciate it if you could provide more details about image generation with multi-modal prompts. I noticed that you mentioned in Appendix A:

Given a multi-modal prompt (image, text, or their combinations), LaVIT first tokenizes it as a sequence of discrete tokens.

So if the prompt includes both image and text, do you tokenize both into discrete tokens? I am a bit confused because in Section 3.2 you stated:

To empower LaVIT with the capability to generate both text and images, we employ two different concatenation forms, i.e., [image, text] and [text; image]. When the image is used as a condition (on the left) to generate text, we use the continuous visual features Xr from the token merger instead of quantized visual embeddings as the input to LLMs.

Therefore, when the image is included in the prompt, do you use the continuous features?

Or do you choose between continuous and discrete features based on the type of task, for instance, continuous features for understanding (text generation) and discrete tokens for image generation?

Thanks again!

Yes, we use continuous features only for understanding tasks (text generation). For all image generation, the prompts (including both image and text) are tokenized into discrete tokens. You can refer to the multimodal_synthesis function in the code for more details.
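Conceptually, the two input paths look roughly like the sketch below. The helper names (visual_tokenizer, token_merger, text_tokenizer, text_embedder) are hypothetical placeholders rather than the real LaVIT code; multimodal_synthesis contains the actual implementation.

```python
# Rough sketch of the two input paths, assuming hypothetical helpers
# (visual_tokenizer, token_merger, text_tokenizer, text_embedder); these
# are NOT the actual LaVIT API. See multimodal_synthesis in the repo for
# the real implementation.
import torch

def build_understanding_input(image, text, token_merger, text_tokenizer, text_embedder):
    """[image, text] -> text: the image acts as a condition, so the LLM
    receives the continuous visual features X_r from the token merger."""
    visual_feats = token_merger(image)                    # (num_visual_tokens, dim), continuous
    text_embeds = text_embedder(text_tokenizer(text))     # (num_text_tokens, dim)
    return torch.cat([visual_feats, text_embeds], dim=0)  # fed to the LLM as embeddings

def build_generation_input(image, text, visual_tokenizer, text_tokenizer):
    """Prompt -> image: the whole prompt, image included, is quantized into
    discrete token ids, and the LLM continues the sequence with new visual tokens."""
    visual_ids = visual_tokenizer(image)                  # discrete codebook ids
    text_ids = text_tokenizer(text)                       # discrete text ids
    return torch.cat([visual_ids, text_ids], dim=0)       # fed to the LLM as token ids
```

The key point is that for synthesis the prompt image goes through the quantizer as well, so the LLM operates on a single discrete vocabulary and can autoregressively emit the visual tokens of the image to be generated.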

I really appreciate your response; it was of great help to me.