Question about how the caption is input

Question

liuxuannan opened this issue a year ago · 0 comments

I have a question about where the caption goes, and in what format, in the input of the instruction data. For example, the paper contains the sentence "A video of a Super-hero Movie." Is this sentence part of the text prompt, or does it need to be embedded by the ImageBind model before being fed into the LLM?