sshh12/multi_token

Summarize video

Closed this issue · 1 comment

If my multi-image input is a sequence of consecutive frames, such as screenshots, what should my prompt be when I use Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA?

This is the format I used:

<image><image><image> What is happening in these frames?

That said, I'm not sure how well it will work, given that my training data was mainly compare/contrast rather than video understanding.

It was only trained on up to 6 images, though it may work for more.
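
For longer screen recordings, one option is to subsample the frames down to the 6-image limit and emit one `<image>` tag per kept frame. The following is a minimal sketch of that idea in plain Python. The helper names `sample_frames` and `build_prompt` are hypothetical and not part of multi_token, the evenly spaced sampling is just one reasonable choice, and how the selected images are actually passed to the model is not shown here.

```python
from typing import List, Sequence

# Taken from the reply above: the model was trained on up to 6 images.
MAX_TRAINED_IMAGES = 6


def sample_frames(frames: Sequence[str], max_frames: int = MAX_TRAINED_IMAGES) -> List[str]:
    """Return at most max_frames items, evenly spaced, keeping the first and last.

    Hypothetical helper for illustration only.
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    n = len(frames)
    if n <= max_frames:
        return list(frames)
    if max_frames == 1:
        return [frames[0]]
    # Spread max_frames indices evenly over the range [0, n - 1].
    step = (n - 1) / (max_frames - 1)
    return [frames[round(i * step)] for i in range(max_frames)]


def build_prompt(num_images: int, question: str) -> str:
    """Build a prompt with one <image> tag per image, followed by the question.

    Hypothetical helper that mirrors the prompt format quoted in this thread.
    """
    return "<image>" * num_images + " " + question


if __name__ == "__main__":
    # Placeholder file names standing in for 20 screenshots.
    frames = [f"frame_{i:03d}.png" for i in range(20)]
    picked = sample_frames(frames)
    print(picked)
    print(build_prompt(len(picked), "What is happening in these frames?"))
```

Keeping the first and last frame means the prompt still covers the whole clip, but fast changes between sampled frames will be missed.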