HeliosZhao/Make-A-Protagonist

Is BLIP2 extracting the image description or the video description?

Closed this issue · 1 comment

The official BLIP-2 extracts descriptions from images, but your paper mentions extracting descriptions from videos. Can you explain exactly how that is done?

Hi, thanks for your interest in our work.

As mentioned in Sec. 3.2, we extract the visual features of each frame with the image encoder of BLIP-2, then concatenate the visual tokens from all frames and feed them into the Q-Former.

You may also refer to the code in experts/BLIP2/blip_video_model.py
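
For readers who want the data flow at a glance, here is a rough, dependency-free sketch of the idea described above: encode each frame separately, concatenate the tokens of all frames along the token axis, and let a fixed set of queries attend over the result. It is not the repository's code. `encode_frame`, `cross_attend`, and `video_query_features` are hypothetical names, the encoder is a random stand-in for the BLIP-2 image encoder, and a single cross-attention step stands in for the Q-Former, which in reality is a multi-layer transformer with learned queries.

```python
import math
import random


def encode_frame(frame_id, num_tokens=4, dim=8):
    """Toy stand-in for the image encoder (assumption, not BLIP-2 itself).

    Returns `num_tokens` visual tokens of size `dim` for one frame.
    """
    rng = random.Random(frame_id)  # deterministic per frame
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]


def cross_attend(queries, tokens):
    """Single-head cross-attention: every query attends over all tokens."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) / total for i in range(dim)]
        )
    return outputs


def video_query_features(frame_ids, queries):
    # 1) Encode each frame independently.
    per_frame = [encode_frame(f) for f in frame_ids]
    # 2) Concatenate the visual tokens of all frames along the token axis.
    all_tokens = [tok for frame_tokens in per_frame for tok in frame_tokens]
    # 3) Let the fixed set of queries attend over the concatenated tokens.
    return all_tokens, cross_attend(queries, all_tokens)


rng = random.Random(0)
queries = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # 3 toy queries

tokens, feats = video_query_features([0, 1, 2, 3, 4], queries)
print(len(tokens), len(feats), len(feats[0]))  # 20 3 8
```

The point of the sketch is that the number of output features depends only on the number of queries, not on the number of frames, so a longer video simply gives the queries more tokens to attend over.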