Is BLIP2 extracting the image description or the video description?
culeao commented
The official BLIP-2 extracts descriptions for images, but your paper mentions extracting descriptions for videos. Can you explain exactly how that is done?
HeliosZhao commented
Hi, thanks for your interest in our work.
As mentioned in Sec.3.2, we extract the visual feature of each frame with the image encoder of BLIP-2, and then concatenate all the visual tokens and feed them into the Q-Former.
You may also refer to the code in experts/BLIP2/blip_video_model.py
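For readers who only want the shape-level idea, here is a minimal sketch of the frame-wise pipeline described above. It is not the code from the repository: `encode_frame` and `cross_attend` are hypothetical stand-ins for the BLIP-2 image encoder and the Q-Former, all sizes are made up, and concatenating along the token axis is an assumed reading of "concatenate all the visual tokens".

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the values used in the repository).
NUM_FRAMES, TOKENS_PER_FRAME, PATCH_DIM, DIM, NUM_QUERIES = 4, 16, 48, 32, 8

# Fixed random projection that stands in for the weights of an image encoder.
PROJ = rng.standard_normal((PATCH_DIM, DIM))


def encode_frame(frame):
    """Hypothetical stand-in for the BLIP-2 image encoder.

    Maps one frame of shape (TOKENS_PER_FRAME, PATCH_DIM) to visual tokens
    of shape (TOKENS_PER_FRAME, DIM).
    """
    return frame @ PROJ


def cross_attend(queries, visual_tokens):
    """Hypothetical stand-in for the Q-Former: queries attend to all tokens."""
    scores = queries @ visual_tokens.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ visual_tokens


# A fake "video": NUM_FRAMES frames, each already split into patches.
frames = rng.standard_normal((NUM_FRAMES, TOKENS_PER_FRAME, PATCH_DIM))

# Step 1: encode every frame independently with the image encoder.
per_frame_tokens = [encode_frame(frame) for frame in frames]

# Step 2: concatenate the visual tokens of all frames into one sequence.
video_tokens = np.concatenate(per_frame_tokens, axis=0)

# Step 3: feed the concatenated sequence to the (stand-in) Q-Former.
queries = rng.standard_normal((NUM_QUERIES, DIM))
query_output = cross_attend(queries, video_tokens)

print(video_tokens.shape, query_output.shape)
```

The point of the sketch is that the query output keeps a fixed size (`NUM_QUERIES` by `DIM`) regardless of how many frames are concatenated, which is why an image-level model can be reused for video in this way. For the actual implementation, see the file referenced above.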