mlvlab/Flipped-VQA

How to extract features using CLIP ViT-L?

Closed this issue · 2 comments

yuanrr commented

I used CLIP to extract video features on my own dataset, but qav_loss did not decrease at all. I tried the provided features on the NExT-QA dataset and found that qav_loss converges within 5 epochs. I don't know why...
Could you provide the code for the feature extraction part? Thank you very much...

ikodoh commented

Thank you for your interest in our work.
We extracted the CLIP features using the code from FrozenBiLM.
You can refer to the 'Video Feature Extraction' part of that repository.
Also, if your own dataset does not have enough temporal dynamics in the video, qav_loss may not decrease.
If you have any questions, please let me know.
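For readers looking for a starting point before consulting the FrozenBiLM extraction script: a common first step in such pipelines is uniformly sampling a fixed number of frames from the video, each of which is then encoded with the CLIP image encoder. Below is a minimal sketch of that sampling step only; the `clip_model.encode_image` call mentioned in the comment is hypothetical shorthand, and the actual extraction details (FPS, preprocessing, checkpoint) should be taken from the FrozenBiLM repository.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Uniformly sample `num_samples` frame indices from a video of
    `num_frames` frames by taking the center of each equal-width bin.

    This mirrors the fixed-length uniform sampling commonly used before
    running frames through a CLIP ViT-L image encoder.
    """
    if num_samples > num_frames:
        raise ValueError("cannot sample more frames than the video has")
    # Bin edges: num_samples equal segments covering [0, num_frames).
    edges = np.linspace(0, num_frames, num_samples + 1)
    # Take the midpoint of each segment as the sampled frame index.
    indices = ((edges[:-1] + edges[1:]) / 2).astype(int)
    return indices

# Example: pick 10 frames from a 100-frame clip.
idx = sample_frame_indices(100, 10)
# The frames at these indices would then be preprocessed and passed
# through the CLIP image encoder, e.g.
#   features = clip_model.encode_image(preprocessed_frames)
# where `clip_model` is a hypothetical handle to a loaded CLIP ViT-L.
```

The sampled indices are evenly spaced and cover the whole clip, which matters for qav_loss: if the sampled frames carry too little temporal variation (e.g. near-static video), the temporal objective has little signal to learn from, as noted above.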

yuanrr commented

Thank you very much for your reply! I will try it.