mlvlab/Flipped-VQA

How to extract features using CLIP ViT-L?

Closed this issue · 2 comments

yuanrr commented

I used CLIP to extract video features on my own dataset, but qav_loss did not decrease at all. I tried the provided features on the NExT-QA dataset and found that qav_loss converges within 5 epochs. I don't know why...
Could you provide the code for the feature extraction part? Thank you very much...

ikodoh commented

Thank you for your interest in our work.
We extracted the CLIP features using the code from FrozenBiLM.
You can refer to the 'Video Feature Extraction' part of that repository.
Also, if your own dataset does not have enough temporal dynamics in the video, qav_loss may not decrease.
If you have any questions, please let me know.
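For readers looking for a starting point before consulting the FrozenBiLM extraction script: a common first step in such pipelines is uniformly sampling a fixed number of frames from the video, each of which is then encoded with the CLIP image encoder. Below is a minimal sketch of that sampling step only; the `clip_model.encode_image` call mentioned in the comment is hypothetical shorthand, and the actual extraction details (FPS, preprocessing, checkpoint) should be taken from the FrozenBiLM repository.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Uniformly sample `num_samples` frame indices from a video of
    `num_frames` frames by taking the center of each equal-width bin.

    This mirrors the fixed-length uniform sampling commonly used before
    running frames through a CLIP ViT-L image encoder.
    """
    if num_samples > num_frames:
        raise ValueError("cannot sample more frames than the video has")
    # Bin edges: num_samples equal segments covering [0, num_frames).
    edges = np.linspace(0, num_frames, num_samples + 1)
    # Take the midpoint of each segment as the sampled frame index.
    indices = ((edges[:-1] + edges[1:]) / 2).astype(int)
    return indices

# Example: pick 10 frames from a 100-frame clip.
idx = sample_frame_indices(100, 10)
# The frames at these indices would then be preprocessed and passed
# through the CLIP image encoder, e.g.
#   features = clip_model.encode_image(preprocessed_frames)
# where `clip_model` is a hypothetical handle to a loaded CLIP ViT-L.
```

The sampled indices are evenly spaced and cover the whole clip, which matters for qav_loss: if the sampled frames carry too little temporal variation (e.g. near-static video), the temporal objective has little signal to learn from, as noted above.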

yuanrr commented

Thank you very much for your reply! I will try it.