microsoft/VideoX

Processing text with specific video

nattikahana opened this issue · 1 comment

Hi,
I read your X-CLIP article, and first of all I would like to say it's fascinating. Second, I would like to ask about the multi-head self-attention. My goal is to build a database of all video embeddings and then search it with a text query. The search uses the text only, so I can't specify which video the text should be processed with. Is there a way to skip the multi-head self-attention part?
Thanks a lot.

nbl97 commented

Thanks for your interest.
The video-specific prompting is designed to enhance the text representation, which contains only limited label information. For your project, I think the simplest approach is to remove the prompting mechanism, including the multi-head attention and the FFN. You may need to remove some related code manually. Please feel free to ping me if you have further questions.
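
For reference, here is a minimal sketch of the retrieval setup described in the question, assuming the video embeddings are computed once offline and the text embedding is taken straight from the text encoder with the prompting module removed. It does not use any code from this repo: `build_index`, `search`, and the toy vectors are hypothetical names and placeholder data, and cosine similarity is assumed as the ranking score.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(video_embeddings):
    """Normalize precomputed video embeddings once, keyed by video id.

    `video_embeddings` is a hypothetical dict: video id -> list of floats.
    """
    return {vid: l2_normalize(emb) for vid, emb in video_embeddings.items()}


def search(index, text_embedding, top_k=3):
    """Rank all indexed videos against one text embedding.

    The text embedding is assumed to come from the text encoder alone,
    with no video-specific prompting applied.
    """
    query = l2_normalize(text_embedding)
    scored = [
        (sum(q * v for q, v in zip(query, emb)), vid)
        for vid, emb in index.items()
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(vid, score) for score, vid in scored[:top_k]]


# Toy 2-D vectors standing in for real encoder outputs.
index = build_index({"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]})
results = search(index, [2.0, 0.0], top_k=2)
```

Because the text side no longer depends on any particular video, the index can be built once and reused for every query.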