microsoft/VideoX

Question about sampling frames in XCLIP

HanleiZhang opened this issue · 1 comments

Thank you for your great work! I have recently been working on extracting video features using the XCLIP pre-trained model and have a couple of questions.

Firstly, I have noticed that the model only supports 8, 16, or 32 frames, depending on the pre-trained checkpoint. However, in real-life scenarios, I need to extract specific keyframes, and their number is not fixed. To address this, I have modified the source code of the XCLIP implementation in HuggingFace Transformers so that the number of frames is passed to the relevant classes, such as XCLIPVisionEncoderLayer, XCLIPVisionEncoder, and XCLIPVisionTransformer. I then ran the program to obtain video features. Do you think this is a reasonable approach to extracting video features?
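For comparison, an alternative that leaves the model code untouched is to resample the variable-length keyframe list to the frame count the checkpoint expects. The following is only a minimal sketch of that idea, not something taken from the XCLIP code: the helper name `resample_indices` is hypothetical, and it assumes that repeating keyframes (when there are too few) or skipping some (when there are too many) is acceptable for the use case.

```python
from typing import List


def resample_indices(num_available: int, num_target: int) -> List[int]:
    """Pick num_target indices spread evenly over range(num_available).

    Hypothetical helper: indices repeat when num_available < num_target
    and some are skipped when num_available > num_target.
    """
    if num_available <= 0 or num_target <= 0:
        raise ValueError("both counts must be positive")
    if num_target == 1:
        # A single slot: take the middle keyframe.
        return [num_available // 2]
    # Spacing between consecutive picks, so the first and last
    # keyframes are always included.
    step = (num_available - 1) / (num_target - 1)
    return [round(i * step) for i in range(num_target)]


# Example: 20 extracted keyframes mapped onto an 8-frame checkpoint.
keyframe_ids = resample_indices(20, 8)
print(keyframe_ids)
```

The returned indices can then be used to select frames before they are handed to the processor, so the pre-trained position embeddings stay valid.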

Secondly, I have observed that the XCLIPMultiframeIntegrationTransformer class builds its position embedding for the default number of frames (8, 16, or 32), so it does not support a different number of frames. Therefore, I have omitted the position embedding when extracting video features. Would this significantly affect the results of the video feature extraction process?
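One option besides omitting the position embedding is to linearly interpolate the pre-trained table along the frame axis so that it matches the new frame count. The sketch below illustrates the arithmetic in plain Python; it is an assumption-laden illustration, not the method used by the authors or by Transformers. It assumes the embedding can be viewed as a table with one row per frame, and the function name `interpolate_position_embedding` is hypothetical. In a PyTorch setting, `torch.nn.functional.interpolate` with `mode="linear"` would be the more usual tool for the same operation.

```python
from typing import List


def interpolate_position_embedding(
    table: List[List[float]], num_frames: int
) -> List[List[float]]:
    """Linearly resample a (T_src, D) table to (num_frames, D).

    Hypothetical helper; assumes one embedding row per frame.
    """
    t_src = len(table)
    if t_src == 0 or num_frames <= 0:
        raise ValueError("table must be non-empty and num_frames positive")
    # Map target frame i to a fractional source position in [0, t_src - 1].
    scale = (t_src - 1) / (num_frames - 1) if num_frames > 1 else 0.0
    result = []
    for i in range(num_frames):
        pos = i * scale
        lo = min(int(pos), t_src - 1)   # source row at or below pos
        hi = min(lo + 1, t_src - 1)     # next source row, clamped at the end
        w = pos - lo                    # blend weight toward the upper row
        result.append(
            [(1.0 - w) * a + w * b for a, b in zip(table[lo], table[hi])]
        )
    return result


# Example: stretch a 3-frame table with 2-dimensional rows to 5 frames.
source = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
print(interpolate_position_embedding(source, 5))
```

Whether interpolated embeddings preserve enough temporal information without fine-tuning is not established here and would need to be checked on the target data.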

nbl97 commented

Thanks for your interest. For the first question, the answer is yes. For the second question, it depends on the type of video. If the video content depends strongly on temporal order, as with actions like push and pull, omitting the position embedding can have a big impact. If not, the impact will be much smaller.