microsoft/VideoX

Sampling strategy in X-CLIP

MohamedAfham opened this issue · 2 comments

Hi,

Thanks for the amazing work. I noticed in Table 11 of the paper that you compare sampling methods for X-CLIP (dense vs. sparse sampling). Is there a config option in the code that we can use to switch the sampling from sparse to dense? I'm assuming the default sampling in the code is sparse.

Thanks.

nbl97 commented

Thanks for your interest. Yes, the default sampling strategy is sparse sampling. You can change the strategy by modifying this line for training and this line for evaluation. For example, dict(type='SampleFrames', clip_len=config.data.num_segments, frame_interval=64//config.data.num_segments, num_clips=1) means sampling config.data.num_segments frames densely. The data pipeline is borrowed from mmaction2, where you can learn more about the configuration details.
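To make the difference concrete, here is a simplified sketch of what the two strategies sample. This is not the code from the repo: the real mmaction2 `SampleFrames` transform adds randomization and edge handling, and the function names and numbers below are illustrative only.

```python
def sparse_indices(num_segments, seg_len):
    # Sparse (TSN-style): split the video into num_segments equal
    # segments and take the center frame of each (deterministic variant).
    seg = seg_len / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]

def dense_indices(num_segments, seg_len, start=0):
    # Dense: one clip of num_segments frames with a fixed stride,
    # frame_interval = 64 // num_segments as in the config above,
    # so the clip always spans a 64-frame window.
    interval = 64 // num_segments
    return [min(start + i * interval, seg_len - 1) for i in range(num_segments)]

print(sparse_indices(8, 160))  # frames spread across the whole video
print(dense_indices(8, 160))   # frames from one 64-frame window
```

Sparse sampling covers the full duration with a coarse stride, while dense sampling reads a short contiguous window at fine temporal resolution.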

Hey,

For extracting video features I intend to use xclip-base-patch16-zero-shot as is.
My goal is semantic video search, done with get_text_features.
So I need the text embedding to be close to the video embedding in the vector space, which is how I came across X-CLIP.

I'm wondering about the 32 frames sampling strategy as well.

If I go with sampling like shown here:

indices = sample_frame_indices(clip_len=32, frame_sample_rate=4, seg_len=len(videoreader))

I just get a random window of 32 frames, taking every 4th frame (i.e. a random 128-frame slice of the video).

This means I could miss a lot of the action and content of the video.
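For reference, a version of this helper that circulates in the Hugging Face example snippets looks roughly like the following; treat it as a sketch rather than the exact library code.

```python
import numpy as np

def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random contiguous window of clip_len * frame_sample_rate
    # frames, then take clip_len evenly spaced indices inside it.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)

idx = sample_frame_indices(clip_len=32, frame_sample_rate=4, seg_len=1000)
# All 32 indices fall inside a single 128-frame window; the rest of
# the video is never seen, which is exactly the concern raised above.
```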

My question:
What do you think about the strategy of extracting key frames, taking the top 32, and passing them to X-CLIP ordered by time (after the preprocessing, of course)? Note that there is no fixed time separation between the frames in that strategy.

Do you have another recommended strategy?

Thanks!