simon-ging/coot-videotext

Video 100m feature extraction

mireiahernandez opened this issue · 8 comments

Hi,
Thank you for sharing your work and congrats on the paper.

I am trying to extract HowTo100M video features using the video embedding network provided by Miech et al., 2020 (https://github.com/antoine77340/S3D_HowTo100M). In the paper you mention that you sample 0.6 frames/second, but I can't figure out how to obtain the "frame features" with this network. Could you explain it in more detail?
Thank you in advance.

Hi,

So this is what's done by the S3D authors (Miech et al.) and what we are currently doing:

We extract frames at 16 FPS by first cropping off the edges to make the frame square and then resizing to 256x256px.
Then we feed a window of 32 frames at a time into the S3D model and save the resulting 512-dim vector as one "frame feature".
We move the window by a stride of 16 and take the next 32, until the video is over.
This results in about 1 FPS of frame features.

For the provided features and our paper we used 10 FPS and 224x224px, which results in about 0.6 FPS of frame features.
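To make the arithmetic above concrete, here is a minimal sketch of the sliding window. The function names are made up for illustration and are not taken from the repository. Two things are assumptions on my side: that a trailing partial window is dropped (the thread does not say how the end of the video is handled), and that the 10 FPS setup uses the same stride of 16, which would give 10 / 16 = 0.625, matching the "about 0.6 FPS" figure.

```python
def window_starts(num_frames, window=32, stride=16):
    """Start indices of every full window of `window` frames, moved by `stride`.

    Assumption: a trailing window with fewer than `window` frames is dropped.
    """
    return list(range(0, num_frames - window + 1, stride))


def feature_rate(extract_fps, stride=16):
    """Approximate number of feature vectors produced per second of video."""
    return extract_fps / stride


# A 60 second video decoded at 16 FPS has 960 frames.
starts = window_starts(60 * 16)
print(len(starts))        # number of 512-dim "frame features" for this video
print(feature_rate(16))   # features per second for the 16 FPS setup
print(feature_rate(10))   # features per second for the 10 FPS setup (assumed stride 16)
```

Each start index `s` corresponds to feeding frames `s` to `s + 31` into the S3D model and saving one 512-dim vector.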

The first approach has slightly better performance on our YouCook2 Video Retrieval experiments.

Best

Thank you for your response!

Best

Is this the same procedure as in VideoFeatureExtractor or is it different?

The model they use for extraction is the same (s3d_howto100m.pth). I don't know about the parameters/cropping, since I have never used that repository.

Ok. And how are the features stored in the H5 file? I cannot find a script that defines this either.

Given video_key as a str and data as a numpy array of shape (num_feature_frames, model_dim):

import h5py

# open the file for writing; the with-block closes it on exit
with h5py.File("my_file.h5", "w") as h5:
    # loop over videos to write ...
    # write data: one dataset per video key
    h5[video_key] = data
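For completeness, a small round-trip sketch of the layout described above: one dataset per video key, each of shape (num_feature_frames, model_dim). The helper names, the key "video_0001" and the all-zero features are placeholders for illustration only, not part of the repository.

```python
import os
import tempfile

import h5py
import numpy as np


def write_features(path, features):
    """Write a dict {video_key: array of shape (num_feature_frames, model_dim)}."""
    with h5py.File(path, "w") as h5:
        for video_key, data in features.items():
            h5[video_key] = data


def read_features(path, video_key):
    """Load the features of one video back into memory as a numpy array."""
    with h5py.File(path, "r") as h5:
        return h5[video_key][:]


# Placeholder data: 59 feature frames with 512 dimensions each.
features = {"video_0001": np.zeros((59, 512), dtype=np.float32)}
path = os.path.join(tempfile.mkdtemp(), "my_file.h5")
write_features(path, features)
loaded = read_features(path, "video_0001")
print(loaded.shape, loaded.dtype)
```

Slicing the dataset with `[:]` copies it into a numpy array, so the result stays usable after the file is closed.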

Added the feature extraction code for Howto100m (S3D) features, see the readme chapter "Running your own video dataset on the trained models".