when will you release the code to process the video data?
youngfly11 opened this issue · 3 comments
Thanks for your interest. We plan to release feature extraction code but cannot guarantee a timeline.
If you urgently need to extract video features in the same format as HERO, you can refer to the following repos to build your own feature extraction pipeline:
- SlowFast: we use the pretrained SLOWFAST_8x8_R50 model.
- Image-level features from ResNet-152, following HowTo100M.
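For the 2D stream, a minimal sketch of HowTo100M-style ResNet-152 frame features could look like the following (standard ImageNet preprocessing is assumed; the official HowTo100M extractor may differ in resizing and sampling details):

```python
# Hedged sketch: 2048-d ResNet-152 frame features, HowTo100M-style.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained ResNet-152 with the classifier removed, so the model
# returns the 2048-d pooled feature per frame.
resnet = models.resnet152(pretrained=True)
resnet.fc = nn.Identity()
resnet = resnet.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images sampled at the desired rate -> (N, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    return resnet(batch).cpu()

# Dummy frames just to show the shapes; replace with frames decoded from your video.
dummy = [Image.new("RGB", (640, 360)) for _ in range(4)]
print(frame_features(dummy).shape)  # -> torch.Size([4, 2048])
```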
Thanks,
Linjie
Hi, linjieli,
Thanks for your reply! I have some questions:
- The video clip is processed by sampling visual frames at a fixed rate. I found in the code that one frame is sampled every 1.5 s (i.e., 2/3 frames per second), am I right? That means if a video clip is 60 s long, we sample 40 frames uniformly from that clip.
- But I found that a number of video clips have more than 40 frames (maybe 60, 80, or higher). Why?
- Do you feed the 40 frames (assuming the video clip is 60 s) into the SlowFast model directly to get their 3D features? In SLOWFAST_8x8_R50, the frame sampling rates of the slow and fast pathways are 8 and 1, respectively, so we cannot get frame-level features: the slow pathway would only sample 5 frames (40/8 = 5). Do you modify the sampling rate of the slow pathway to 1, or use some other sampling strategy to get the 40 frame-level features from the SlowFast network?
Please find the answers to your questions below:
- As mentioned in Appendix A.5 of our paper, we extract video features at a fixed frame rate (TV: 2/3 frame per second, HowTo100M: 1/2 frame per second). For downstream tasks, you can check vfeat_interval in each config to get the corresponding frame rate (frame_rate = 1/vfeat_interval). For example, vfeat_interval = 1.5 corresponds to the 2/3 frame per second used for the TV datasets.
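A couple of lines like the following recover the frame rate from a downstream config (the config path and JSON layout are assumptions, not the exact HERO file names):

```python
import json

# Hypothetical path: substitute whichever downstream config you train with,
# and adjust the lookup if vfeat_interval is nested differently.
with open("config/train-tvr-8gpu.json") as f:
    config = json.load(f)

vfeat_interval = config["vfeat_interval"]  # e.g. 1.5 for the TV datasets
frame_rate = 1.0 / vfeat_interval          # e.g. 2/3 frame per second
print(f"one feature every {vfeat_interval}s -> {frame_rate:.3f} frames per second")
```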
- As mentioned in Section 4.1 of our paper, we only cut the HowTo100M videos into 60-second clips. All other videos are kept at their original length. For example, if a TV video is 90 seconds long, you will get a 3D/2D video feature of length 60 (90 s × 2/3 frame per second).
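As a quick sanity check on the expected feature lengths (assuming the count is simply duration divided by vfeat_interval; boundary handling in the actual extraction code may differ by one):

```python
# Rough expected feature length = floor(duration / vfeat_interval);
# the real extractor may handle the last partial window differently.
def expected_feat_len(duration_sec, vfeat_interval):
    return int(duration_sec / vfeat_interval)

print(expected_feat_len(90, 1.5))  # 90 s TV video at 2/3 frame per second -> 60
print(expected_feat_len(60, 2.0))  # 60 s HowTo100M clip at 1/2 frame per second -> 30
```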
- We use the original fps in SlowFast to get the 3D video features. Note that the frame rate mentioned above, e.g. 2/3 frame per second, means that we get one frame feature every 1.5 seconds. At a high level, a 1.5-second video segment is fed into SlowFast to get one feature vector, and we repeat this process over the whole video to get its feature sequence.
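To make the sliding-window idea concrete, here is a minimal sketch of clip-level SlowFast features over 1.5-second windows. It is not the released HERO pipeline: it uses the public PyTorchVideo slowfast_r50 checkpoint as a stand-in for SLOWFAST_8x8_R50, uniformly resamples each window to 32 frames, skips proper mean/std normalization, and assumes the network's last block is a classification head whose projection can be replaced to expose pooled features.

```python
# Hedged sketch: one SlowFast feature vector per 1.5-second window.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# SlowFast R50 from the PyTorchVideo hub; replacing the final projection with
# Identity makes the head return the pooled (2048 + 256)-d clip feature
# instead of Kinetics logits.
model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)
model.blocks[-1].proj = nn.Identity()
model = model.eval().to(device)

ALPHA = 4         # slow pathway keeps every 4th frame of the fast pathway
NUM_FRAMES = 32   # frames fed to the fast pathway per window
WINDOW_SEC = 1.5  # one feature vector every 1.5 seconds

def pack_pathways(clip):
    """clip: (3, NUM_FRAMES, 256, 256) -> [slow, fast] inputs with a batch dim."""
    fast = clip.unsqueeze(0)
    slow_idx = torch.linspace(0, clip.shape[1] - 1, clip.shape[1] // ALPHA).long()
    slow = clip[:, slow_idx].unsqueeze(0)
    return [slow.to(device), fast.to(device)]

@torch.no_grad()
def extract_features(video, fps):
    """video: (T, H, W, 3) uint8 frames decoded at the native fps -> (num_windows, 2304)."""
    win = int(round(WINDOW_SEC * fps))
    feats = []
    for start in range(0, video.shape[0] - win + 1, win):
        clip = video[start:start + win].permute(3, 0, 1, 2).float() / 255.0  # (3, T, H, W)
        # Uniformly resample the window to NUM_FRAMES frames and resize to 256x256
        # (mean/std normalization omitted for brevity).
        clip = F.interpolate(clip.unsqueeze(0), size=(NUM_FRAMES, 256, 256),
                             mode="trilinear", align_corners=False).squeeze(0)
        feats.append(model(pack_pathways(clip)).squeeze(0).cpu())
    return torch.stack(feats)

# Dummy 6-second "video" at 24 fps, just to show the expected shapes.
dummy = torch.randint(0, 256, (6 * 24, 224, 224, 3), dtype=torch.uint8)
print(extract_features(dummy, fps=24).shape)  # -> torch.Size([4, 2304])
```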
Thanks,
Linjie