when will you release the code to process the video data?
youngfly11 opened this issue · 3 comments
Thanks for your interest. We plan to release feature extraction code but cannot guarantee a timeline.
If you urgently need to extract video features in the same format as HERO, you can refer to the following repos to build your own feature extraction pipeline:
- SlowFast: we use the pretrained SLOWFAST_8x8_R50 model.
- Image-level features from ResNet-152, following HowTo100M.
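For the 2D stream, a minimal sketch of HowTo100M-style ResNet-152 frame features could look like the following (standard ImageNet preprocessing is assumed; the official HowTo100M extractor may differ in resizing and sampling details):

```python
# Hedged sketch: 2048-d ResNet-152 frame features, HowTo100M-style.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained ResNet-152 with the classifier removed, so the model
# returns the 2048-d pooled feature per frame.
resnet = models.resnet152(pretrained=True)
resnet.fc = nn.Identity()
resnet = resnet.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images sampled at the desired rate -> (N, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    return resnet(batch).cpu()

# Dummy frames just to show the shapes; replace with frames decoded from your video.
dummy = [Image.new("RGB", (640, 360)) for _ in range(4)]
print(frame_features(dummy).shape)  # -> torch.Size([4, 2048])
```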
Thanks,
Linjie
Hi, linjieli,
Thanks for your reply! I have some questions:
- The video clip is processed by sampling visual frames at a fixed rate. I found in the code that one frame is sampled every 1.5 s (i.e., 2/3 frames per second), am I right? That means if a video clip is 60 s long, we sample 40 frames uniformly from that clip.
- But I found that a number of video clips have more than 40 frames (maybe 60, 80, or higher). Why?
- Do you feed the 40 frames (assuming the video clip is 60 s) into the SlowFast model directly to get their 3D features? In SLOWFAST_8x8_R50, the frame sampling rates of the slow and fast pathways are 8 and 1, respectively, so we cannot get frame-level features: the slow pathway would only sample 5 frames (40/8 = 5). Do you modify the sampling rate of the slow pathway to 1, or use some other sampling strategy to get the 40 frame-level features from the SlowFast network?
Please find the answers to your questions below:
- As mentioned in Appendix A.5 of our paper, we extract video features at a fixed frame rate (TV: 2/3 frame per second, HowTo100M: 1/2 frame per second). For downstream tasks, you can check vfeat_interval in each config to get the corresponding frame rate (frame_rate = 1/vfeat_interval). For example, vfeat_interval = 1.5 corresponds to the 2/3 frame per second used for the TV datasets.
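A couple of lines like the following recover the frame rate from a downstream config (the config path and JSON layout are assumptions, not the exact HERO file names):

```python
import json

# Hypothetical path: substitute whichever downstream config you train with,
# and adjust the lookup if vfeat_interval is nested differently.
with open("config/train-tvr-8gpu.json") as f:
    config = json.load(f)

vfeat_interval = config["vfeat_interval"]  # e.g. 1.5 for the TV datasets
frame_rate = 1.0 / vfeat_interval          # e.g. 2/3 frame per second
print(f"one feature every {vfeat_interval}s -> {frame_rate:.3f} frames per second")
```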
- As mentioned in Section 4.1 of our paper, we only cut the HowTo100M videos into 60-second clips. All other videos are kept at their original length. For example, if a TV video is 90 seconds long, you will get a 3D/2D video feature of length 60 (90 s × 2/3 frame per second).
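As a quick sanity check on the expected feature lengths (assuming the count is simply duration divided by vfeat_interval; boundary handling in the actual extraction code may differ by one):

```python
# Rough expected feature length = floor(duration / vfeat_interval);
# the real extractor may handle the last partial window differently.
def expected_feat_len(duration_sec, vfeat_interval):
    return int(duration_sec / vfeat_interval)

print(expected_feat_len(90, 1.5))  # 90 s TV video at 2/3 frame per second -> 60
print(expected_feat_len(60, 2.0))  # 60 s HowTo100M clip at 1/2 frame per second -> 30
```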
- We use the original fps in SlowFast to get the 3D video features. Note that the frame rate mentioned above, e.g. 2/3 frame per second, means that we get one frame feature every 1.5 seconds. At a high level, a 1.5-second video segment is fed into SlowFast to get one feature vector, and we repeat this process over the whole video to get its feature sequence.
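To make the sliding-window idea concrete, here is a minimal sketch of clip-level SlowFast features over 1.5-second windows. It is not the released HERO pipeline: it uses the public PyTorchVideo slowfast_r50 checkpoint as a stand-in for SLOWFAST_8x8_R50, uniformly resamples each window to 32 frames, skips proper mean/std normalization, and assumes the network's last block is a classification head whose projection can be replaced to expose pooled features.

```python
# Hedged sketch: one SlowFast feature vector per 1.5-second window.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# SlowFast R50 from the PyTorchVideo hub; replacing the final projection with
# Identity makes the head return the pooled (2048 + 256)-d clip feature
# instead of Kinetics logits.
model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)
model.blocks[-1].proj = nn.Identity()
model = model.eval().to(device)

ALPHA = 4         # slow pathway keeps every 4th frame of the fast pathway
NUM_FRAMES = 32   # frames fed to the fast pathway per window
WINDOW_SEC = 1.5  # one feature vector every 1.5 seconds

def pack_pathways(clip):
    """clip: (3, NUM_FRAMES, 256, 256) -> [slow, fast] inputs with a batch dim."""
    fast = clip.unsqueeze(0)
    slow_idx = torch.linspace(0, clip.shape[1] - 1, clip.shape[1] // ALPHA).long()
    slow = clip[:, slow_idx].unsqueeze(0)
    return [slow.to(device), fast.to(device)]

@torch.no_grad()
def extract_features(video, fps):
    """video: (T, H, W, 3) uint8 frames decoded at the native fps -> (num_windows, 2304)."""
    win = int(round(WINDOW_SEC * fps))
    feats = []
    for start in range(0, video.shape[0] - win + 1, win):
        clip = video[start:start + win].permute(3, 0, 1, 2).float() / 255.0  # (3, T, H, W)
        # Uniformly resample the window to NUM_FRAMES frames and resize to 256x256
        # (mean/std normalization omitted for brevity).
        clip = F.interpolate(clip.unsqueeze(0), size=(NUM_FRAMES, 256, 256),
                             mode="trilinear", align_corners=False).squeeze(0)
        feats.append(model(pack_pathways(clip)).squeeze(0).cpu())
    return torch.stack(feats)

# Dummy 6-second "video" at 24 fps, just to show the expected shapes.
dummy = torch.randint(0, 256, (6 * 24, 224, 224, 3), dtype=torch.uint8)
print(extract_features(dummy, fps=24).shape)  # -> torch.Size([4, 2304])
```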
Thanks,
Linjie