linjieli222/HERO

when will you release the code to process the video data?

youngfly11 opened this issue · 3 comments

when will you release the code to process the video data?

Thanks for your interest. We plan to release feature extraction code but cannot guarantee a timeline.

If you urgently need to extract video features in the same format as HERO, you can build your own feature extraction pipeline from the following repos:

  1. SlowFast: we use the pretrained SLOWFAST_8x8_R50 model.
  2. Image-level features from ResNet-152 following Howto100M.

Thanks,
Linjie

Hi, linjieli;

Thanks for your reply! I have some questions:

  • Each video clip is processed by sampling frames at a fixed rate. I found that the frame rate is 1.5 f/s in the code, am I right? If so, for a 60 s video clip we just sample 40 frames uniformly from that clip.
  • But I found a number of video clips whose frame count exceeds 40 (maybe 60, 80 or higher). Why?
  • Do you feed the 40 frames (assuming the video clip is 60 s) directly into the SlowFast model to get their 3D features? In SLOWFAST_8x8_R50 the frame sample rates are 8 and 1 respectively, so we cannot get frame-level features because the slow path samples just 5 frames (40/8 = 5). Did you change the sample rate of the slow path to 1, or do you use another sampling strategy to get the 40 frame features from the SlowFast network?

Please find the answers to your questions below:

  1. As mentioned in Appendix A.5 of our paper, we extract video features at a fixed frame rate (TV: 2/3 frame per second, HowTo100M: 1/2 frame per second). For downstream tasks, you can check vfeat_interval in each config to get the corresponding frame rate (frame_rate = 1/vfeat_interval). For example:

    "vfeat_interval": 1.5,

  2. As mentioned in Section 4.1 of our paper, we only cut the HowTo100M videos into 60-second clips. All other videos are kept at their original length. For example, if a TV video is 90 seconds long, then you will get a 3D/2D video feature of length 60 (90 / 1.5 = 60).

  3. We use the original fps in SlowFast to get the 3D video feature. Note that a frame rate of, for example, 2/3 frame per second means that we get one frame feature every 1.5 seconds. At a high level, a 1.5-second video clip is fed into SlowFast to get one feature vector, and we repeat this process over consecutive 1.5-second clips to get the features for the whole video.
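
To make the arithmetic in the three answers concrete, here is a minimal Python sketch. It is not the HERO extraction code: the function names are hypothetical, and dropping a trailing partial window is an assumption, since the thread does not say how a remainder shorter than vfeat_interval is handled.

```python
def frame_rate(vfeat_interval: float) -> float:
    """Features per second, as in answer 1: frame_rate = 1 / vfeat_interval."""
    return 1.0 / vfeat_interval


def num_features(duration_s: float, vfeat_interval: float) -> int:
    """Number of feature vectors for a video of the given length.

    Assumption: a trailing window shorter than vfeat_interval is dropped.
    """
    return int(duration_s // vfeat_interval)


def clip_windows(duration_s: float, vfeat_interval: float):
    """(start, end) times in seconds of the short clips that would each be
    fed to the 3D model to produce one feature vector (answer 3)."""
    n = num_features(duration_s, vfeat_interval)
    return [(i * vfeat_interval, (i + 1) * vfeat_interval) for i in range(n)]


if __name__ == "__main__":
    print(frame_rate(1.5))           # about 2/3 feature per second
    print(num_features(60, 1.5))     # 60 s clip at a 1.5 s interval
    print(num_features(90, 1.5))     # 90 s TV video at a 1.5 s interval
    print(clip_windows(4.5, 1.5))    # consecutive 1.5 s windows
```

Under these assumptions, a 60-second clip gives 40 features and a 90-second TV video gives 60, which is why feature lengths above 40 appear for videos that were not cut into 60-second clips.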

Thanks,
Linjie