v-iashin/MDVC

I3D Convolutions Script + Input Data

amanchadha opened this issue · 6 comments

Hi Vladimir,

I noticed in the MDVC codebase that you load the I3D CONV features from "./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5".

Some questions:
(i) Do you have a script that generates these features from raw data?
(ii) What input data did you run the I3D model over? I ask because your features filename suggests they were extracted at 25 FPS, which implies that you manually sampled the videos in the ActivityNet Captions dataset at 25 FPS. Unfortunately, the official ActivityNet website only offers frames sampled at 5 FPS (http://activity-net.org/challenges/2020/tasks/anet_captioning.html).
(iii) Do you have a link for the sampled frames?

Thanks!
Aman

Hi,

Sorry for the late reply. I decided to write a little library dedicated to extracting features from videos. It is mainly based on the script I wrote for MDVC but is more transparent and easier to use 🙂. It took a couple of days to wrap it up. Check it out: https://github.com/v-iashin/i3d_features.

Here are the answers to your questions:
(i) Yes. Check out v-iashin/video_features@4fa02bd5c. Please see the notes below.

(ii) Yep, exactly! We downloaded the available videos using the official script (activitynet/ActivityNet@7185a39) and ran the feature extraction script over the raw videos.

(iii) I still have the videos. I can think about a way to share them in case you would REALLY like to have them 🙂.

The notes on (i):
I was using an implementation of PWC-Net from sniklaus/pytorch-pwc@f613890 with a couple of tweaks. Yesterday, I checked and noticed that the model weights have changed (hashes: 91006e6cd54dc052b00660239f5b1814 -> 08330ee36a9aa0d16f198f8927352502). I am not sure what caused the change; I haven't contacted the author. I tried both, and there is a small difference between the extracted values. Therefore, I provide both the model I used for MDVC (network-default.pytorch) and the weights of the latest model from sniklaus/pytorch-pwc (pwc_net.pt). Make sure to use the correct one:

python main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --pwc_path ./models/i3d/checkpoints/network-default.pytorch --video_paths ./sample/v_ZNVhz7ctTq0.mp4
# this outputs the exact values as in "sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5" for this video
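To double-check which PWC-Net checkpoint you ended up with, you can hash the file before extracting. A minimal sketch, assuming the hashes above are MD5 (they have the 32-hex-character MD5 format):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so large checkpoints need not fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

# The two hashes mentioned above in this thread.
known = {
    '91006e6cd54dc052b00660239f5b1814': 'network-default.pytorch (used for MDVC)',
    '08330ee36a9aa0d16f198f8927352502': 'pwc_net.pt (current sniklaus/pytorch-pwc)',
}

digest = md5sum('./models/i3d/checkpoints/network-default.pytorch')
print(digest, '->', known.get(digest, 'unknown checkpoint'))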

Another note concerns how the I3D features were extracted for ActivityNet. Specifically, please see the i3d.utils.utils.form_iter_list() function. It has a phase argument, which was set according to the epoch phase (train or val_1/val_2) and controls how the last frames of a video are used to form the video's final feature stack. Please make sure to tweak the feature extraction code a bit; I think it should be pretty straightforward. I just wanted to keep the library dataset-independent and decided to work around it this way.
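For illustration, here is a minimal sketch of the grouping idea; the actual signature and tail handling of form_iter_list() may differ, and the phase-specific branches below are assumptions, not the repo's implementation:

def form_stacks(frame_ids, stack_size=24, step_size=24, phase='train'):
    # Group a flat list of frame indices into stacks of `stack_size`,
    # starting a new stack every `step_size` frames.
    stacks = [
        frame_ids[start:start + stack_size]
        for start in range(0, len(frame_ids) - stack_size + 1, step_size)
    ]
    # The leftover tail is where `phase` matters; both branches below are
    # assumed examples of phase-dependent handling.
    consumed = (len(stacks) - 1) * step_size + stack_size if stacks else 0
    if consumed < len(frame_ids):
        if phase == 'train':
            pass  # e.g. drop the incomplete tail during training
        else:  # val_1 / val_2
            # e.g. keep the last `stack_size` frames (overlapping the
            # previous stack) so the end of the video is not lost
            stacks.append(frame_ids[-stack_size:])
    return stacks

# 100 frames at stack/step size 24 -> 4 full stacks plus a 4-frame tail:
print(len(form_stacks(list(range(100)), phase='val_1')))  # 5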

Hi Vladimir,

(i) Thanks for putting together the I3D repository. It clarified the process you followed to obtain the I3D features. Indeed, very helpful.

(ii) I see. When you ran the feature extraction script on the videos, did you store the sampled frames (using --keep_frames)? Sadly, even the official 5 FPS link for the videos (http://activity-net.org/challenges/2020/tasks/anet_captioning.html) isn't accessible. I am currently blocked from making any progress in my work, so gaining access is necessary. If you have the sampled frames and can upload them, I would really appreciate it. If you need a server to upload to, I can arrange one for you.

Thanks again,
Aman

(ii) I am afraid I cannot provide you with the frames, as we didn't store them at all. The videos themselves are 200+ GB, but the frames were 1+ TB. We didn't have a large, fast disk (SSD or NVMe) to read them from on the fly, so we decided to calculate the features and remove the frames right away, just as the repo does now. I can upload the videos, and you can extract the features along with the frames yourself. Let me know if you need them. On this dataset, extraction takes around a week with a stack size of 24 and a step size of 24 on three 2080Ti GPUs.
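As a rough back-of-envelope of why the frames dwarf the videos (every number in this sketch is an assumed ballpark, not a figure from this thread):

n_videos = 20_000       # approximate ActivityNet size (assumption)
avg_duration_s = 120    # assumed average video length, in seconds
fps = 25                # the extraction rate used here
jpeg_kb = 20            # assumed size of one stored JPEG frame, in KB

total_tb = n_videos * avg_duration_s * fps * jpeg_kb / 1024**3  # KB -> TB
print(f'~{total_tb:.1f} TB of frames')  # ~1.1 TB, consistent with "1+ TB"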

We have the resources; I will organize the download link. Don't worry.

OK, it would be much appreciated if you could share the videos. Thanks!

Please contact us via e-mail.

Thank you!