Ad-hoc Video Search

We provide frame-level CNN features for the following datasets, which were used in our winning entry to the TRECVID 2018 Ad-hoc Video Search (AVS) task. Code for feature extraction is available in the video-cnn-feat project.

  1. The IACC.3 dataset, which has been the test set of the TRECVID Ad-hoc Video Search (AVS) task since 2016. The dataset contains 4,593 Internet Archive videos (144 GB, 600 hours) with Creative Commons licenses in MPEG-4/H.264 format, with durations ranging from 6.5 min to 9.5 min and a mean duration of almost 7.8 min. Automated shot boundary detection has been performed, resulting in 335,944 shots in total. From each shot we uniformly sampled frames (see the sampling sketch after this list), obtaining 3,845,221 frames in total.
  2. The MSR-VTT dataset, providing 10K web video clips and 200K natural sentences describing the visual content of the clips, i.e., 20 sentences per clip on average. From each clip we uniformly sampled frames, obtaining 305,462 frames in total.
  3. The TGIF dataset, containing 100K animated GIFs and 120K sentences describing the visual content of the GIFs. From each GIF we uniformly sampled frames, obtaining 1,045,268 frames in total.
  4. The TRECVID 2016 VTT training set, containing 200 videos (Vine URLs) and 400 sentences.
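
The uniform frame sampling mentioned above can be illustrated with a few lines of Python and OpenCV. The snippet below is a minimal sketch, not the extraction code of video-cnn-feat; the 0.5-second interval, the function name, and the example file name are assumptions made for illustration.

```python
import cv2  # OpenCV is an assumed dependency for this sketch, not part of this repository


def sample_frames_uniformly(video_path, interval_sec=0.5):
    """Yield (frame_index, frame) pairs sampled at a fixed time interval.

    interval_sec is a hypothetical parameter; the interval actually used to
    produce the released features may differ.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if the header lacks it
    step = max(int(round(fps * interval_sec)), 1)

    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index, frame
        index += 1
    cap.release()


if __name__ == "__main__":
    # Dump the sampled frames of one (hypothetical) video to JPEG files.
    for idx, frame in sample_frames_uniformly("example_video.mp4"):
        cv2.imwrite("frame_%06d.jpg" % idx, frame)
```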

In addition, we provide frame-level CNN features for the following dataset, which was used in our winning entry to the TRECVID 2018 Video-to-Text (VTT) Matching and Ranking task.

  1. The Microsoft Video Description dataset (MSVD).

Downloads

Video-level features

Frame-level features

| CNN feature | Dimensionality | Downloads |
| ----------- | -------------- | --------- |
| ResNext-101 | 2,048 | IACC.3 (27 GB), MSR-VTT (2 GB), TGIF (7 GB), MSVD (288 MB), TV2016VTT-train (42 MB) |
| ResNet-152 | 2,048 | IACC.3 (26 GB), MSR-VTT (2 GB), TGIF (7 GB), MSVD (283 MB), TV2016VTT-train (42 MB) |
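
The frame-level features can be aggregated into a video-level (or shot-level) representation by pooling over the frames of a video. Below is a minimal sketch using mean pooling with NumPy; the array is synthetic and the function name is ours, and the on-disk format of the released archives is not reproduced here.

```python
import numpy as np


def mean_pool(frame_features):
    """Aggregate a (num_frames, 2048) array of frame-level CNN features
    into a single 2,048-d video-level feature by mean pooling."""
    frame_features = np.asarray(frame_features, dtype=np.float32)
    return frame_features.mean(axis=0)


# Hypothetical example: five frames of 2,048-d ResNet-152 features.
frames = np.random.rand(5, 2048).astype(np.float32)
video_feature = mean_pool(frames)
assert video_feature.shape == (2048,)
```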

Sentences

Citations

If you find the feature data useful, please consider citing

Acknowledgments

  • We thank the TRECVID, MSR-VTT, TGIF, and MSVD teams for the datasets, the UvA MediaMill team for sharing their ResNext-101 model, and the MXNet team for sharing their ResNet-152 model.
  • This project was supported by the National Natural Science Foundation of China (No. 61672523).