Official dataset for "Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward" (ACCV 2022) [ACCV'22] [arXiv]
To train and evaluate models for the ad video editing task, we collected more than 1,000 ad videos from advertisers to form the Ads-1k dataset: 942 ad videos for training and 99 for evaluation. The annotation methods of the training set and the test set differ somewhat. Instead of preparing a single ground truth for each video, we annotate each video with multiple labels, as described in the supplementary material.
Dataset statistics.

|              | Avg. #segments | Avg. segment duration (s) | Avg. video duration (s) | Avg. #labels |
| ------------ | -------------- | ------------------------- | ----------------------- | ------------ |
| Training Set | 13.90          | 2.77                      | 34.60                   | 30.18        |
| Test Set     | 18.81          | 1.88                      | 34.21                   | 35.77        |
| Overall      | 14.37          | 2.68                      | 35.17                   | 30.71        |
In addition, we count the annotated segment pairs and their proportions: there are 6,988 coherent, 9,551 incoherent, and 2,971 uncertain pairs, accounting for 36%, 49%, and 15%, respectively.
Note: The ads were collected from Chinese advertisers of various sizes, so most of the videos are in Chinese.
Evaluation metrics.

An assembled result is evaluated with three scores: an importance score computed over the set of selected segments A, a coherence score given the target duration, and an overall score. The overall score reflects the ability to trade off importance, coherence, and total duration.
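The exact definitions are given in the paper; the following is only a sketch of how the three scores could fit together, assuming the importance score is the mean per-segment importance, the coherence score is the mean pairwise coherence over adjacent selected segments, and the overall score averages the two under a duration gate (imp, coh, D, T, and the tolerance delta are our notation, not taken from this README):

```latex
% Sketch only; see the ACCV 2022 paper for the authoritative definitions.
% imp(a): per-segment importance from the labels; coh(a_i, a_{i+1}):
% pairwise coherence of adjacent selected segments; D(A): total duration
% of the selection; T: target duration; delta: an assumed tolerance.
\mathrm{Imp}(\mathcal{A}) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \mathrm{imp}(a)
\qquad
\mathrm{Coh}(\mathcal{A}) = \frac{1}{|\mathcal{A}| - 1} \sum_{i=1}^{|\mathcal{A}| - 1} \mathrm{coh}(a_i, a_{i+1})
\qquad
\mathrm{Score}(\mathcal{A}) = \frac{\mathrm{Imp}(\mathcal{A}) + \mathrm{Coh}(\mathcal{A})}{2} \cdot \mathbf{1}\big[\, |D(\mathcal{A}) - T| \le \delta \,\big]
```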
We provide the following dataset files under the `data` directory:

- `coh_anno_test.json`: The coherence annotations for the 99 videos in the test set.
- `data_info.json`: The information of the 942 ad videos for training.
- `seg_labels_test.json`: The segments with narrative-technique labels for each video in the test set.
- `test_info.json`: The duration information of the 99 ad videos for testing.
- `bert_feats_test.pkl`: The features of the text content (subtitles) extracted by BERT.
- `swin_feats_test.pkl`: The features of the visual information (frames) extracted by Swin-Transformer (Large) from the test videos.
- `vggish_feats_test.pkl`: The features of the audio extracted by VGGish from the test videos.
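The internal schema of these JSON files is not documented in this README, so a quick way to get oriented is to load one and print a sample entry; a minimal sketch (paths assume the files sit under `data/`):

```python
import json

# Load two of the test-set annotation files from the data directory.
with open("data/coh_anno_test.json") as f:
    coh_anno = json.load(f)
with open("data/test_info.json") as f:
    test_info = json.load(f)

# The field layout is an unknown here: print one entry to see how
# videos and segment pairs are keyed before writing any parsing code.
print(type(coh_anno), len(coh_anno))
first_key = next(iter(coh_anno)) if isinstance(coh_anno, dict) else 0
print(first_key, coh_anno[first_key])
```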
We also provide the following pre-extracted segment-level features of the training data, which can be downloaded from [Google Drive] or [Baidu Netdisk (extraction code: 8gjb)]:

- `bert_feats_train.pkl`: The features of the text content (subtitles) extracted by BERT from the training videos.
- `swin_feats_train.pkl`: The features of the visual information (frames) extracted by Swin-Transformer (Large) from the training videos.
- `vggish_feats_train.pkl`: The features of the audio extracted by VGGish from the training videos.
- `ppl_maps.pkl`: The PPL maps of the training data.
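A minimal loading sketch for the feature files, assuming each `.pkl` is a Python dict mapping video IDs to per-segment feature arrays (this layout is an assumption; inspect the loaded object to confirm it):

```python
import pickle

# Load the pre-extracted segment-level features. The path assumes the
# downloaded files are placed under data/; adjust as needed.
with open("data/swin_feats_train.pkl", "rb") as f:
    swin_feats = pickle.load(f)

# Assumption: a dict of {video_id: array of shape (#segments, feat_dim)}.
# Print one entry to verify before building a data loader around it.
vid = next(iter(swin_feats))
print(vid, getattr(swin_feats[vid], "shape", type(swin_feats[vid])))
```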
Under the `scripts` directory, we include:

- `eval.py`: The evaluation script. Run `test.py` to use it.
- `load_ads1k.py`: The data loader for the Ads-1k dataset.
- `test.py`: Run this file to evaluate your results. Replace the ndarray `your_results` on line 5 with your output:

```python
...
infer = Eval()
your_results = ...  # replace with your results
given_times = [10, 15]
...
```
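After filling in `your_results`, the evaluation can be run from the repository root with something like `python scripts/test.py` (assuming your checkout keeps the `scripts` layout above; adjust the path and imports otherwise).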
We thank Southern University of Science and Technology and Tencent for supporting this project. This work was supported by the National Natural Science Foundation of China under Grants No. 61972188 and 62122035.
If you find our work helpful, please feel free to cite it:
```bibtex
@InProceedings{Tang_2022_ACCV,
    author    = {Tang, Yunlong and Xu, Siting and Wang, Teng and Lin, Qin and Lu, Qinglin and Zheng, Feng},
    title     = {Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {3519-3535}
}
```