This repo contains download instructions for the OpenViDial dataset introduced in "OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts", along with the code to reproduce the results in the paper (see the Baselines section).
When humans converse, what a speaker says next depends significantly on what they see. OpenViDial is a large-scale multi-modal dialogue dataset built to model this. The dialogue turns and visual contexts are extracted from movies and TV series, and each dialogue turn is paired with the visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.
The following are two short conversations where visual contexts are crucial.
Attribute | Value |
---|---|
Number of turns | 1.1M |
Number of images | 1.1M |
Vocab size before BPE | 70K |
Vocab size after BPE | 30K |
Average length of each episode | 14 |
Average length of each turn | 7.6 |
***** New February 27th, 2021: New split of Dataset *****
We have uploaded a new version (1.1) of the dataset, which adopts the train/valid/test split used in our paper (1M, 50K, 50K); the older split was (900K, 100K, 100K). The download URLs have been updated accordingly. Note: the CNN/RCNN features for the valid/test sets are currently incorrect, and we will update them soon.
The main folder `origin_dir` contains the training/valid/test sets, each of which is made up of the following files:
├──origin_dir
      └── train.dialogue.jsonl // each line is an episode of dialogue, which is a list of IDs.
      └── train.origin.txt // each line corresponds to a dialogue text utterance, with the ID being its line number (starting with 0).
      └── train_images // contains the images (visual contexts) in which the text utterances take place, with the ID being the image filename (0, 1, 2, etc.)
            └── 0.jpg
            └── 1.jpg
            └── ...
      └── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
      └── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
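As a rough illustration of how these files fit together, the following Python sketch loads one split and pairs each utterance ID with its text and image path. It assumes that each line of `*.dialogue.jsonl` is a JSON list of integer IDs, as described above. The helper name `load_split` is hypothetical and is not part of this repo.

```python
import json
import os


def load_split(origin_dir, split="train"):
    """Return a list of episodes; each episode is a list of (text, image_path) pairs.

    Assumes the layout described above: line i of {split}.origin.txt is the
    utterance with ID i, and its image is stored as {split}_images/{i}.jpg.
    """
    text_path = os.path.join(origin_dir, f"{split}.origin.txt")
    with open(text_path, encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]

    image_dir = os.path.join(origin_dir, f"{split}_images")
    episodes = []
    dialogue_path = os.path.join(origin_dir, f"{split}.dialogue.jsonl")
    with open(dialogue_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, if any
            ids = json.loads(line)  # assumed: a JSON list of utterance IDs
            episodes.append(
                [(utterances[i], os.path.join(image_dir, f"{i}.jpg")) for i in ids]
            )
    return episodes
```

Calling `load_split("origin_dir", "valid")` would return the validation episodes in the same form. Image files are not opened here, so the sketch also works on a text-only download.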
If you'd like to take a quick look at a sample of the dataset instead of downloading the full dataset, we provide a data sample here
Data download:
- Download `[train|valid|test].origin.txt` and `[train|valid|test].dialogue.jsonl` here
- Download `test_images` (~ 20G) here
- Download `valid_images` (~ 20G) here
- Download train_images: Since train_images is too big (~ 170G), we split it into 12 zip files. Download the separate files `zip_train` here. Then download and run `cat.sh` here to include all files in the same directory.
- Move all files to `origin_dir`.
We proposed three models for this dataset. Please refer to the paper for details:
- Model #1 - NoVisual: uses only the dialogue texts, without visual information
- Model #2 - CoarseVisual: uses the texts and a ResNet50 pretrained on ImageNet to compute a 1000-d feature for each image
- Model #3 - FineVisual: uses the texts and a Faster R-CNN pretrained on Visual Genome to compute K 2048-d object features for each image
Faster R-CNN is an object detection framework. A detection sample and the attention over objects during text decoding are shown below.
- python >= 3.6
pip install -r requirements.txt
`preprocessed_data_dir` is a directory that contains all the preprocessed files (text, image feature mmap, offsets, etc.) generated from `origin_dir`; these files are used to train the models. The directory structure is shown below.
Note: every `train*` file or directory should have a `valid` and a `test` counterpart; we omit them below for simplicity.
├──preprocessed_data_dir
      └── train.features.mmap // numpy mmap array file of shape [num_sents, 1000], each row is a 1000-d ResNet-50 feature
      └── train.objects.mmap // numpy mmap array file of shape [num_sents, 20, 2048], Faster R-CNN object feature file, each row contains 20 object features, each of which is 2048-d
      └── train.objects_mask.mmap // numpy mmap array file of shape [num_sents, 20], Faster R-CNN mask file, each row contains 20 object masks, 1 for valid, 0 for masked
      └── train.offsets.npy // numpy array file of shape [num_episodes], each item is the offset of one episode
      └── train.sent_num.npy // numpy array file of shape [num_episodes], each item is the number of sentences in one episode
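The sketch below shows one way these files might be read back with NumPy. It rests on several assumptions that the listing above does not state: the `.mmap` files are raw arrays of `float32`, each entry of `*.offsets.npy` is the index of the first sentence of that episode, and `num_sents` equals the sum of `*.sent_num.npy`. The helper names `open_mmap` and `episode_features` are hypothetical.

```python
import os

import numpy as np


def open_mmap(path, shape, dtype=np.float32):
    """Open a raw memory-mapped array read-only, after checking the file size.

    The size check guards against a wrong dtype or shape assumption.
    """
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(
            f"{path}: expected {expected} bytes for shape {shape}, found {actual}"
        )
    return np.memmap(path, dtype=dtype, mode="r", shape=tuple(shape))


def episode_features(data_dir, episode, split="train", dtype=np.float32):
    """Return the [num_turns, 1000] ResNet-50 features of one episode."""
    sent_num = np.load(os.path.join(data_dir, f"{split}.sent_num.npy"))
    offsets = np.load(os.path.join(data_dir, f"{split}.offsets.npy"))
    num_sents = int(sent_num.sum())  # assumed: total number of sentences
    feats = open_mmap(
        os.path.join(data_dir, f"{split}.features.mmap"), (num_sents, 1000), dtype
    )
    start = int(offsets[episode])  # assumed: index of the episode's first sentence
    return feats[start:start + int(sent_num[episode])]
```

The object features could be opened the same way with shape `(num_sents, 20, 2048)`. If the size check fails, the dtype assumption is the first thing to revisit.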
We use Moses Tokenizer to tokenize texts and generate offsets arrays:
bash ./scripts/preprocess_video_data.sh
This is followed by byte-pair encoding (BPE) and `fairseq-preprocess` binarization:
bash ./scripts/preprocess_text_data.sh
Note: You need to change `DATA_DIR`, `ORIGIN_DIR`, and `OUTPUT_DIR` to your own paths.
Preprocessed ResNet50 features (`*.features.mmap`) (~4G) can be downloaded from here; then move them under `preprocessed_data_dir/`.
Preprocessed Faster R-CNN object features (`*objects.mmap`, `*objects_mask.mmap`) (~160G) can be downloaded from here; then move them under `preprocessed_data_dir/`.
Since `train.objects.mmap` is too large (100G+), we split it into many smaller pieces named `train.objects.mmap.split*`, and you need one more step to merge them: `cat train.objects.mmap.split* > train.objects.mmap`
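On systems without `cat`, the same merge can be sketched in Python. This assumes the split suffixes sort lexicographically into the correct order, which is also what the shell glob above relies on. The function name `merge_splits` is hypothetical.

```python
import glob
import shutil


def merge_splits(prefix, output_path):
    """Concatenate all files matching prefix* into output_path, in sorted order."""
    parts = sorted(glob.glob(glob.escape(prefix) + "*"))
    if not parts:
        raise FileNotFoundError(f"no files match {prefix}*")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so 100G+ of data never sits in memory at once.
                shutil.copyfileobj(src, out)
    return parts
```

For example, `merge_splits("train.objects.mmap.split", "train.objects.mmap")` run inside the download directory would mirror the `cat` command.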
If you want to extract features on your own, or would like to know the details of how the visual features are extracted, see video_dialogue_model/extract_features/extract_features.md
`bash scripts/reproduce_baselines/text_only.sh` will train and evaluate NoVisual. Remember to change `MODEL_DIR` and `DATA_DIR` for your setup.
Note: `fairseq` may use all GPUs on your machine, in which case the actual batch size is multiplied by the number of GPUs.
Therefore, if you use multiple GPUs, the batch size should be divided by the number of GPUs.
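The adjustment is plain arithmetic, sketched below. The function name and the numbers in the usage note are illustrative only and are not taken from the scripts in this repo.

```python
def per_gpu_batch_size(effective_batch_size, num_gpus):
    """Return the per-GPU batch size that keeps the overall batch size unchanged."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if effective_batch_size % num_gpus != 0:
        raise ValueError("effective batch size is not divisible by the number of GPUs")
    return effective_batch_size // num_gpus
```

For instance, to keep an overall batch size of 64 on 4 GPUs, `per_gpu_batch_size(64, 4)` gives 16 per GPU. Alternatively, restricting the visible devices with the standard `CUDA_VISIBLE_DEVICES` environment variable avoids the adjustment altogether.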
`bash scripts/reproduce_baselines/text_and_img_feature.sh` will train and evaluate CoarseVisual. Remember to change `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.
`bash scripts/reproduce_baselines/text_and_img_objects.sh` will train and evaluate FineVisual. Remember to change `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.
- Get length/diversity/stopword% statistics of system output: `train/stats.py`
Model | BLEU-1 | BLEU-2 | BLEU-4 | Stopword% | Dis-1¹ | Dis-2 | Dis-3 | Dis-4 |
---|---|---|---|---|---|---|---|---|
1-NV | 14.01 | 3.98 | 1.07 | 58.1% | 0.0091 | 0.0355 | 0.0682 | 0.1018 |
2-CV | 14.58 | 4.35 | 1.14 | 54.2% | 0.0108 | 0.0448 | 0.0915 | 0.1465 |
3-FV | 15.61 | 4.71 | 1.22 | 52.9% | 0.0118 | 0.0502 | 0.1082 | 0.1778 |
1: Dis-x values are multiplied by 10 for ease of presentation.
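
The Dis-n columns are diversity scores. Diversity of this kind is commonly measured as distinct-n, the ratio of unique n-grams to total n-grams in the system output. The sketch below is a minimal version under that assumption, using whitespace tokenization and corpus-level counts; `train/stats.py` may tokenize, aggregate, or scale differently.

```python
from collections import Counter


def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over all sentences."""
    ngrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumed: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0
```

A higher value means less repetition across the generated responses.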