We present a 45k multi-modal dialogue dataset and the method used to create it. The dataset is intended for training and evaluating multi-modal dialogue systems. Each multi-modal dialogue instance consists of a textual response and a dialogue context comprising multiple text utterances and an image. The details of our creation method can be found in the paper, which was published at ACL 2021.
The dataset can be found here.
There are three files at the above link. Each zip (or egg) file contains compressed JSON- and npy-format files for training and evaluation. Each line in a JSON file is a JSON object with the following keys:
Key | Description |
---|---|
dialog | Dialogue context and response |
replaced_idx | Index (turn) of the dialogue context utterance to be replaced |
img_idx | Index of image tensor to replace in the npy file |
score | Similarity score between the replaced utterance and the image |
dialog_dataset | Source dialogue dataset |
dialog_file | Used file name in the source dialogue dataset |
img_dataset | Source image dataset |
img_file | Used file name in the source image dataset |
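For illustration, the snippet below shows how one instance could be assembled from a JSON file and its companion npy file. It assumes `dialog` is a list of utterance strings, and the file names `train.json` and `train.npy` are placeholders, not necessarily the names in the release:

```python
import json
import numpy as np

# Placeholder file names; substitute the actual files from the release.
features = np.load("train.npy")            # pre-extracted image feature tensors

with open("train.json") as f:
    instance = json.loads(f.readline())    # one JSON object per line

dialog = instance["dialog"]                # text utterances (context + response)
turn = instance["replaced_idx"]            # context turn to be replaced by the image
image = features[instance["img_idx"]]      # feature tensor of the replacing image

# Build the multi-modal context by substituting the image for that utterance;
# the remaining turns stay textual.
context = dialog[:turn] + [image] + dialog[turn + 1:]
print(instance["score"], instance["dialog_dataset"], instance["img_dataset"])
```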
Our multi-modal dialogue dataset is constructed from three source dialogue datasets and two image captioning datasets. We provide download and paper links for all of them.
Source Dataset | Paper | Type | Download link |
---|---|---|---|
DailyDialog | paper | text | http://yanran.li/dailydialog.html |
Persona-Chat | paper | text | https://parl.ai/about/ |
EmpatheticDialogues | paper | text | https://github.com/facebookresearch/EmpatheticDialogues |
MS-COCO (2014) | paper | image | https://cocodataset.org/#download |
Flickr 30k | paper | image | https://www.kaggle.com/hsankesara/flickr-image-dataset |
Before running our code, you have to create an Anaconda environment using the given environment.yaml file:
conda env create --file environment.yaml
We provide two sets of source code: similarity-calculation and dialogue-prediction.
With the similarity-calculation code, you can calculate similarities between the source dialogue datasets and image datasets using pre-trained VSRN weights. With the dialogue-prediction code, you can run the current and next dialogue prediction tasks on our multi-modal dialogue dataset, as in the paper.
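Conceptually, the similarity step scores every (utterance, image) pair in a joint embedding space, as VSRN does, and keeps the best-matching image per utterance. The sketch below is a minimal illustration using cosine similarity over pre-computed embeddings; it is not the actual VSRN code, and the toy arrays merely stand in for encoder outputs:

```python
import numpy as np

def cosine_similarity_matrix(text_emb: np.ndarray, img_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between utterance and image embeddings.

    text_emb: (num_utterances, dim), e.g. outputs of a text encoder
    img_emb:  (num_images, dim), e.g. outputs of an image encoder
    """
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    return text_emb @ img_emb.T                  # (num_utterances, num_images)

# Toy embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 1024))
img_emb = rng.normal(size=(10, 1024))

sims = cosine_similarity_matrix(text_emb, img_emb)
best_img = sims.argmax(axis=1)     # best-matching image per utterance (cf. img_idx)
best_score = sims.max(axis=1)      # its similarity (cf. the score field above)
```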
To run our similarity-calculation code directly, you have to download all source dialogue and image datasets as well as the pre-trained VSRN weights. In particular, for the image datasets we follow VSRN in using pre-processed image features to which bottom-up attention has been applied. You can find the download links for all image features and the pre-trained VSRN weights here.
After downloading all the necessary datasets and weights into the dataset directory, run calculating_similarity.py:
python similarity-calculation/calculating_similarity.py
To run our current and next turn prediction tasks, you have to download our multi-modal dialogue dataset into the dataset directory, then run predicting_dialogue.py.
For the current turn prediction task:
python dialogue-prediction/predicting_dialogue.py --model_name $MODEL_NAME --gpu_id $GPU_ID --task current
For the next turn prediction task:
python dialogue-prediction/predicting_dialogue.py --model_name $MODEL_NAME --gpu_id $GPU_ID --task next
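If the task is set up retrieval-style (ranking candidate text responses given the multi-modal context), the scoring step might look like the minimal sketch below. The context encoding and candidate pool are random placeholders, not the models actually used by predicting_dialogue.py:

```python
import numpy as np

def rank_candidates(context_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    """Rank candidate responses by dot-product score against the context encoding."""
    scores = candidate_vecs @ context_vec      # (num_candidates,)
    return np.argsort(-scores)                 # indices, best candidate first

rng = np.random.default_rng(1)
context_vec = rng.normal(size=256)             # placeholder context encoding
candidate_vecs = rng.normal(size=(100, 256))   # placeholder candidate encodings

order = rank_candidates(context_vec, candidate_vecs)
hit_at_1 = (order[0] == 0)   # assuming index 0 holds the ground-truth response
```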
If you find the data useful and use it for your work, please consider citing the following:
@inproceedings{lee-etal-2021-constructing,
title = "Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images",
author = "Lee, Nyoungwoo and
Shin, Suwon and
Choo, Jaegul and
Choi, Ho-Jin and
Myaeng, Sung-Hyon",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.113",
pages = "897--906",
}