We present a 45k multi-modal dialogue dataset and the method used to create it. The dataset is intended for training and evaluating multi-modal dialogue systems. Each multi-modal dialogue instance consists of a textual response and a dialogue context comprising multiple text utterances and an image. The details of our creation method can be found in the paper, which was published at ACL 2021.
The dataset can be found here.
There are three files at the above link. Each zip (or egg) file contains compressed JSON- and npy-format files for training and evaluation. Each line in a JSON file is a JSON object with the following keys:
Key | Description |
---|---|
dialog | Dialogue context and response |
replaced_idx | Index (turn) of the dialogue context utterance to be replaced |
img_idx | Index of image tensor to replace in the npy file |
score | Similarity score between the replaced utterance and the image |
dialog_dataset | Source dialogue dataset |
dialog_file | Used file name in the source dialogue dataset |
img_dataset | Source image dataset |
img_file | Used file name in the source image dataset |
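For illustration, the snippet below shows how one instance could be assembled from a JSON file and its companion npy file. It assumes `dialog` is a list of utterance strings, and the file names `train.json` and `train.npy` are placeholders, not necessarily the names in the release:

```python
import json
import numpy as np

# Placeholder file names; substitute the actual files from the release.
features = np.load("train.npy")            # pre-extracted image feature tensors

with open("train.json") as f:
    instance = json.loads(f.readline())    # one JSON object per line

dialog = instance["dialog"]                # text utterances (context + response)
turn = instance["replaced_idx"]            # context turn to be replaced by the image
image = features[instance["img_idx"]]      # feature tensor of the replacing image

# Build the multi-modal context by substituting the image for that utterance;
# the remaining turns stay textual.
context = dialog[:turn] + [image] + dialog[turn + 1:]
print(instance["score"], instance["dialog_dataset"], instance["img_dataset"])
```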
Our multi-modal dialogue dataset is constructed from three source dialogue datasets and two image captioning datasets. We provide download and paper links for all of them.
Source Dataset | Paper | Type | Download link |
---|---|---|---|
DailyDialog | paper | text | http://yanran.li/dailydialog.html |
Persona-Chat | paper | text | https://parl.ai/about/ |
EmpatheticDialogues | paper | text | https://github.com/facebookresearch/EmpatheticDialogues |
MS-COCO (2014) | paper | image | https://cocodataset.org/#download |
Flickr 30k | paper | image | https://www.kaggle.com/hsankesara/flickr-image-dataset |
Before running our code, you have to create an Anaconda environment using the given environment.yaml file:
conda env create --file environment.yaml
We provide two sets of source code: similarity-calculation and dialogue-prediction.
With the similarity-calculation code, you can calculate similarities between the source dialogue datasets and image datasets using pre-trained VSRN weights. With the dialogue-prediction code, you can run the current and next dialogue prediction tasks on our multi-modal dialogue dataset, as in the paper.
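Conceptually, the similarity step scores every (utterance, image) pair in a joint embedding space, as VSRN does, and keeps the best-matching image per utterance. The sketch below is a minimal illustration using cosine similarity over pre-computed embeddings; it is not the actual VSRN code, and the toy arrays merely stand in for encoder outputs:

```python
import numpy as np

def cosine_similarity_matrix(text_emb: np.ndarray, img_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between utterance and image embeddings.

    text_emb: (num_utterances, dim), e.g. outputs of a text encoder
    img_emb:  (num_images, dim), e.g. outputs of an image encoder
    """
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    return text_emb @ img_emb.T                  # (num_utterances, num_images)

# Toy embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 1024))
img_emb = rng.normal(size=(10, 1024))

sims = cosine_similarity_matrix(text_emb, img_emb)
best_img = sims.argmax(axis=1)     # best-matching image per utterance (cf. img_idx)
best_score = sims.max(axis=1)      # its similarity (cf. the score field above)
```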
To run our similarity-calculation code directly, you have to download all source dialogue and image datasets as well as the pre-trained VSRN weights. In particular, for the image datasets we follow VSRN in using pre-processed image features to which bottom-up attention has been applied. You can find the download links for all image features and the pre-trained VSRN weights here.
After downloading all the necessary datasets and weights into the dataset directory, run calculating_similarity.py:
python similarity-calculation/calculating_similarity.py
To run our current and next turn prediction tasks, you have to download our multi-modal dialogue dataset into the dataset directory, then run predicting_dialogue.py.
For the current turn prediction task:
python dialogue-prediction/predicting_dialogue.py --model_name $MODEL_NAME --gpu_id $GPU_ID --task current
For the next turn prediction task:
python dialogue-prediction/predicting_dialogue.py --model_name $MODEL_NAME --gpu_id $GPU_ID --task next
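If the task is set up retrieval-style (ranking candidate text responses given the multi-modal context), the scoring step might look like the minimal sketch below. The context encoding and candidate pool are random placeholders, not the models actually used by predicting_dialogue.py:

```python
import numpy as np

def rank_candidates(context_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    """Rank candidate responses by dot-product score against the context encoding."""
    scores = candidate_vecs @ context_vec      # (num_candidates,)
    return np.argsort(-scores)                 # indices, best candidate first

rng = np.random.default_rng(1)
context_vec = rng.normal(size=256)             # placeholder context encoding
candidate_vecs = rng.normal(size=(100, 256))   # placeholder candidate encodings

order = rank_candidates(context_vec, candidate_vecs)
hit_at_1 = (order[0] == 0)   # assuming index 0 holds the ground-truth response
```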
If you find the data useful and use it for your work, please consider citing the following:
@inproceedings{lee-etal-2021-constructing,
title = "Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images",
author = "Lee, Nyoungwoo and
Shin, Suwon and
Choo, Jaegul and
Choi, Ho-Jin and
Myaeng, Sung-Hyon",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.113",
pages = "897--906",
}