This repo contains the code for the paper
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Hung Le, Chinnadhurai Sankar, Seungwhan Moon, Ahmad Beirami, Alborz Geramifard, Satwik Kottur
[ArXiv].
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs.
If you find this code useful, consider citing our work:
@inproceedings{le-etal-2021-dvd,
title = "{DVD}: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue",
author = "Le, Hung and
Sankar, Chinnadhurai and
Moon, Seungwhan and
Beirami, Ahmad and
Geramifard, Alborz and
Kottur, Satwik",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.439",
doi = "10.18653/v1/2021.acl-long.439",
pages = "5651--5665",
abstract = "A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogue. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset are publicly available.",
}
The dataset can be downloaded here. The shared drive contains 2 zip files: dvd_dialogues, containing all dialogues of the training, validation, and test splits, and dvd_resnext101, containing the extracted ResNeXt101 features of all CATER videos (the features are extracted with the best-performing model pretrained on Kinetics here). Please refer to the dvd_codebase folder in this repo to load and batch data (see Structure and Scripts below for more information).
The dataset statistics of DVD are:
Split | #Videos | #Dialogues | #Questions | # Unique Questions |
---|---|---|---|---|
DVD-Train | 6,157 | 61,551 | 615,510 | 360,334 |
DVD-Val | 1,540 | 15,396 | 153,960 | 99,211 |
DVD-Test | 3,299 | 32,978 | 329,780 | 200,346 |
DVD-Total | 10,996 | 109,925 | 1,099,250 | 620,739 |
The code was built upon the codebase from CLEVR. Thanks to the authors for sharing!
The repo contains the following:
- `cater_preprocessing`: data preprocessing of CATER videos
  - `download.sh`: script to download the original CATER dataset (max2action and all_actions splits)
  - `update_scene.py`: update scene graph information from CATER original scene json files
  - `utils.py`: functions on scene graphs, e.g. precompute filter options of objects, spatial relationships, video intervals
- `dvd_generation`: generation of dialogues/QA based on the video annotation created from `cater_preprocessing`
  - `generate_dialogues.py`: main code to simulate dialogues
  - `run.sh`: script to run `generate_dialogues.py` with setting parameters
  - `filters`: filter functions to construct valid questions
    - `dialogue_filters.py`: filter functions to simulate dependencies over turns
    - `scene_filters.py`: manage filter functions
    - `spatial_filters.py`: filter functions for valid object attributes
    - `temporal_filters.py`: filter functions for valid object actions
    - `constraint_filters.py`: check constraints by question types, e.g. `NO_NULL` = no empty attribute value
  - `question_templates`: contains predefined question templates, synonyms, and metadata json files. Many of the question templates are built upon templates from CLEVR.
    - `add_action_attribute.py`: add action as an attribute in each object, e.g. "the moving cube"
    - `add_atomic_action_query.py`: add new question templates for queries of intervals with atomic actions (e.g. object with max. 1 action)
    - `add_compositional_query.py`: add new question templates for queries of intervals with compositional actions (e.g. object with max. >1 action)
    - `add_cater_constraints.py`: add new constraints specific to CATER, e.g. object containment constraint (only cones can contain other objects)
    - `add_dialogue_constraints.py`: add new constraints to make questions compatible in a dialogue, e.g. a question contains potential reference candidates
    - `add_other_templates.py`: add new question templates, e.g. questions about an action based on the action's order
    - `create_templates.sh`: run all the above steps
  - `simulators`: process logical programs to obtain ground-truth answers and generate question sentences in natural language form
    - `template_dfs.py`: traverse through the execution tree of program layouts
    - `question_engine.py`: manage logical functions
    - `spatial_question_engine.py`: implement logical spatial-based functions, e.g. object count, object exist, etc.
    - `temporal_question_engine.py`: implement logical temporal-based functions, e.g. action filter, action count, etc.
    - `question_generator.py`: generate question sentences
  - `utils`: commonly used scripts
    - `configs.py`: set up script parameters, e.g. directory, number of templates per video, etc.
    - `data_loader.py`: load data, including metadata, synonyms, templates, CATER files, etc.
    - `dialogue_utils.py`: functions on simulated dialogue turns with linguistic dependencies
    - `scene_utils.py`: functions on scene annotation, e.g. spatial relations
    - `utils.py`: other common functions
    - `global_vars.py`: global variables, e.g. action mapping, action nouns/verbs
- `dvd_codebase`: basic functions to load the DVD dataset
  - `main.py`: main process to load videos, dialogues, vocabulary, and create batches
  - `run.sh`: script to specify parameters, data and output directories
  - `configs/configs.py`: setting parameters
  - `data`: basic functions to handle data and batch files
    - `data_handler.py`: functions to load dialogues, videos, and create vocabulary
    - `data_utils.py`: basic functions to support data loading and preprocessing
    - `dataset.py`: definition of the Dataset and Batch classes
    - `analysis_utils.py`: functions to analyze output results, e.g. by question types, question subtypes, transfer accuracy, etc.
Follow these steps to preprocess CATER videos:
- `cd cater_preprocessing`
- Run `./download.sh` to download CATER videos into the `cater_preprocessing` folder
- Create video annotation from the CATER files by running `python update_scene.py --cater_split <split> --scene_start_idx <start index> --scene_end_idx <end index>`
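If you prefer to preprocess the scenes in chunks (for example, to run several index ranges in parallel), the sketch below simply wraps the `update_scene.py` command above; the split name is one of the CATER splits mentioned earlier, while the chunk size and total scene count are illustrative placeholders.

```python
import subprocess

# Illustrative values: process the max2action split in chunks of 1000 scenes.
split = "max2action"
chunk_size = 1000
total_scenes = 5500  # placeholder: replace with the number of scenes in your split

for start in range(0, total_scenes, chunk_size):
    end = min(start + chunk_size, total_scenes)
    # Same command as in the step above, one chunk of scene indices at a time.
    subprocess.run(
        ["python", "update_scene.py",
         "--cater_split", split,
         "--scene_start_idx", str(start),
         "--scene_end_idx", str(end)],
        check=True,
    )
```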
Follow these steps to simulate dialogues and generate annotations:
- `cd dvd_generation`
- Optional: recreate question templates by running `./question_templates/create_templates.sh`. You can skip this step and directly use the question templates in this repo.
- Generate dialogues and annotations by running `./run.sh <cater split> <start index> <end index> <output directory> <train/val/test split>`. Please change `project_dir` to the root directory of this repo and `cater_dir` to the preprocessed CATER video directory from above.
Follow these steps to load and batch the dataset:
- `cd dvd_codebase`
- Load the DVD dataset and create batches by running `./run.sh <gpu device id> <1 if debugging with small data else 0>`. Please change the parameters `data_dir` and `fea_dir` to the data and feature directories (e.g. the unzipped locations of dvd_dialogues and dvd_resnext101).
We created a notebook in this repo to demonstrate how different annotations can be extracted from DVD dialogues. Specifically, we used an example CATER video and DVD dialogue and defined helper functions to display various annotation details, e.g. question types/subtypes, tracked objects, tracked intervals, etc.
- Each DVD dialogue json file contains one dialogue for a specific video from CATER. All object ids, actions, and frame ids are referenced from the annotations of the CATER video.
- Each dialogue has 10 turns. In each turn, the data is a dictionary with the following attributes (a minimal loading sketch follows this list):
  - `question`: a question about the video
  - `answer`: the answer to the above question based on the visual content of the video
  - `turn_dependencies`: the cross-turn dependencies embedded in this turn. The 1st turn of each dialogue always has `none`-type dependencies (no cross-turn relations)
    - `temporal`: relations that determine the video interval of the current turn, including:
      - `<1/2/3/4>_<flying/sliding/rotating>_among_<before/after/during>`: action reference to a set of actions in the previous turn, e.g. "among them, after the third slide"
      - `prior_<flying/sliding/rotating>_<before/after/during>`: action reference to a unique action in the previous turn, e.g. "during this slide"
      - `after`/`before`/`during`: interval references to the interval of the previous turn, e.g. "after this period"
      - `video_update`: topic transfer (temporal) with incremental video input relative to the video input of the previous turn, e.g. "what about up until now"
      - `earlier_unique_obj_none`: interval with long-term object references, e.g. "during the aforementioned yellow thing's first rotation"
      - `last_unique_obj_none`: interval with short-term object references, e.g. "before its third rotation"
    - `spatial`: topic transfer (spatial) from the previous turn, including `left`/`right`/`front`/`behind`, e.g. "what about to the left of it?"
    - `attribute`: topic transfer (attribute) from the previous turn, including `query_color`/`query_shape`/`query_size`/`query_material`, e.g. "what about its color?"
    - `object`: object references to objects mentioned in the dialogue context, including:
      - `earlier_unique`: long-term object references (> 1 turn distance), e.g. "the earlier mentioned red sphere"
      - `last_unique`: short-term object references (1-turn distance), e.g. "them", "it"
  - `program`: the functional program used to solve the question in a multi-step reasoning process. This is a sequence of nodes, each node including the following attributes:
    - `type`: type of node, e.g. `filter_color`, `count_object`, etc.
    - `inputs`: indices of the preceding nodes; their outputs are inputs to the current node
    - `side_inputs`: parameters of the current node, e.g. "green", "yellow", "rubber", "before", "after", etc.
    - `_output`: the output of the current node, e.g. object count, object ids, interval period by start/end frame id
    - Please refer to the Appendix of the paper for more details on functional program types and data types
  - `template`: template of the question, containing the information to determine the question interval type and question type/subtype. Other information includes:
    - `cutoff`: the cutoff event from the original CATER video. The input video of this turn runs from frame #0 to the cutoff event
    - `used_periods`: contains all time periods up to the current turn. Each period is determined by a start event and an end event
      - event: each cutoff or start/end event is defined by the start/end time of an object action, in the form `[<object_id>, start/end_rotating/sliding/flying, <order>, <frame id>]`
      - if an event is `None`, it is either the start or the end of the original CATER video
    - `used_objects`: all unique objects mentioned up to the previous turn. This is used to resolve any long-term object references in the question of the current turn. It is a dictionary with the object id as key and the following values:
      - `original_turn`: the turn id at which the object was originally mentioned
      - object attributes mentioned in the dialogue so far: `<Z>`: size, `<C>`: color, `<M>`: material, `<S>`: shape
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.