Audio Visual Scene-Aware Dialog (AVSD) Challenge at the 10th Dialog System Technology Challenge (DSTC10) https://sites.google.com/dstc.community/dstc10
The challenge setup for AVSD-DSTC10 is available: https://github.com/ankitshah009/AVSD-DSTC10_baseline
Please register for the DSTC10 challenge using this form:
https://docs.google.com/forms/d/e/1FAIpQLSe6FNXNhpb-VjiILamx6EjR_ducdRLgnP9keh2fa-Q8WvrcJQ/viewform
(To upload your files, please use your Google account.)
After registering for DSTC10, get the baseline system from the GitHub repository above.
The task setup for the previous challenges at DSTC7 and DSTC8 allowed participants
to use human-created video captions to generate answers to the dialog questions.
However, such manual descriptions are not available in real-world applications, where the
system must learn to produce answers without captions.
To encourage progress toward this end, we propose a third challenge
in DSTC10 under the video-based scene-aware dialog track.
In this challenge, we seek evidence from the system to support each generated answer
by detecting the temporal segments in the video that correspond to the answer.
June 14th, 2021: Answer generation data release
June 30th, 2021: Answer reasoning temporal localization data and baseline release:
**Released to AVSD@DSTC10 registrants only**
September 13th, 2021: Test data release
September 28th, 2021: Test submission due (extended from September 21st)
November 1st, 2021: Challenge paper submission due
January or February, 2022: Workshop
Goal: Answer generation without using manual descriptions at inference time
You can train models using manual descriptions but CANNOT use them for testing.
Video description capability must therefore be embedded within the answer generation models; a minimal input-construction sketch follows the data conditions below.
Data conditions:
a. Closed condition: use the provided videos (audio and video features), the localization information for answer reasoning,
the dialog history, and the manual video descriptions (script and summary) for training
- script: the text description used by the actors to enact the video
- summary: written by the questioners after holding 10 QAs
b. Open condition (sub-task): publicly available external data and pre-trained models may also be used for training
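To illustrate the closed condition, here is a minimal Python sketch (not the official baseline) of how a model input could be assembled from pre-extracted audio/video features and the dialog history; the file paths, feature shapes, and separator token are assumptions, and the caption is attached only on the training path:

```python
# Minimal sketch, not the official baseline; paths, shapes, and the <sep>
# token are hypothetical placeholders.
import numpy as np

def build_input(video_id, dialog_history, question, caption=None, training=False):
    # Pre-extracted per-timestep features (hypothetical file layout).
    vfeat = np.load(f"features/video/{video_id}.npy")  # shape (T_v, D_v)
    afeat = np.load(f"features/audio/{video_id}.npy")  # shape (T_a, D_a)

    # Dialog context plus the current question.
    text = list(dialog_history) + [question]
    # Closed condition: the manual description may be used for TRAINING only.
    if training and caption is not None:
        text = [caption] + text

    return {"video": vfeat, "audio": afeat, "text": " <sep> ".join(text)}
```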
Goal: Answer reasoning temporal localization
To support its answers, the system is required to show evidence without using manual descriptions.
For example, when a system-generated answer is “A dog is barking.”, the sound of the dog’s barking and
the dog itself must be grounded in the video as evidence.
The localization of audiovisual evidence is required for each generated answer.
To train reasoning localization, the begin and end times of the grounding evidence are additionally provided for the training data.
Data conditions:
Temporal localization information for answer reasoning is provided as begin and end timestamps marking the evidence scenes (a hypothetical example follows this list)
a. Closed condition: use only the provided audio and video features with the localization information for answer reasoning
b. Open condition (sub-task): any publicly available data and pre-trained models may also be used for training
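The exact annotation schema is defined by the released data files; purely as a hypothetical illustration, a single annotated QA turn with its evidence timestamps might look like the following:

```python
# Hypothetical layout of one annotated QA turn; the field names are
# assumptions, as the real schema is defined by the released data.
example_turn = {
    "question": "What is the dog doing?",
    "answer": "A dog is barking.",
    # Begin and end times, in seconds, of the audiovisual evidence.
    "evidence": [{"begin": 12.4, "end": 17.9}],
}
```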
Evaluation criteria:
- Answer quality: Likert scale ratings by 5 humans, plus similarity compared with single and multiple ground truths
- Localization: Intersection over Union (IoU) compared with the “single” evidence timing
The baseline system is based on an Audio Visual Transformer for dialog response generation.
Information on accessing the baseline is available at the GitHub repository above.
(Registration is required.)
Output:
Answer generation considering dialog context
Evidence timing detection based on attention weights
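The baseline's actual localization mechanism is documented in its repository; as a generic sketch of the idea, the temporal attention weights over video frames can be normalized and thresholded, with the longest above-threshold run reported as the evidence segment (the threshold and frame rate here are assumptions):

```python
import numpy as np

def evidence_from_attention(attn, fps=1.0, threshold=0.5):
    """Return (begin_sec, end_sec) of the longest run of frames whose
    normalized attention weight exceeds the threshold; attn has shape (T,)."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    above = np.append(a >= threshold, False)  # sentinel closes the last run
    best, start = None, None
    for t, on in enumerate(above):
        if on and start is None:
            start = t
        elif not on and start is not None:
            if best is None or t - start > best[1] - best[0]:
                best = (start, t)
            start = None
    return None if best is None else (best[0] / fps, best[1] / fps)
```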
Evaluation:
Validation data (1,787 dialogs) is evaluated against the “single” ground-truth answer and evidence timing:
Sentence similarity: BLEU, METEOR, CIDEr
Timing overlap: Intersection over Union (IoU); a toy IoU computation follows this section
Official evaluation:
- Likert scale ratings by 5 humans
- Similarity compared with single and multiple ground truths
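For reference, temporal IoU is the length of the overlap between the predicted and ground-truth segments divided by the length of their union; a toy computation, with segments assumed to be (begin, end) pairs in seconds:

```python
def temporal_iou(pred, gt):
    """IoU of two (begin, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10.0, 18.0), (12.4, 17.9)))  # 5.5 / 8.0 ≈ 0.69
```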
Additional Data:
Evidence timing for the training data (7,659 dialogs) will be provided soon.
- To join the mailing list: visit https://groups.google.com/a/dstc.community/forum/#!forum/list/join
- To post a message: send your message to list@dstc.community
- To leave the mailing list: visit https://groups.google.com/a/dstc.community/forum/#!forum/list/unsubscribe
AVSD@DSTC10 Organizers: Chiori Hori & Ankit Shah
Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Chiori Hori and Tim K. Marks