This is the implementation of the paper "Dual Temporal Grounding-enhanced Video Dialog" (DTGVD).
Please download the required data from the DSTC10 homepage and place it in `data/`, including:
- test_set4DSTC7-AVSD.json
- test_set4DSTC8-AVSD.json
- train_set4DSTC8-AVSD+reason
- valid_set4DSTC8-AVSD+reason
If you also want to use audio features, download:
- vggish.tgz
- vggish_testset.tgz
As for video features, please download the RGB frames from Charades and extract features with the pre-trained S3D model, for example as sketched below.
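A minimal sketch of the frame-to-feature step follows. The `load_s3d_model` helper, the clip length, and the frame layout are assumptions here; adapt them to the S3D checkpoint you actually use.

```python
import glob

import numpy as np
import torch
from PIL import Image


def load_s3d_model():
    # Hypothetical helper: plug in your pre-trained S3D checkpoint here.
    # It is assumed to map a clip tensor [1, 3, T, H, W] to a feature vector.
    raise NotImplementedError("load your pre-trained S3D model")


def extract_features(frame_dir, clip_len=16, size=224):
    model = load_s3d_model().eval()
    frames = sorted(glob.glob(f"{frame_dir}/*.jpg"))
    feats = []
    with torch.no_grad():
        for start in range(0, len(frames) - clip_len + 1, clip_len):
            clip = [
                np.asarray(
                    Image.open(p).convert("RGB").resize((size, size)),
                    dtype=np.float32,
                ) / 255.0
                for p in frames[start:start + clip_len]
            ]
            # [T, H, W, 3] -> [1, 3, T, H, W]
            x = torch.from_numpy(np.stack(clip)).permute(3, 0, 1, 2).unsqueeze(0)
            feats.append(model(x).squeeze(0).cpu().numpy())
    return np.stack(feats)  # [num_clips, feature_dim]
```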
The following dependencies are required:
- python==3.6.9
- torch==1.7.1+cu92
- tqdm
- boto3
- requests
- pandas
- nlg-eval (install Java 1.8.0 or higher first)
Set up the environment with:

```bash
conda create -n DTGVD python=3.6.9 tqdm boto3 requests pandas
conda activate DTGVD
# +cu92 wheels are hosted on the PyTorch index rather than PyPI, hence the -f flag
pip install torch==1.7.1+cu92 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/Maluuba/nlg-eval.git@master
```
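After installation, a quick sanity check (the expected version string assumes the CUDA 9.2 build above):

```python
import torch

# Expect "1.7.1+cu92" and True on a machine with a working CUDA 9.2 setup.
print(torch.__version__)
print(torch.cuda.is_available())
```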
Our model consists of two parts, i.e., grounding and dialog, which can be switched directly by changing `--task_type`.
To run the whole pipeline, first use the grounding part to obtain the timestamps corresponding to each QA pair and save them as a pickle file. Then, in the dialog evaluation, point `--grounding_results` to that pickle file to obtain the final results, as sketched below.
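For illustration, a minimal sketch of serializing the grounding outputs. The schema below (keys and timestamp format) is an assumption; match it to what the dialog part actually expects.

```python
import pickle

# Assumed schema: one (start_sec, end_sec) span per (video_id, QA-turn) pair.
grounding_results = {
    ("video_id", 0): (2.4, 7.9),
    ("video_id", 1): (10.1, 15.6),
}

with open("grounding_results.pkl", "wb") as f:
    pickle.dump(grounding_results, f)
```

Then pass `--grounding_results grounding_results.pkl` (the file name is just an example) during dialog evaluation.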
The hyperparameters are defined in main.py; to run with the default hyperparameters:
```bash
# e.g., distributed over 2 GPUs
python -m torch.distributed.launch --nproc_per_node=2 --master_port 10000 main.py
# or on a single GPU
python -m torch.distributed.launch --nproc_per_node=1 --master_port 10000 main.py
```
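For reference, a sketch of how the flags mentioned above might be parsed in main.py. Only `--task_type` and `--grounding_results` come from this README; the accepted values and defaults below are assumptions (check main.py for the authoritative list), and `--local_rank` is supplied automatically by `torch.distributed.launch`.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed choices and defaults; see main.py for the real ones.
parser.add_argument("--task_type", choices=["grounding", "dialog"], default="dialog")
parser.add_argument("--grounding_results", type=str, default=None,
                    help="pickle file produced by the grounding stage")
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()
```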