[CVPR 2022] A large-scale public benchmark dataset for video question-answering, with a focus on evidence and commonsense reasoning. This repository contains the code used in our paper "From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering" (CVPR 2022).

Causal-VidQA

The Causal-VidQA dataset contains 107,600 QA pairs over 26,900 video clips sourced from Kinetics-700. It aims to move video understanding beyond representation and towards reasoning. The Causal-VidQA task covers four types of questions, ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: the model must answer the question and also provide a proper reason for that answer.

Here is an example from our dataset and the comparison between our dataset and other VisualQA datasets.

Example from our Causal-VidQA Dataset
Dataset Visual Type Visual Source Annotation Description Explanation Prediction Counterfactual #Video/Image #QA Video Length (s)
Motivation Image MS COCO Man $\times$ 10,191 - -
VCR Image Movie Clip Man $\times$ 110,000 290,000 -
MovieQA Video Movie Stories Auto $\times$ $\times$ 548 21,406 200
TVQA Video TV Show Man $\times$ $\times$ 21,793 152,545 76
TGIF-QA Video TGIF Auto $\times$ $\times$ $\times$ 71,741 165,165 3
ActivityNet-QA Video ActivityNet Man $\times$ $\times$ 5,800 58,000 180
Social-IQ Video YouTube Man $\times$ $\times$ 1,250 7,500 60
CLEVRER Video Game Engine Man 20,000 305,280 5
V2C Video MSR-VTT Man $\times$ $\times$ 10,000 115,312 30
NExT-QA Video YFCC-100M Man $\times$ $\times$ 5,440 52,044 44
Causal-VidQA Video Kinetics-700 Man 26,900 107,600 9
Comparison between our dataset and other VisualQA datasets

On this page, you can find the dataset and the code for several SOTA VideoQA methods used in our CVPR paper:

  • Jiangtong Li, Li Niu and Liqing Zhang. From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering. CVPR, 2022. [paper link]

Download

  1. Visual Feature
  2. Text Feature
  3. Dataset Split
  4. Text annotation
  5. Original Data

Install

Please create a conda environment for this project (install Miniconda first if you do not already have it):

>conda create -n causal-vidqa python==3.6.12
>conda activate causal-vidqa
>git clone https://github.com/bcmi/Causal-VidQA
>cd Causal-VidQA
>pip install -r requirements.txt

Data Preparation

Please download the pre-computed features and QA annotations from Download 1-4 and place them in ['data/visual_feature'], ['data/text_feature'], ['data/split'] and ['data/QA'], respectively. Note that the Text annotation is packaged as QA.tar, so you need to unpack it before placing it in ['data/QA'].

If you want to extract different video or text features from our Causal-VidQA dataset, you can download the original data from Download 5 and run your own feature extraction on it.
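The layout described for Download 1-4 can be checked with a short script before training. The following is a minimal sketch, not part of the repository: the four directory names come from the text above, while the helper names `unpack_qa` and `missing_dirs` are hypothetical. The internal layout of QA.tar is not documented here, so the sketch returns the extracted member names for inspection instead of assuming them.

```python
import tarfile
from pathlib import Path

# Directory names are taken from the README; everything else is an assumption.
EXPECTED_DIRS = ["data/visual_feature", "data/text_feature", "data/split", "data/QA"]


def unpack_qa(archive, dest):
    """Extract the QA archive into dest and return the extracted member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def missing_dirs(root="."):
    """Return the expected data directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_dirs():
        print(f"missing or empty: {rel}")
```

If the archive turns out to contain its own top-level folder, extract it one level higher so the files do not end up nested twice.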

Usage

Once the data is ready, you can run the code. First, to train with GloVe features, you can train B2A directly by running:

>sh bash/train_glove.sh

Note that if you want to train the model with BERT features, we suggest that you first load the BERT features into sharedarray by running:

>python dataset/load.py

and then train B2A with BERT features by running:

>sh bash/train_bert.sh

After the training script finishes, you can find the prediction file under ['results/model_name/model_prefix.json'] and evaluate the prediction results by running:

>python eval_mc.py
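The scoring idea behind the two-step commonsense setting can be sketched as follows. This is a minimal illustration, not the repository's `eval_mc.py`. The record layout (`type`, `answer`, `reason` and their `pred_` counterparts) is hypothetical. Counting a prediction or counterfactual question as correct only when both the answer and the reason match is an assumption about how the two steps are combined.

```python
from collections import defaultdict

# Question types that use the two-step (answer + reason) setting per the README.
TWO_STEP_TYPES = {"prediction", "counterfactual"}


def accuracy_by_type(records):
    """Compute per-type accuracy from a list of hypothetical prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        qtype = rec["type"]
        hit = rec["pred_answer"] == rec["answer"]
        if qtype in TWO_STEP_TYPES:
            # Assumed rule: the chosen reason must be right as well.
            hit = hit and rec["pred_reason"] == rec["reason"]
        total[qtype] += 1
        correct[qtype] += int(hit)
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

Under this assumed rule, a model that picks the right answer but the wrong reason gets no credit for that question, so commonsense accuracy can be noticeably lower than answer-only accuracy.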

You can also obtain the prediction by running:

>sh bash/eval.sh

The command above will load the model from ['experiment/model_name/model_prefix/model/best.pkl'] and generate the prediction file.
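The two path patterns mentioned in this section can be written out as small helpers, which is handy for checking that a checkpoint is in place before running the evaluation script. This is only a sketch: the function names `checkpoint_path` and `prediction_path` are hypothetical and do not exist in the repository. Only the path patterns themselves come from the text above.

```python
from pathlib import Path


def checkpoint_path(model_name, model_prefix, root="."):
    """Build experiment/model_name/model_prefix/model/best.pkl under root."""
    return Path(root) / "experiment" / model_name / model_prefix / "model" / "best.pkl"


def prediction_path(model_name, model_prefix, root="."):
    """Build results/model_name/model_prefix.json under root."""
    return Path(root) / "results" / model_name / f"{model_prefix}.json"


if __name__ == "__main__":
    ckpt = checkpoint_path("B2A", "B2A")
    status = "found" if ckpt.is_file() else "not found"
    print(f"{ckpt}: {status}")
```

With `model_name` and `model_prefix` both set to `B2A`, the first helper yields the location expected for the released B2A weights.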

Hint: we have released a trained model for the B2A method. Please place the trained weights at ['experiment/B2A/B2A/model/best.pkl'] and then generate predictions by running:

>sh bash/eval.sh

(The results may be slightly different depending on the environments and random seeds.)

(For comparison, please refer to the results in our paper.)

Citation

@InProceedings{li2022from,
    author    = {Li, Jiangtong and Niu, Li and Zhang, Liqing},
    title     = {From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022}
}

Acknowledgement

Our reproduction of the methods is mainly based on NExT-QA and the other respective official repositories. We thank the authors for releasing their code. If you use the related parts, please cite the corresponding papers noted in the code comments.