Roy Ganz • Yair Kittenplon • Aviad Aberdam • Elad Ben Avraham
Oren Nuriel • Shai Mazor • Ron Litman
First, clone this repository:

```bash
git clone https://github.com/amazon-science/QA-ViT.git
cd QA-ViT
```

Next, to install the requirements in a new conda environment, run:

```bash
conda env create -f qavit.yml
conda activate qavit
```
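To sanity-check the installation, the following one-liner should print version numbers without errors (a minimal sketch; it assumes PyTorch and Hugging Face Accelerate are pinned in `qavit.yml`):

```bash
# Quick check that the core dependencies resolved correctly
python -c "import torch, accelerate; print(torch.__version__, accelerate.__version__)"
```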
Download the following datasets from the official websites, and organize them as follows:
```
QA-ViT
├── configs
│   ├── ...
├── data
│   ├── textvqa
│   ├── stvqa
│   ├── OCRVQA
│   ├── vqav2
│   ├── vg
│   ├── textcaps
│   ├── docvqa
│   ├── infovqa
│   ├── vizwiz
├── models
│   ├── ...
├── ...
```
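To scaffold the expected `data` layout before downloading, something like the following works (a sketch using bash brace expansion; the dataset contents themselves still come from the official websites):

```bash
# Create the dataset directories expected by the configs
mkdir -p data/{textvqa,stvqa,OCRVQA,vqav2,vg,textcaps,docvqa,infovqa,vizwiz}
```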
Our framework is built on DeepSpeed ZeRO stage 2 and should be configured accordingly:

```bash
accelerate config
```

The `accelerate config` command opens an interactive dialog, which should be answered as follows:
| Model | DeepSpeed stage | Grad accumulation | Grad clipping | Dtype |
|---|---|---|---|---|
| ViT+T5 base | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 large | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 xl | 2 | 2 | 1.0 | bf16 |
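Completing the dialog writes a config file (typically `~/.cache/huggingface/accelerate/default_config.yaml`). For the base and large models it should look roughly like the sketch below; field names follow recent Accelerate versions, and `num_processes` depends on your hardware:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                   # DeepSpeed ZeRO stage 2
  gradient_accumulation_steps: 1  # 2 for ViT+T5 xl (see table above)
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
machine_rank: 0
num_machines: 1
num_processes: 8                  # number of GPUs; adjust to your setup
main_training_function: main
```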
After setting up DeepSpeed, run the following command to train QA-ViT:

```bash
accelerate launch run_train.py --config <config> --seed <seed>
```
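For example, to train with a fixed seed (the config path here is a hypothetical placeholder; use an actual file from `configs/`):

```bash
accelerate launch run_train.py --config configs/base.yaml --seed 42
```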
Similarly, to evaluate a trained model, run:

```bash
accelerate launch run_eval.py --config <config> --ckpt <ckpt>
```

where `<config>` and `<ckpt>` specify the desired evaluation configuration and trained model checkpoint, respectively.
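For example (both paths are hypothetical placeholders):

```bash
accelerate launch run_eval.py --config configs/eval_textvqa.yaml --ckpt checkpoints/qavit_base.pt
```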
We provide trained checkpoints of QA-ViT in the table below:
| ViT+T5 base | ViT+T5 large | ViT+T5 xl |
|---|---|---|
| Download | Download | Download |
LLaVA's checkpoints will be uploaded soon.
QA-ViT consistently improves over the ViT+T5 baselines across all model scales:

| Method | VQAv2 (vqa-score) | COCO (CIDEr) | VQA<sup>T</sup> (vqa-score) | VQA<sup>ST</sup> (ANLS) | TextCaps (CIDEr) | VizWiz (vqa-score) | General Average | Scene-Text Average |
|---|---|---|---|---|---|---|---|---|
| ViT+T5-base | 66.5 | 110.0 | 40.2 | 47.6 | 86.3 | 23.7 | 88.3 | 65.1 |
| + QA-ViT | 71.7 | 114.9 | 45.0 | 51.1 | 96.1 | 23.9 | 93.3 | 72.1 |
| Δ | +5.2 | +4.9 | +4.8 | +3.5 | +9.8 | +0.2 | +5.0 | +7.0 |
| ViT+T5-large | 70.0 | 114.3 | 44.7 | 50.6 | 96.0 | 24.6 | 92.2 | 71.8 |
| + QA-ViT | 72.0 | 118.7 | 48.7 | 54.4 | 106.2 | 26.0 | 95.4 | 78.9 |
| Δ | +2.0 | +4.4 | +4.0 | +3.8 | +10.2 | +1.4 | +3.2 | +7.1 |
| ViT+T5-xl | 72.7 | 115.5 | 48.0 | 52.7 | 103.5 | 27.0 | 94.1 | 77.0 |
| + QA-ViT | 73.5 | 116.5 | 50.3 | 54.9 | 108.2 | 28.3 | 95.0 | 80.4 |
| Δ | +0.8 | +1.0 | +2.3 | +2.2 | +4.7 | +1.3 | +0.9 | +3.4 |
If you find this code or data useful for your research, please consider citing our paper:

```bibtex
@article{ganz2024question,
  title={Question Aware Vision Transformer for Multimodal Reasoning},
  author={Ganz, Roy and Kittenplon, Yair and Aberdam, Aviad and Avraham, Elad Ben and Nuriel, Oren and Mazor, Shai and Litman, Ron},
  journal={arXiv preprint arXiv:2402.05472},
  year={2024}
}
```