This repository reproduces the VQA experiments of the OFA model.

Following the original paper, we use the VQA v2.0 dataset and its evaluation metric. However, the official preprocessing produces a zip file of more than 100 GB, which is difficult to download and unzip, so I use the chunked version from OFA-Sys/OFA#68 (comment):
```shell
bash download.sh
cd dataset/vqa_data
cat vqa_train_* > vqa_train.tsv
cat vqa_test_* > vqa_test.tsv
```
## Environment
```shell
export PYTHONPATH=$PYTHONPATH:/data/hzz5361/vision_and_lang/final/OFA/fairseq
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```
## Evaluation
### Download Pretrained Model
The allcand evaluation is very demanding on GPU memory. Since my RTX A5000 has only 24 GB of memory, I can run the allcand evaluation only with batch size 4.
```shell
cd run_scripts/vqa
bash evaluate_vqa_beam_base.sh val
bash evaluate_vqa_base_allcand.sh val
```
## Results
| Task | Image Captioning | VQA | Visual Entailment | Referring Expression Comprehension | | |
|---|---|---|---|---|---|---|
| Dataset | COCO | VQA v2 | SNLI-VE | RefCOCO | RefCOCO+ | RefCOCOg |
| Split | Karpathy test (CE/CIDEr) | test-dev/test-std | val/test | val/test-a/test-b | val/test-a/test-b | val-u/test-u |
| Metric | CIDEr | Acc. | Acc. | Acc. | | |
| OFA<sub>Tiny</sub> | 119.0 / 128.7 | 70.3 / 70.4 | 85.3 / 85.2 | 80.20 / 84.07 / 75.00 | 68.22 / 75.13 / 57.66 | 72.02 / 69.74 |
| OFA<sub>Medium</sub> | 130.4 / 140.3 | 75.4 / 75.5 | 86.6 / 87.0 | 85.34 / 87.68 / 77.92 | 76.09 / 83.04 / 66.25 | 78.76 / 78.58 |
| OFA<sub>Base</sub> | 138.2 / 146.7 | 78.0 / 78.1 | 89.3 / 89.2 | 88.48 / 90.67 / 83.30 | 81.39 / 87.15 / 74.29 | 82.29 / 82.31 |
| OFA<sub>Large</sub> | 142.2 / 150.7 | 80.4 / 80.7 | 90.3 / 90.2 | 90.05 / 92.93 / 85.26 | 85.80 / 89.87 / 79.22 | 85.89 / 86.55 |
| OFA<sub>Huge</sub> | 145.3 / 154.9 | 82.0 / 82.0 | 91.0 / 91.2 | 92.04 / 94.03 / 88.44 | 87.86 / 91.70 / 80.71 | 88.07 / 88.78 |
My reproduced VQA results with different beam sizes, next to the reported OFA<sub>Base</sub> numbers:

| Model | VQA Acc. |
|---|---|
| OFA<sub>Base</sub> (reported) | 78.0 / 78.1 |
| OFA<sub>Base</sub>-Beam-3 | 77.94 / - |
| OFA<sub>Base</sub>-Beam-10 | 77.56 / - |