- This project provides the code to reproduce the results of *Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text* on the TextVQA dataset
- We are grateful to MMF (formerly Pythia), an excellent VQA codebase provided by Facebook, on which our code is built
- We achieved 32.46% accuracy (ensemble) on the test set of TextVQA
- PyTorch 1.0.1 post
- We performed our experiments on a Maxwell Titan X GPU, which has 12 GB of GPU memory
- See `requirements.txt` for the required Python packages. Let's begin by cloning this repository and installing them:

```bash
$ git clone https://github.com/ricolike/mmgnn-textvqa.git
$ cd mmgnn-textvqa
$ pip install -r requirements.txt
```
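A quick optional sanity check that the environment matches the setup above (the exact version string and GPU name on your machine may differ):

```bash
# Optional: confirm the PyTorch version and that a CUDA GPU is visible.
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
$ nvidia-smi --query-gpu=name,memory.total --format=csv
```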
- cached data: To boost data-loading speed under a limited memory size (64G) and to speed up calculation, we cached intermediate dataloader results in storage. Download the data (around 54G, and around 120G unzipped), and modify line 11 (`fast_dir`) in the config to the absolute path where you saved it (see the sketch after this list)
- other files: Download the other needed files (vocabulary, OCRs, some parameters of the backbone) here (less than 1G), and make a soft link named `data` under the repo root pointing to where you saved them
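A minimal sketch of this setup, assuming hypothetical download locations under `/data/mmgnn` (the archive name and paths are placeholders; point `fast_dir` at wherever you actually unzip the cache):

```bash
# Hypothetical paths -- substitute wherever you downloaded the files.
$ unzip cached_data.zip -d /data/mmgnn/cached     # ~120G once unzipped
$ ln -s /data/mmgnn/other_files ./data            # soft link named `data` under the repo root
# Then edit `fast_dir` (line 11 of the config, e.g. configs/vqa/textvqa/s_mmgnn.yml):
#   fast_dir: /data/mmgnn/cached
```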
- Create a new model folder under `ensemble`, say `foo`, and then copy our config into it:

```bash
$ mkdir -p ensemble/foo
$ cp ./configs/vqa/textvqa/s_mmgnn.yml ./ensemble/foo
```
- Start training; parameters will be saved in `ensemble/foo`:

```bash
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/foo/s_mmgnn.yml -dev cuda:0 --run_type train
```
- The first run of this repo will automatically download GloVe vectors into `pythia/.vector_cache`, so be patient. If training succeeds, you will find a `s_mmgnnbar_final.pth` in the model folder `ensemble/foo`
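Optionally, a quick check that the step above completed (names follow the folders mentioned earlier):

```bash
# Optional: GloVe vectors should now be cached, and a final checkpoint saved under ensemble/foo.
$ ls pythia/.vector_cache
$ ls ensemble/foo/*_final.pth
```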
- If you want to skip the training procedure, a trained model is provided, on which we can directly do inference
- Start inference by running the following command. If it succeeds, you will find three new files generated under the model folder; the two ending with `_evalai.p` are ready to be submitted to EvalAI to check the results:

```bash
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/bar/s_mmgnn.yml --resume_file <path_to_pth> -dev cuda:0 --run_type all_in_one
```
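For example, assuming the downloaded checkpoint was saved at the hypothetical path `ensemble/bar/s_mmgnn_final.pth`:

```bash
# `ensemble/bar/s_mmgnn_final.pth` is a placeholder -- use the path where you saved the provided model.
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn \
    --config ensemble/bar/s_mmgnn.yml \
    --resume_file ensemble/bar/s_mmgnn_final.pth \
    -dev cuda:0 --run_type all_in_one
```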
```bibtex
@article{gao2020multi,
  title={Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text},
  author={Gao, Difei and Li, Ke and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2003.13962},
  year={2020}
}
```
Question: "What is the name of the bread sold at the place?"
Answer: "Panera"
(The white box is the predicted answer, green boxes are the OCR tokens "Panera" attends to, and red boxes are the visual ROIs "Panera" attends to; box weight indicates attention strength.)