This project is the implementation of the paper Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering. Compared with existing state-of-the-art approaches such as MCB and MLB, our MFB models achieved superior performance on the large-scale VQA-1.0 and VQA-2.0 datasets. The MFB+CoAtt network architecture for VQA is illustrated in Figure 1.
Figure 1: The MFB+CoAtt network architecture for VQA.

A third-party PyTorch implementation of MFB (MFH) is released here. Many thanks, Liam!
Using the image features here (from the model with adaptive K ranging from 10 to 100), our single MFH+CoAtt+GloVe model achieved an overall accuracy of 68.76% on the test-dev set of the VQA-2.0 dataset. With an ensemble of 8 models, we achieved a new state-of-the-art result on the VQA-2.0 leaderboard, with an overall accuracy of 70.92%.
Our solution for the VQA Challenge 2017 is updated!
We proposed a high-order extension of MFB, i.e., Multi-modal Factorized High-order Pooling (MFH). See the flowchart in Figure 2 and the implementations in the `mfh_baseline` and `mfh-coatt-glove` folders. With an ensemble of 9 MFH+CoAtt+GloVe(+VG) models, we won 2nd place (tied with another team) in the VQA Challenge 2017. Details can be found in our paper (the second paper in the CITATION section at the bottom of this page).
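The MFB and MFH pooling steps can be sketched in plain Python as follows. This is a minimal, dependency-free illustration and not the Caffe implementation in this repository: the function names (`mfb_expand`, `mfb_squeeze`, `mfb`, `mfh`) are hypothetical, dropout is omitted, and applying the power and L2 normalization to each MFH block before concatenation is an assumption based on the papers.

```python
import math
import random


def linear(weights, vec):
    """Project vec with a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mfb_expand(x, y, U, V):
    """Expand stage: elementwise product of the two projections (length k*o)."""
    return [a * b for a, b in zip(linear(U, x), linear(V, y))]


def mfb_squeeze(z_exp, k):
    """Squeeze stage: sum pooling over non-overlapping windows of size k,
    followed by power (signed square-root) and L2 normalization."""
    pooled = [sum(z_exp[i:i + k]) for i in range(0, len(z_exp), k)]
    powered = [math.copysign(math.sqrt(abs(p)), p) for p in pooled]
    norm = math.sqrt(sum(p * p for p in powered)) or 1.0
    return [p / norm for p in powered]


def mfb(x, y, U, V, k):
    """Single MFB block: expand, then squeeze. Output length is o."""
    return mfb_squeeze(mfb_expand(x, y, U, V), k)


def mfh(x, y, blocks, k):
    """High-order extension: each block's expanded features are multiplied
    elementwise by the previous block's, and the squeezed outputs are
    concatenated. `blocks` is a list of (U, V) pairs."""
    out, prev = [], None
    for U, V in blocks:
        z_exp = mfb_expand(x, y, U, V)
        if prev is not None:
            z_exp = [a * b for a, b in zip(z_exp, prev)]
        prev = z_exp
        out.extend(mfb_squeeze(z_exp, k))
    return out


# Toy usage with random weights (all sizes are illustrative assumptions).
random.seed(0)
m, n, k, o = 4, 3, 2, 5  # image dim, question dim, factors, output dim


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


x = [random.gauss(0, 1) for _ in range(m)]
y = [random.gauss(0, 1) for _ in range(n)]
z = mfb(x, y, rand_matrix(k * o, m), rand_matrix(k * o, n), k)
blocks = [(rand_matrix(k * o, m), rand_matrix(k * o, n)) for _ in range(2)]
z_high = mfh(x, y, blocks, k)
print(len(z), len(z_high))
```

With two blocks, the MFH output is twice as long as the MFB output because the per-block features are concatenated.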
Our code is built on the high-quality vqa-mcb project. The data preprocessing and other prerequisites are the same as theirs. Before running our scripts to train or test an MFB model, see the `Prerequisites` and `Data Preprocessing` sections in the README of the vqa-mcb project.
- The Caffe version required for MFB is slightly different from the one for MCB. We added some layers, e.g., sum pooling, permute, and KLD loss layers, to the `feature/20160617_cb_softattention` branch of Caffe for MCB. Please check out our Caffe version here and compile it. Note that cuDNN is currently not compatible with sum pooling, so you should switch it off to run the code correctly.
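For readers unfamiliar with the KLD loss mentioned above, the following is a minimal sketch of a Kullback-Leibler divergence loss between a target answer distribution and the softmax of the predicted scores. It only illustrates the general formula; the function names are hypothetical, and how the target distribution is built and how the added Caffe layer reduces over a batch are not specified here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kld_loss(target, logits, eps=1e-12):
    """KL(target || softmax(logits)); zero-probability targets contribute 0."""
    pred = softmax(logits)
    return sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )


# A one-hot target against uniform predictions gives log(3) for three answers.
loss = kld_loss([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
print(loss)
```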
We release the pretrained single models "MFB (or MFH)+CoAtt+GloVe+VG" described in the papers. To the best of our knowledge, our MFH+CoAtt+GloVe+VG model reports the best single-model result (test-dev) on both the VQA-1.0 and VQA-2.0 datasets (trained on train + val + Visual Genome). The corresponding results are shown in the table below. The result JSON files (results.zip for VQA-1.0) are also included in the model folders and can be uploaded to the evaluation servers directly.
Datasets\Models | MCB | MFB | MFH | MFH (FRCN img features) |
---|---|---|---|---|
VQA-1.0 | 65.38% | 66.87% BaiduYun | 67.72% BaiduYun or Dropbox | 69.82% |
VQA-2.0 | 62.33%<sup>1</sup> | 65.09% BaiduYun | 66.12% BaiduYun or Dropbox | 68.76%<sup>2</sup> |

<sup>1</sup> The MCB result on VQA-2.0 is provided by the VQA Challenge organizer.

<sup>2</sup> This model is trained without the VG dataset. We observed that introducing VG here brings no further performance improvement.
We provide scripts for training two MFB models from scratch, in the `mfb-baseline` and `mfb-coatt-glove` folders. Simply run the Python scripts `train_*.py` to train the models from scratch.
- Most of the hyper-parameters and configurations are defined, with comments, in the `config.py` file.
- The solver configurations are defined in the `get_solver` function in the `train_*.py` scripts.
- A pretrained GloVe word embedding model (via the spaCy library) is required to train the mfb-coatt-glove model. Installation instructions for spaCy and the GloVe model can be found here.
To generate an answers JSON file in the format expected by the VQA evaluation code and the VQA test server, you can use `eval/ensemble.py`. This code can also ensemble multiple models. Running `python ensemble.py` prints a help message describing the arguments to use.
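As a rough illustration of what ensembling and result-file generation involve, the sketch below sums per-answer scores from several models and writes a list of `question_id`/`answer` records, the layout used by VQA result files. It is not the logic of `eval/ensemble.py`: the function name `ensemble_answers`, the input layout (one dict per model mapping question id to answer scores), and the summation rule are all assumptions made for this example.

```python
import json
from collections import defaultdict


def ensemble_answers(model_scores):
    """Sum answer scores across models and pick the top answer per question.

    model_scores: list with one dict per model, mapping
    question_id -> {answer string: score}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for scores in model_scores:
        for question_id, answer_scores in scores.items():
            for answer, score in answer_scores.items():
                totals[question_id][answer] += score
    return [
        {"question_id": question_id, "answer": max(answers, key=answers.get)}
        for question_id, answers in sorted(totals.items())
    ]


# Two toy "models" that disagree on question 1 (made-up scores).
model_a = {1: {"yes": 0.6, "no": 0.4}, 2: {"2": 0.5, "3": 0.5}}
model_b = {1: {"yes": 0.3, "no": 0.7}, 2: {"2": 0.2, "3": 0.8}}
results = ensemble_answers([model_a, model_b])
results_json = json.dumps(results)
print(results_json)
```

Summing scores lets a confident model outvote an uncertain one, which is why question 1 resolves to the second model's answer in this toy case.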
This code is distributed under the MIT License. The released models may be used for non-commercial purposes only.
If this code is helpful for your research, please cite:
@article{yu2017mfb,
title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
journal={IEEE International Conference on Computer Vision (ICCV)},
pages={1839--1848},
year={2017}
}
@article{yu2018beyond,
title={Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering},
author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
journal={IEEE Transactions on Neural Networks and Learning Systems},
doi={10.1109/TNNLS.2018.2817340},
year={2018}
}
Zhou Yu [yuz(AT)hdu.edu.cn]