VQA-but-classify

Visual question answering with transformers in Vietnamese; other languages are also supported.


Deep Modular Co-Attention Networks (MCAN)

This repository provides a PyTorch implementation of MCAN for VQA in Vietnamese.

Original MCAN: link

Using the commonly used bottom-up-attention visual features, a single MCAN model delivers 55.48% overall accuracy on the test split of the ViVQA dataset, significantly outperforming existing state-of-the-art models in Vietnamese. Please see the "ViVQA: Vietnamese Visual Question Answering" paper for details.

Overview of our MCAN reimplementation

Updates

June 2, 2023

Table of Contents

  1. Prerequisites
  2. Training and Validation
  3. Testing
  4. Pretrained models

Prerequisites

Software and Hardware Requirements

You will need a machine with at least one GPU (>= 8 GB VRAM), 20 GB of RAM, and 50 GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.

You should first install some necessary packages.

  1. Install Python >= 3.5

  2. Install CUDA >= 9.0 and cuDNN

  3. Install PyTorch >= 0.4.1 with CUDA support (PyTorch 1.x and 2.0 are also supported), then install the remaining packages:

    $ pip install -r requirements.txt
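
After installing the requirements, you can quickly check that PyTorch was built with CUDA support and can see your GPU (a minimal sanity check, not part of the repository scripts):

    import torch

    # Print the installed PyTorch version and the visible GPU, if any
    print(torch.__version__)
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
    else:
        print("No CUDA device detected")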

Dataset

  1. Download the ViVQA dataset. For more information about relevant published papers and datasets, please visit UIT-NLP.

Finally, the data folder should have the following structure:

-- data
  |-- train-val
  |  |-- 000000000001.jpg
  |  |-- ...
  |-- test
  |  |-- 000000000002.jpg
  |  |-- ...
  |-- vivqa_train.json
  |-- vivqa_dev.json
  |-- vivqa_test.json

Each of vivqa_train.json, vivqa_dev.json and vivqa_test.json follows the same format, for example:

    {
        "images": [
            {
                "id": 68857,
                "filename": "000000068857.jpg",
                "filepath": "train"
            },
            ...
        ],
        "annotations": [
            {
                "id": 5253,
                "image_id": 482062,
                "question": "cái gì trên đỉnh đồi cỏ nhìn lên bầu trời xanh sáng",
                "answers": [
                    "ngựa vằn"
                ]
            },
            ...
        ]
    }
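
To double-check the download, you can load one of the annotation files and print a few statistics (a minimal sketch; it only relies on the field names shown above):

    import json

    # Load the training annotations (path follows the folder layout above)
    with open("data/vivqa_train.json", encoding="utf-8") as f:
        data = json.load(f)

    print(len(data["images"]), "images")
    print(len(data["annotations"]), "question-answer pairs")

    # Inspect one question-answer pair
    sample = data["annotations"][0]
    print(sample["question"], "->", sample["answers"][0])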

Training and Validation

The following command starts training with the default hyperparameters in config.yaml:

$ python3 main.py --config config.yaml
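
Internally, main.py reads the YAML config and launches the training task. The sketch below shows the rough idea; the import path of STVQA_Task (the class referenced in the Testing section) is an assumption, not the repository's actual module layout:

    import argparse
    import yaml

    # Rough sketch of the training entry point; the real main.py may differ.
    from tasks import STVQA_Task  # hypothetical import path

    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default="config.yaml")
    args = parser.parse_args()

    # Load hyperparameters such as images_folder from the YAML file
    with open(args.config, encoding="utf-8") as f:
        config = yaml.safe_load(f)

    STVQA_Task(config).training()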

Testing

To run testing, open main.py and comment out the training call, i.e. "#STVQA_Task(config).training()". Then open config.yaml and set "images_folder" to the test images folder, e.g. "images_folder: test". Finally, run:

$ python3 main.py --config config.yaml
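
Before re-running the command, you can verify that the config now points at the test images (a hypothetical check, not part of the repository):

    import yaml

    # Confirm that images_folder was switched to the test split
    with open("config.yaml", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    assert config["images_folder"] == "test", config["images_folder"]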

Pretrained models

We use two pretrained models, namely ViT (Vision Transformer) and PhoBERT.
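
Both are available through the Hugging Face transformers library. The snippet below shows how they can be loaded; the exact checkpoint names used in this repository are assumptions:

    from transformers import AutoModel, AutoTokenizer, ViTModel

    # Hypothetical checkpoints; ViT encodes the image, PhoBERT encodes the question
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    phobert = AutoModel.from_pretrained("vinai/phobert-base")
    tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

    # Encode a Vietnamese question into contextual features
    tokens = tokenizer("cái gì trên đỉnh đồi cỏ", return_tensors="pt")
    question_features = phobert(**tokens).last_hidden_state  # (1, seq_len, 768)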