Improve on research done on the multimodal task of Diagram Question Answering with the AI2D Diagram Dataset.
- Team Members
- Folder Structure
- Background and Related Works
- How to Run
- Architecture
- Results
- Presentation
- Paper
- References
- Licensing
.
├── archive # Experimental scripts
├── assets # For storing other supporting assets like images, logos, gifs, etc.
├── documents # Research documents
├── example_data # Sample of dataset
│
├── src # Main code directory
│ ├── models # Directory for models
│ ├── tests # Directory for doing R&D and code testings
│ ├── utils # Directory for utility functions to support project
│
└── requirements.txt # Python package requirements
Published on 24 Mar 2016, A Diagram Is Worth A Dozen Images set out to "study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships".
Their solution was to create Diagram Parse Graphs (DPGs), which they used to model the structure of diagrams. A DPG is a directed graph whose nodes represent diagram elements and whose edges represent the relationships between those elements. Derived from the diagrams, the DPGs were combined with the natural language question to generate the final answer.
The paper introduced a unique architecture that was purpose-built for DQA. Compared to more general architectures, it achieved a 7% improvement in accuracy. It should be noted that the random-chance baseline on this multiple-choice task is 25% accuracy, and the best DPG-based model, while well above chance, left the task far from solved.
Solving Diagram Question Answering (DQA) is a multimodal task that requires the model to understand the diagram and the question simultaneously. The general approach is:
- Extract features from the diagram and question
- Combine the features to generate a final representation
- Use the final representation to predict the answer
A DQA system can be seen as an algorithm that takes an image and a natural language question about that image as input and generates a natural language answer as output.
A good DQA system must be capable of solving a broad spectrum of typical NLP and CV tasks, as well as reasoning about image content. It is clearly a multi-discipline AI research problem involving computer vision (CV), natural language processing (NLP), and Knowledge Representation & Reasoning (KR).
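To make this input/output contract concrete, here is a minimal, hypothetical sketch in Python. The DiagramQuestion dataclass and the answer_diagram_question stub are illustrative names only, not part of this repository's code.

```python
# Illustrative sketch only: the input/output contract of a DQA system.
# The names below (DiagramQuestion, answer_diagram_question) are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class DiagramQuestion:
    image_path: str      # diagram image
    question: str        # natural language question about the diagram
    choices: List[str]   # AI2D questions are multiple choice

def answer_diagram_question(example: DiagramQuestion) -> str:
    """Extract diagram and question features, fuse them, and predict an answer."""
    raise NotImplementedError  # model-specific; see the Architecture section

example = DiagramQuestion(
    image_path="path/to/diagram.png",
    question="Which organism is a producer?",
    choices=["grass", "hawk", "frog", "snake"],
)
```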
- We initially wanted to improve on the original paper by trying out more generalizable models and different approaches to the problem. We were unfamiliar with building graph neural networks (the basis of DPGs), so we set out to solve the problem with tools more familiar to us.
- We used a transformer-based model to extract and combine features from the diagram and the question; the same model also predicts the answer (see the sketch below).
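The sketch below illustrates this approach with Hugging Face's VisualBertForMultipleChoice (one of the models referenced at the end of this README). It is not the project's exact training code: the diagram features are stood in for by a random tensor (in practice they would come from an image backbone), and the question and answer choices are made-up examples.

```python
# Hedged sketch (not this project's exact code): scoring four answer candidates
# for a diagram question with Hugging Face's VisualBertForMultipleChoice.
import torch
from transformers import BertTokenizer, VisualBertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

question = "Which organism is a producer?"
choices = ["grass", "hawk", "frog", "snake"]

# Tokenize one (question, choice) pair per answer candidate, then reshape the
# text inputs to (batch_size=1, num_choices, seq_len).
encoding = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

# Stand-in diagram features: (batch, num_choices, num_regions, visual_dim).
# 512 matches this checkpoint's visual_embedding_dim.
visual_embeds = torch.randn(1, len(choices), 10, 512)
inputs["visual_embeds"] = visual_embeds
inputs["visual_attention_mask"] = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
inputs["visual_token_type_ids"] = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, num_choices)
predicted_answer = choices[logits.argmax(-1).item()]
```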
This program is intended to be run on an EC2 instance running Ubuntu. The following instructions assume a fresh install.
- The dataset is hosted on AWS S3. You may need AWS credentials to download it.
- Python 3.9 or higher is required.
- Clone the repo and enter the directory:
git clone https://github.com/alexiskaldany/CAP22FA.git
cd CAP22FA
- Create the virtual environment and install the dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
- Run prepare_and_download.py to download the dataset and prepare the data for training. A data folder will be created inside src, where all the data will be stored. This script also triggers the annotation scripts and takes care of all preprocessing.
python3 ./src/utils/prepare_and_download.py
There are small differences in the way paths work on different operating systems; efforts have been taken to ameliorate this.
Execute model training for a specified model setup type:
cd src/models/
python3 model_training.py
or, to execute it as a background process:
cd src/models/
nohup python3 model_training.py &
ps ax | grep "model_training.py"
Generate the specified plot types for the training and testing results:
cd src/models/
python3 plot_model_results.py
@article{Kembhavi2016ADI,
title={A Diagram is Worth a Dozen Images},
author={Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
journal={ArXiv},
year={2016},
volume={abs/1603.07396}
}
AI2 Diagram Dataset (AI2D) was accessed on 9/5/2022 from https://registry.opendata.aws/allenai-diagrams.
- Paper Code
- VisualBERT: A Simple and Performant Baseline for Vision and Language. Li et al., 2019
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Kim et al., 2021
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. Huang et al., 2022
- VisualBERT for Multiple Choice Hugging Face
- VisualBERT for Question Answering Hugging Face
- VILT for Question Answering Hugging Face
- LayoutLMv3 Hugging Face
- VisualBERT Demo
- BERT Multiple Choice Sample
- Fine Tuning on Multiple Choice Task
- Hugging Face
- PyTorch AdamW Optimizer
- PyTorch Cross Entropy
- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325, 2015.
- Tryolabs
- A Diagram is Worth a Dozen Images
- MIT License