Improve on research done on the multimodal task of Diagram Question Answering with the AI2D Diagram Dataset.
- Team Members
- Folder Structure
- Background and Related Works
- How to Run
- Architecture
- Results
- Presentation
- Paper
- References
- Licensing
.
├── archive # Experimental scripts
├── assets # For storing other supporting assets like images, logos, gifs, etc.
├── documents # Research documents
├── example_data # Sample of dataset
│
├── src # Main code directory
│ ├── models # Directory for models
│ ├── tests # Directory for doing R&D and code testings
│ ├── utils # Directory for utility functions to support project
│
└── requirements.txt # Python package requirements
Published on 24 Mar 2016, A Diagram Is Worth A Dozen Images set out to "study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships".
Their solution was to create Diagram Parse Graphs (DPGs), which they used to model the structure of diagrams. A DPG is a directed graph whose nodes represent diagram elements and whose edges represent the relationships between those elements. Derived from the diagrams, the DPGs were combined with the natural language question to generate the final answer.
The paper introduced a unique architecture that was purpose-built for DQA. Compared to more general architectures, it achieved a 7% improvement in accuracy. It should be noted that the random-chance baseline on this multiple-choice task is 25% accuracy, and the best DPG-based model, while well above chance, left the task far from solved.
Solving Diagram Question Answering (DQA) is a multimodal task that requires the model to understand the diagram and the question simultaneously. The general approach is:
- Extract features from the diagram and question
- Combine the features to generate a final representation
- Use the final representation to predict the answer
A DQA system can be seen as an algorithm that takes an image and a natural language question about that image as input and generates a natural language answer as output.
A good DQA system must be capable of solving a broad spectrum of typical NLP and CV tasks, as well as reasoning about image content. It is clearly a multi-discipline AI research problem involving computer vision (CV), natural language processing (NLP), and Knowledge Representation & Reasoning (KR).
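To make this input/output contract concrete, here is a minimal, hypothetical sketch in Python. The DiagramQuestion dataclass and the answer_diagram_question stub are illustrative names only, not part of this repository's code.

```python
# Illustrative sketch only: the input/output contract of a DQA system.
# The names below (DiagramQuestion, answer_diagram_question) are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class DiagramQuestion:
    image_path: str      # diagram image
    question: str        # natural language question about the diagram
    choices: List[str]   # AI2D questions are multiple choice

def answer_diagram_question(example: DiagramQuestion) -> str:
    """Extract diagram and question features, fuse them, and predict an answer."""
    raise NotImplementedError  # model-specific; see the Architecture section

example = DiagramQuestion(
    image_path="path/to/diagram.png",
    question="Which organism is a producer?",
    choices=["grass", "hawk", "frog", "snake"],
)
```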
- We initially wanted to improve on the original paper by trying out more generalizable models and different approaches to the problem. We were unfamiliar with building graph neural networks (the basis of DPGs), so we set out to solve the problem with tools more familiar to us.
- We used a transformer-based model to extract and combine features from the diagram and the question; the same model also predicts the answer (see the sketch below).
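The sketch below illustrates this approach with Hugging Face's VisualBertForMultipleChoice (one of the models referenced at the end of this README). It is not the project's exact training code: the diagram features are stood in for by a random tensor (in practice they would come from an image backbone), and the question and answer choices are made-up examples.

```python
# Hedged sketch (not this project's exact code): scoring four answer candidates
# for a diagram question with Hugging Face's VisualBertForMultipleChoice.
import torch
from transformers import BertTokenizer, VisualBertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

question = "Which organism is a producer?"
choices = ["grass", "hawk", "frog", "snake"]

# Tokenize one (question, choice) pair per answer candidate, then reshape the
# text inputs to (batch_size=1, num_choices, seq_len).
encoding = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

# Stand-in diagram features: (batch, num_choices, num_regions, visual_dim).
# 512 matches this checkpoint's visual_embedding_dim.
visual_embeds = torch.randn(1, len(choices), 10, 512)
inputs["visual_embeds"] = visual_embeds
inputs["visual_attention_mask"] = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
inputs["visual_token_type_ids"] = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, num_choices)
predicted_answer = choices[logits.argmax(-1).item()]
```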
This program is intended to be run on an EC2 instance running Ubuntu. The following instructions assume a fresh install.
- The dataset is hosted on AWS S3. You may need AWS credentials to download it.
- Python 3.9 or higher is required.
- Clone the repo and enter the directory:
git clone https://github.com/alexiskaldany/CAP22FA.git
cd CAP22FA
- Create the virtual environment and install the dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
- Run prepare_and_download.py to download the dataset and prepare the data for training. A data folder will be created inside src, where all the data will be stored. This script also triggers the annotation scripts and takes care of all preprocessing.
python3 ./src/utils/prepare_and_download.py
There are small differences in the way paths work on different operating systems; efforts have been taken to ameliorate this.
Execute model training for a specified model setup type:
cd src/models/
python3 model_training.py
or, to execute it as a background process:
cd src/models/
nohup python3 model_training.py &
ps ax | grep "model_training.py"
Generate the specified plot types for the training and testing results:
cd src/models/
python3 plot_model_results.py
@article{Kembhavi2016ADI,
title={A Diagram is Worth a Dozen Images},
author={Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
journal={ArXiv},
year={2016},
volume={abs/1603.07396}
}
AI2 Diagram Dataset (AI2D) was accessed on 9/5/2022 from https://registry.opendata.aws/allenai-diagrams.
- Paper Code
- VisualBERT: A Simple and Performant Baseline for Vision and Language. Li et al., 2019
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Kim et al., 2021
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. Huang et al., 2022
- VisualBERT for Multiple Choice Hugging Face
- VisualBERT for Question Answering Hugging Face
- VILT for Question Answering Hugging Face
- LayoutLMv3 Hugging Face
- VisualBERT Demo
- BERT Multiple Choice Sample
- Fine Tuning on Multiple Choice Task
- Hugging Face
- PyTorch AdamW Optimizer
- PyTorch Cross Entropy
- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325, 2015.
- Tryolabs
- A Diagram is Worth a Dozen Images
- MIT License