/CAP22FA

Fall '22 Capstone Project, by Josh Ting and Alexis Kaldany

George Washington University, Capstone - Fall 2022

sample_diagram

Project Description

Improve on prior research [1] on the multimodal task of Diagram Question Answering (DQA) using the AI2 Diagram Dataset (AI2D) [2].

Table of Contents

  1. Team Members
  2. Folder Structure
  3. Background and Related Works
  4. How to Run
  5. Architecture
  6. Results
  7. Presentation
  8. Paper
  9. References
  10. Licensing

Team Members

Folder Structure

.
├── archive                 # Experimental scripts
├── assets                  # For storing other supporting assets like images, logos, gifs, etc.
├── documents               # Research documents
├── example_data            # Sample of dataset 
│ 
├── src                     # Main code directory
│   ├── models              # Directory for models
│   ├── tests               # Directory for doing R&D and code testings
│   ├── utils               # Directory for utility functions to support project
│ 
└── requirements.txt        # Python package requirements

Background and Related Works

Original Paper

sample_diagram

Published on 24 Mar 2016, A Diagram Is Worth A Dozen Images set out to "study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships". [20]

Their solution was to create Diagram Parse Graphs (DPGs), which they used to model the structure of diagrams. A DPG is a directed graph whose nodes represent diagram elements and whose edges represent relationships between those elements. The DPGs derived from the diagrams were combined with the natural language question to generate the final answer. [20]
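As an illustration of the data structure only, a DPG can be sketched as a small directed graph. The class name, element types and relation names below are hypothetical and are not taken from the paper or from this repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DiagramParseGraph:
    """Toy directed graph: nodes are diagram elements, edges are relationships."""

    # node id -> element type (hypothetical types such as "blob" or "text")
    nodes: Dict[str, str] = field(default_factory=dict)
    # (source id, target id, relation name)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, kind: str) -> None:
        self.nodes[node_id] = kind

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Both endpoints must already exist as diagram elements.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, target, relation))

    def successors(self, node_id: str) -> List[str]:
        """Return the targets of all edges leaving node_id, in insertion order."""
        return [t for s, t, _ in self.edges if s == node_id]


# A tiny food-web style diagram (made-up example data).
dpg = DiagramParseGraph()
dpg.add_node("grass", "blob")
dpg.add_node("rabbit", "blob")
dpg.add_node("label_rabbit", "text")
dpg.add_edge("grass", "rabbit", "arrow_to")
dpg.add_edge("label_rabbit", "rabbit", "labels")
print(dpg.successors("grass"))
```

A question such as "what eats the grass?" could then be answered by following the outgoing edges of the `grass` node.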

This paper introduced an architecture that was purpose-built for DQA. Compared to more general architectures, it achieved a 7% improvement in accuracy. It should be noted that the baseline is 25% accuracy, and the best model (DPGs) received [20]
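For context, a 25% baseline is what random guessing yields when each question has four answer choices. The four-choice reading is an assumption here, and the helper names below are hypothetical. A minimal sketch of how accuracy and that baseline can be computed:

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of questions where the predicted choice index matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def random_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    return 1.0 / num_choices


# Assumption: four answer choices per question.
print(random_baseline(4))
print(accuracy([0, 1, 2, 3], [0, 1, 0, 3]))
```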

Diagram Question Answering

Solving Diagram Question Answering (DQA) is a multimodal task that requires the model to understand the diagram and the question simultaneously. The general approach is:

  • Extract features from the diagram and question
  • Combine the features to generate a final representation
  • Use the final representation to predict the answer

A DQA system can be seen as an algorithm that takes an image and a natural language question about that image as input and generates a natural language answer as output.
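The three steps above can be sketched as a small interface. This is a toy skeleton, not the project's model: the class name, the placeholder encoders and the scoring function are all hypothetical, and concatenation is just one simple way to combine features.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class DQAPipeline:
    """Extract features, combine them, then predict an answer."""

    encode_diagram: Callable[[bytes], Vector]      # placeholder image encoder
    encode_question: Callable[[str], Vector]       # placeholder text encoder
    score_choice: Callable[[Vector, str], float]   # placeholder answer scorer

    def answer(self, diagram: bytes, question: str, choices: Sequence[str]) -> str:
        # 1. Extract features from the diagram and the question.
        diagram_features = self.encode_diagram(diagram)
        question_features = self.encode_question(question)
        # 2. Combine the features (here: plain concatenation).
        fused = diagram_features + question_features
        # 3. Use the fused representation to pick the best-scoring choice.
        return max(choices, key=lambda c: self.score_choice(fused, c))


# Stand-in components so the skeleton runs end to end; they carry no meaning.
pipeline = DQAPipeline(
    encode_diagram=lambda image: [float(len(image))],
    encode_question=lambda text: [float(len(text.split()))],
    score_choice=lambda fused, choice: -abs(len(choice) - sum(fused)),
)
print(pipeline.answer(b"abc", "what eats the frog", ["snake", "sunlight", "grasshopper"]))
```

In a real system each placeholder would be replaced by a trained component, and a single multimodal transformer can play all three roles at once.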

A good DQA system must be capable of solving a broad spectrum of typical NLP and CV tasks, as well as reasoning about image content. It is clearly a multidisciplinary AI research problem, involving CV, NLP and Knowledge Representation & Reasoning (KR). [19]

What We Did

sample_diagram

  • We initially wanted to improve on the original paper by trying out more generalizable models and different approaches to the problem. We were unfamiliar with building graph neural networks or "DPGs", so we wanted to attempt to solve this problem with tools more familiar to us.

  • We used a transformer-based model to extract and combine features from the diagram and question. The same model also predicts the answer.
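To illustrate the prediction step, the sketch below assumes the model emits one score (logit) per answer choice, as multiple-choice heads typically do. It is not the project's training code; the function names are hypothetical and only the standard library is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn per-choice scores into probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the correct choice."""
    return -math.log(softmax(logits)[target])


def predict(logits: Sequence[float]) -> int:
    """Index of the highest-scoring answer choice."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for one question with four answer choices.
example_logits = [0.1, 2.0, -1.0, 0.3]
print(predict(example_logits))
print(cross_entropy(example_logits, target=1))
```

During training the loss is minimized so that the correct choice receives the highest score; at inference only `predict` is needed.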

How to Run

This program is intended to be run on an EC2 configured with Ubuntu. The following instructions assume a fresh install.

Requirements

  • The dataset is contained on AWS S3. You may need AWS keys to download the dataset.
  • Python 3.9 or higher.
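A quick way to confirm the interpreter meets the version requirement before installing anything. The function name is hypothetical and this check is not part of the repository.

```python
import sys

MIN_VERSION = (3, 9)  # minimum version stated in the requirements above


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]} or higher is required.")
```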

Setup

  1. Clone the repo and enter the directory.
    git clone https://github.com/alexiskaldany/CAP22FA.git
    cd CAP22FA
  2. Create the virtual environment and install the dependencies.
    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt
  3. Run prepare_and_download.py to download the dataset and prepare the data for training. A data folder will be created inside src where all the data will be stored. This script also triggers the annotation scripts and takes care of all preprocessing.
    python3 ./src/utils/prepare_and_download.py

Issues

There are small differences in the way paths work on different operating systems. Efforts have been made to mitigate this.
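One common way to avoid path separator problems is to build paths with `pathlib` instead of joining strings by hand. The sketch below is a general illustration, not the repository's own code; `data_dir` is a hypothetical helper that points at the data folder created inside src.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def data_dir(repo_root: Path) -> Path:
    """Build the data folder path from parts, without hard-coded separators."""
    return repo_root / "src" / "data"


# The same parts render with the separator of the target operating system.
print(PurePosixPath("src", "data"))
print(PureWindowsPath("src", "data"))
print(data_dir(Path("CAP22FA")))
```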

Modeling

Execute model training for a specified model setup type:

cd src/models/
python3 model_training.py

Or, to run it as a background process:

cd src/models/
nohup python3 model_training.py &
ps ax | grep "model_training.py"

Plot Results

Execute specified plot types for training and testing results:

cd src/models/
python3 plot_model_results.py

Architecture

Environment Architecture

sample_diagram sample_diagram

Model Architecture

sample_diagram sample_diagram

Results

sample_diagram

Presentation

Final Presentation Slides

Paper

Final Paper

References

  1. Github Repo
  2. A Diagram is Worth a Dozen Images
@article{Kembhavi2016ADI,
  title={A Diagram is Worth a Dozen Images},
  author={Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.07396}
}
  3. AI2 Diagram Dataset (AI2D)
AI2 Diagram Dataset (AI2D) was accessed on 9/5/2022 from https://registry.opendata.aws/allenai-diagrams.
  4. Paper Code
  5. VisualBERT: A Simple and Performant Baseline for Vision and Language. Li et al., 2019
  6. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Kim et al., 2021
  7. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. Huang et al., 2022
  8. VisualBERT for Multiple Choice Hugging Face
  9. VisualBERT for Question Answering Hugging Face
  10. ViLT for Question Answering Hugging Face
  11. LayoutLMv3 Hugging Face
  12. VisualBERT Demo
  13. BERT Multiple Choice Sample
  14. Fine Tuning on Multiple Choice Task
  15. Hugging Face
  16. PyTorch AdamW Optimizer
  17. PyTorch Cross Entropy
  18. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  19. Tryolabs
  20. A Diagram is Worth a Dozen Images

Licensing

  • MIT License