This repository provides a GNN-based implementation that reasons over both input modalities to improve performance on a VQA dataset. The model receives an image *im* and a text-based question *t*, and outputs the answer to the question *t*. The GNN model follows the approach below for the Visual Question Answering task:
- Processing the input question into a text graph Gt and the image into an image graph Gim using the Graph Parser.
- Passing the text graph Gt and image graph Gim into a graph neural network (GNN) to get the text and image node embeddings.
- Combining the embeddings using the Graph Matcher, which projects the text embeddings into the image embedding space and returns the combined multimodal representation of the input.
- Passing the joint representation through a sequence-to-sequence model to output the answer to the question (a minimal sketch of this pipeline is shown below).
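
As a rough illustration of this pipeline, the sketch below wires a parsed question graph and image graph through two GNN encoders and a matcher using PyTorch Geometric. All class names, dimensions, and the attention-style fusion are illustrative assumptions for this sketch, not the repository's actual API:

```python
# Minimal, self-contained sketch of the pipeline described above.
# All class names and dimensions are illustrative assumptions, not the repo's code.
import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv


class TextGNN(nn.Module):
    """Encodes the question graph Gt into node embeddings."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, hid_dim)

    def forward(self, data):
        x = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(x, data.edge_index)


class ImageGNN(TextGNN):
    """Same architecture, applied to the image graph Gim."""


class GraphMatcher(nn.Module):
    """Projects text node embeddings into the image embedding space and
    fuses the two graphs into one joint multimodal representation."""
    def __init__(self, hid_dim):
        super().__init__()
        self.proj = nn.Linear(hid_dim, hid_dim)

    def forward(self, text_h, img_h):
        text_in_img_space = self.proj(text_h)                       # [N_t, d]
        scores = torch.softmax(text_in_img_space @ img_h.t(), -1)   # soft correspondences
        matched = scores @ img_h                                    # [N_t, d]
        return torch.cat([text_in_img_space, matched], dim=-1)      # joint representation


# Toy graphs standing in for the Graph Parser's output.
q_graph = Data(x=torch.randn(5, 32), edge_index=torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]]))
im_graph = Data(x=torch.randn(8, 32), edge_index=torch.tensor([[0, 1, 2], [3, 4, 5]]))

text_h = TextGNN(32, 64)(q_graph)       # text node embeddings
img_h = ImageGNN(32, 64)(im_graph)      # image node embeddings
joint = GraphMatcher(64)(text_h, img_h)
print(joint.shape)                      # torch.Size([5, 128]) -> fed to the seq2seq answer decoder
```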
- PyTorch
- PyTorch Geometric
- NumPy
- rsmlkit
- This implementation trains the GNN model on the CLEVR dataset (a diagnostic dataset of 3D shapes that tests visual and linguistic reasoning), which can be downloaded from here.
- To pre-process and prepare the dataset for training, run `Dataset.py`.
- To see the GNN-based model implementation, check `Model.py`.
- `Match.py` is responsible for matching nodes locally via a graph neural network and then updating the correspondence scores iteratively (see the sketch after this list).
- To see the RNN-based Encoder-Decoder implementation and how it interacts with the GNN model, check `Encoder_Decoder.py` (a second sketch after this list illustrates that interaction).
- `Parser.py` is responsible for instantiating the models as well as loading and saving model checkpoints.
- To train the whole model pipeline, run `Train.py`.
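
One plausible reading of the `Match.py` description (local node matching with a GNN, followed by iterative refinement of correspondence scores) is sketched below; the specific update rule and the `IterativeMatcher` name are assumptions made for illustration only:

```python
# Hedged sketch of iterative correspondence refinement; the update rule below
# is an assumption for illustration, not the code in Match.py.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv


class IterativeMatcher(nn.Module):
    def __init__(self, dim, steps=3):
        super().__init__()
        self.gnn = GCNConv(dim, dim)    # local matching GNN, shared across both graphs
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.steps = steps

    def forward(self, h_t, edge_t, h_im, edge_im):
        # Local node embeddings for the text and image graphs.
        h_t = torch.relu(self.gnn(h_t, edge_t))
        h_im = torch.relu(self.gnn(h_im, edge_im))
        # Initial correspondence scores from node-wise similarity.
        S = torch.softmax(h_t @ h_im.t(), dim=-1)            # [N_t, N_im]
        for _ in range(self.steps):
            # Pull image features toward each text node under the current scores,
            # refine the text embeddings, then recompute the correspondence scores.
            pulled = S @ h_im                                 # [N_t, dim]
            h_t = h_t + self.mlp(torch.cat([h_t, pulled], dim=-1))
            S = torch.softmax(h_t @ h_im.t(), dim=-1)
        return S                                              # soft node correspondences


matcher = IterativeMatcher(dim=64)
S = matcher(torch.randn(5, 64), torch.tensor([[0, 1], [1, 2]]),
            torch.randn(8, 64), torch.tensor([[0, 1, 2], [3, 4, 5]]))
print(S.shape)  # torch.Size([5, 8])
```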
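
Similarly, the interaction between the GNN output and the RNN-based Encoder-Decoder might look like the following sketch, where the pooled joint representation initializes a GRU decoder; the interface and names (`AnswerDecoder`, hidden sizes) are assumptions rather than the contents of `Encoder_Decoder.py`:

```python
# Illustrative sketch of an RNN answer decoder conditioned on the GNN output;
# using the pooled joint vector as the initial hidden state is an assumption.
import torch
import torch.nn as nn


class AnswerDecoder(nn.Module):
    def __init__(self, vocab_size, joint_dim, hid_dim=256):
        super().__init__()
        self.init_h = nn.Linear(joint_dim, hid_dim)     # map joint representation to h_0
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, joint, answer_tokens):
        # joint: [N_t, joint_dim] node-level multimodal features -> pool to one vector.
        h0 = torch.tanh(self.init_h(joint.mean(dim=0, keepdim=True))).unsqueeze(0)  # [1, 1, hid]
        emb = self.embed(answer_tokens).unsqueeze(0)    # [1, T, hid]
        out, _ = self.gru(emb, h0)
        return self.out(out)                            # [1, T, vocab] logits


decoder = AnswerDecoder(vocab_size=100, joint_dim=128)
logits = decoder(torch.randn(5, 128), torch.tensor([1, 7, 2]))  # teacher-forced toy decode
print(logits.shape)  # torch.Size([1, 3, 100])
```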
The predicted answer to each question, alongside its corresponding image, can be seen in the following attached output images: