Multimodal models for vision and language tasks

Details of interesting multimodal architectures for vision and language tasks

To do: reproduce or write code and documentation for the following: Faster R-CNN, Mask R-CNN, BERT, masked language modelling, masked region modelling, masked object classification

Visual Question Answering (VQA)
Visual Commonsense Reasoning (VCR)
Image captioning / summarization
Image text matching
Visual Linguistic Matching
VL-BERT
ViLBERT
Unicoder-VL
UNITER
VisualBERT
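
As a possible starting point for the masked language modelling item in the to-do above, here is a minimal sketch of BERT-style token masking in plain Python. It assumes the commonly described scheme: about 15% of positions are selected for prediction, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% are left unchanged. The function name `mask_tokens`, its parameters, and the toy vocabulary are hypothetical and not taken from this repository.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token; the name follows BERT's convention


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt one token sequence for masked language modelling.

    Returns (corrupted, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


if __name__ == "__main__":
    sentence = "a dog chases the red ball across the park".split()
    toy_vocab = sorted(set(sentence)) + ["cat", "tree"]
    corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

A model trained on this objective would be asked to predict `labels[i]` from `corrupted` at every position where the label is not `None`. Passing a seeded `random.Random` keeps the corruption reproducible across runs.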

Srigowri/MultiModal-Vision-NLP
