/MultiModal-Vision-NLP

Details of interesting multimodal architectures for vision and language

Multimodal Models for Vision and Language Tasks

To do: reproduce/write code and documentation for the following:

  0. Faster R-CNN, Mask R-CNN, BERT, Masked Language Modelling, Masked Region Modelling, Masked Object Classification
  1. Visual Question Answering (VQA)
  2. Visual Commonsense Reasoning (VCR)
  3. Image Captioning / Summarization
  4. Image-Text Matching
  5. Visual-Linguistic Matching
  6. VL-BERT
  7. ViLBERT
  8. Unicoder-VL
  9. UNITER
  10. VisualBERT