Details of interesting multimodal architectures for vision and language
To do: reproduce/write code and documentation for the following:
0. Faster R-CNN, Mask R-CNN, BERT, masked language modelling, masked region modelling, masked object classification
- Visual Question Answering (VQA)
- Visual Commonsense Reasoning (VCR)
- Image captioning / summarization
- Image-text matching
- Visual Linguistic Matching
- VL-BERT
- ViLBERT
- Unicoder-VL
- UNITER
- VisualBERT
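
As a possible starting point for the masked language modelling item in the to-do list above, here is a minimal sketch of BERT-style token masking in plain Python. It assumes the scheme from the BERT paper: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% are left unchanged. The function name `mask_tokens`, its parameters, and the toy vocabulary are illustrative assumptions, not code from any of the models listed.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking to a list of string tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at positions selected for prediction and None everywhere else.
    All names here are hypothetical, chosen for illustration only.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Select roughly mask_prob of the positions for prediction.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected positions: replace with the mask token.
            masked[i] = MASK_TOKEN
        elif roll < 0.9:
            # 10%: replace with a random token drawn from the vocabulary.
            masked[i] = rng.choice(vocab)
        # Remaining 10%: keep the original token in place.
    return masked, labels


if __name__ == "__main__":
    toy_vocab = ["a", "dog", "cat", "sits", "on", "the", "mat"]
    sentence = ["the", "cat", "sits", "on", "the", "mat"]
    masked, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(masked)
    print(labels)
```

A training loss would then be computed only at positions where `labels[i]` is not `None`. Masked region modelling and masked object classification apply the same idea to image region features instead of word tokens, although the exact masking details differ between the models in the list.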