zchoi/Vision-and-Language-Benchmark
Codebase for vision-and-language research, covering pipelines for various multimodal tasks (e.g., image captioning, VQA, video-text retrieval), customizable datasets (e.g., MS-COCO, ActivityNet, MSR-VTT), and pre-trained model acquisition (e.g., CLIP, BLIP-2).
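
As a minimal sketch of the kind of pre-trained model acquisition described above, the snippet below loads CLIP through the Hugging Face `transformers` library and scores image-text similarity. This is an illustrative assumption about usage, not the repo's own API; the image path and text prompts are placeholders.

```python
# Sketch: acquiring a pre-trained CLIP model and computing image-text
# similarity via Hugging Face `transformers` (assumed setup, not the
# repo's own wrappers).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]

# Tokenize text and preprocess the image into model-ready tensors
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one row of similarity scores per image
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```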