A curated list of Large Language Models ➕ Vision/Speech/Robotics.
## Contents

- [Multimodal LLMs](#multimodal-llms)
- [Video-Language Pretraining](#video-language-pretraining)
  - [Commonly Used Pretraining Tasks](#commonly-used-pretraining-tasks)
  - [Commonly Used Video Corpora for Pretraining](#commonly-used-video-corpora-for-pretraining)
  - [Commonly Used Downstream Tasks](#commonly-used-downstream-tasks)

## Multimodal LLMs
Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|
ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | ViperGPT | | 2303.08128 | Columbia |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | ChatCaptioner | | 2303.06594 | KAUST |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | Visual ChatGPT | | 2303.04671 | Microsoft |
PaLM-E: An Embodied Multimodal Language Model | PaLM | | | 2303.03378 | Google |
Language Is Not All You Need: Aligning Perception with Language Models | Magneto | KOSMOS-1 | | 2302.14045 | Microsoft |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 | BLIP2 | | 2301.12597 | Salesforce |
Language Models are General-Purpose Interfaces | DeepNorm | METALM | | 2206.06336 | Microsoft |
Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NeurIPS 2022 | 2204.14198 | DeepMind |
Learning Transferable Visual Models From Natural Language Supervision | GPT-2-like | CLIP | ICML 2021 | 2103.00020 | OpenAI |

## Video-Language Pretraining
Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | Vid2Seq | | 2302.14115 | Google |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | BERT | | | 2212.14546 | Alibaba |
VindLU: A Recipe for Effective Video-and-Language Pretraining | BERT | VindLU | | 2212.05051 | UNC |
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | BERT | | | 2211.11446 | UW |
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | RoBERTa | | MM 2022 | 2211.03314 | Baidu |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | BERT | | NeurIPS 2022 | 2209.07526 | Microsoft |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | BERT | Clover | | 2207.07885 | ByteDance |
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | BERT-like | LAVENDER | CVPR 2023 | 2206.07160 | Microsoft |
Revealing Single Frame Bias for Video-and-Language Learning | BERT | Singularity | | 2206.03428 | UNC |
Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NeurIPS 2022 | 2204.14198 | DeepMind |
All in One: Exploring Unified Video-Language Pre-training | BERT-like | All-In-One | CVPR 2023 | 2203.07303 | NUS |
End-to-end Generative Pretraining for Multimodal Video Captioning | BERT+GPT-2 | | CVPR 2022 | 2201.08264 | Google |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | BERT-like | ALPRO | CVPR 2022 | 2112.09583 | Salesforce |
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | BERT | VIOLET | | 2111.12681 | Microsoft |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | BERT | VideoCLIP | EMNLP 2021 | 2109.14084 | Facebook |
MERLOT: Multimodal Neural Script Knowledge Models | RoBERTa | MERLOT | NeurIPS 2021 | 2106.02636 | AI2 |
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | BERT | VLP | ACL Findings 2021 | 2105.09996 | Facebook |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | BERT-like | | NeurIPS 2021 | 2104.11178 | Google |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | BERT-like | CLIP4Clip | Neurocomputing 2022 | 2104.08860 | Microsoft |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | BERT | Frozen | ICCV 2021 | 2104.00650 | Oxford |
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | BERT | ClipBert | CVPR 2021 | 2102.06183 | Microsoft |
ActBERT: Learning Global-Local Video-Text Representations | BERT | ActBert | CVPR 2020 | 2011.07231 | Baidu |
Video Understanding as Machine Translation | T5 | | | 2006.07203 | Facebook |
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | BERT | HERO | EMNLP 2020 | 2005.00200 | Microsoft |
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | BERT | UniVL | | 2002.06353 | Microsoft |
Learning Video Representations using Contrastive Bidirectional Transformer | BERT | | | 1906.05743 | Google |
VideoBERT: A Joint Model for Video and Language Representation Learning | BERT | VideoBert (non-official) | ICCV 2019 | 1904.01766 | Google |
### Commonly Used Pretraining Tasks
- Masked Language Modeling (MLM)
- Causal Language Modeling (LM)
- Masked Vision Modeling (MVM)
  - Vision = Frame
  - Vision = Patch
  - Vision = Object
- Video-Language Matching (VLM)
- Video-Language Contrastive (VLC), sketched in code below
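
As a concrete illustration of the VLC objective, here is a minimal PyTorch-style sketch of the symmetric InfoNCE loss used for CLIP-style video-text contrastive pretraining. The function name, tensor shapes, and default temperature are illustrative assumptions, not taken from any specific paper above:

```python
import torch
import torch.nn.functional as F

def video_language_contrastive_loss(video_emb: torch.Tensor,
                                    text_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) tensors where pair i is the positive
    match and every other in-batch pairing serves as a negative.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix, sharpened by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: video -> text and text -> video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

Papers differ in how the two encoders are built and in which extra objectives (MLM, VLM, MVM) are trained jointly, but the contrastive term is typically this loss or a close variant.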
### Commonly Used Video Corpora for Pretraining
Paper | Video Clips | Avg. Duration | Sentences | Domain | Download Link |
---|---|---|---|---|---|
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M | 18s | 2.5M | open | WebVid-2M |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 136M | 4s | 136M | instruction | HowTo100M |
MERLOT: Multimodal Neural Script Knowledge Models | 6M | ~20m | ~720M | open | YT-Temporal-180M |
### Commonly Used Downstream Tasks
Task | Paper | Download Link | Publication |
---|---|---|---|
Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
Open-Ended QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
Open-Ended QA | A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering | LSMDC-FiB | CVPR 2017 |
Open-Ended QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA, MSVD-QA | MM 2017 |
Open-Ended QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
Multiple-Choice QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | |
Multiple-Choice QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
Multiple-Choice QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |
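
Retrieval results on the benchmarks above are conventionally reported as Recall@K (R@1/R@5/R@10), computed from a text-video similarity matrix. A minimal NumPy sketch, assuming ground-truth pairs sit on the diagonal (the function name and toy numbers are illustrative):

```python
import numpy as np

def recall_at_k(sim_matrix: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a (num_texts, num_videos) similarity
    matrix in which text i's ground-truth video is video i."""
    num_queries = sim_matrix.shape[0]
    # Candidates for each query, sorted best-first.
    order = np.argsort(-sim_matrix, axis=1)
    # Rank of the ground-truth video per query (0 = retrieved first).
    gt_rank = np.where(order == np.arange(num_queries)[:, None])[1]
    return {f"R@{k}": float((gt_rank < k).mean()) for k in ks}

# Toy example: 3 queries; query 2 ranks its ground-truth video second.
sims = np.array([[0.9, 0.1, 0.2],
                 [0.3, 0.8, 0.1],
                 [0.2, 0.6, 0.4]])
print(recall_at_k(sims))  # R@1 = 2/3, R@5 = R@10 = 1.0
```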
Feel free to open a pull request or drop me an email.