
Learn the colorful world (Vision/Speech/Robotics) from LLMs

MIT License

Awesome-Colorful Large Language Model

A curated list of Large Language Models ➕ Vision/Speech/Robotics.

CONTENTS

VISION

Image Language Model

Reading List

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | ViperGPT | | 2303.08128 | Columbia |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | ChatCaptioner | | 2303.06594 | KAUST |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | Visual ChatGPT | | 2303.04671 | Microsoft |
| PaLM-E: An Embodied Multimodal Language Model | PaLM | | | 2303.03378 | Google |
| Language Is Not All You Need: Aligning Perception with Language Models | Magneto | KOSMOS-1 | | 2302.14045 | Microsoft |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 | BLIP2 | | 2301.12597 | Salesforce |
| Language Models are General-Purpose Interfaces | DeepNorm | METALM | | 2206.06336 | Microsoft |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NIPS 2022 | 2204.14198 | DeepMind |
| Learning Transferable Visual Models From Natural Language Supervision | Bert | CLIP | ICML 2021 | 2103.00020 | OpenAI |

Video Language Model

Reading List

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | Vid2Seq | | 2302.14115 | Google |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Bert | | | 2212.14546 | Alibaba |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Bert | VindLU | | 2212.05051 | UNC |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Bert | | | 2211.11446 | UW |
| CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | Roberta | | MM 2022 | 2211.03314 | Baidu |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Bert | | NIPS 2022 | 2209.07526 | Microsoft |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Bert | Clover | | 2207.07885 | Bytedance |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Bert-like | LAVENDER | CVPR 2023 | 2206.07160 | Microsoft |
| Revealing Single Frame Bias for Video-and-Language Learning | Bert | Singularity | | 2206.03428 | UNC |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NIPS 2022 | 2204.14198 | DeepMind |
| All in One: Exploring Unified Video-Language Pre-training | Bert-like | All-In-One | CVPR 2023 | 2203.07303 | NUS |
| End-to-end Generative Pretraining for Multimodal Video Captioning | Bert+GPT2 | | CVPR 2022 | 2201.08264 | Google |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Bert-like | ALPRO | CVPR 2022 | 2112.09583 | Salesforce |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling V2 | Bert | VIOLET | | 2111.12681 | Microsoft |
| VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Bert | VideoCLIP | EMNLP 2021 | 2109.14084 | Facebook |
| MERLOT: Multimodal Neural Script Knowledge Models V2 | Roberta | MERLOT | NIPS 2021 | 2106.02636 | AI2 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | Bert | VLP | ACL Findings 2021 | 2105.09996 | Facebook |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Bert-like | | NIPS 2021 | 2104.11178 | Google |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Bert-like | CLIP4Clip | Neurocomputing 2022 | 2104.08860 | Microsoft |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Bert | Frozen | ICCV 2021 | 2104.00650 | Oxford |
| Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | Bert | ClipBert | CVPR 2021 | 2102.06183 | Microsoft |
| ActBERT: Learning Global-Local Video-Text Representations | Bert | ActBert | CVPR 2020 | 2011.07231 | Baidu |
| Video Understanding as Machine Translation | T5 | | | 2006.07203 | Facebook |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Bert | HERO | EMNLP 2020 | 2005.00200 | Microsoft |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Bert | UniVL | | 2002.06353 | Microsoft |
| Learning Video Representations using Contrastive Bidirectional Transformer | Bert | | | 1906.05743 | Google |
| VideoBERT: A Joint Model for Video and Language Representation Learning | Bert | VideoBert (non-official) | ICCV 2019 | 1904.01766 | Google |

Pretraining Tasks

Commonly Used Pretraining Tasks (a minimal PyTorch sketch of the MLM and VLC objectives follows this list)
  • Masked Language Modeling (MLM)
  • Causal Language Modeling (LM)
  • Masked Vision Modeling (MVM)
    • Vision = Frame
    • Vision = Patch
    • Vision = Object
  • Video Language Matching (VLM)
  • Video Language Contrastive (VLC)
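
Below is a minimal PyTorch sketch of two of these objectives, Masked Language Modeling and Video-Language Contrastive learning. The function names, tensor shapes, and random toy inputs are illustrative assumptions, not the implementation of any paper listed above.

```python
# Illustrative sketch of the MLM and VLC objectives; shapes and names are assumptions.
import torch
import torch.nn.functional as F


def mlm_loss(token_logits, target_ids, masked_positions):
    """MLM: cross-entropy computed on masked token positions only.

    token_logits:     (batch, seq_len, vocab_size) predictions from the text model.
    target_ids:       (batch, seq_len) original token ids before masking.
    masked_positions: (batch, seq_len) boolean mask, True where a token was masked.
    """
    per_token = F.cross_entropy(
        token_logits.flatten(0, 1),      # (batch*seq_len, vocab_size)
        target_ids.flatten(),            # (batch*seq_len,)
        reduction="none",
    )
    mask = masked_positions.flatten().float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


def vlc_loss(video_emb, text_emb, temperature=0.07):
    """VLC: symmetric InfoNCE; row i of video_emb and text_emb form a positive pair.

    video_emb, text_emb: (batch, dim) pooled clip / sentence embeddings.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real encoder outputs.
    logits = torch.randn(2, 8, 30522)
    targets = torch.randint(0, 30522, (2, 8))
    masked = torch.rand(2, 8) < 0.15                              # roughly 15% of tokens masked
    print(mlm_loss(logits, targets, masked))
    print(vlc_loss(torch.randn(2, 256), torch.randn(2, 256)))
```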

Datasets

Commonly Used Video Corpus for Pretraining

| Paper | Video Clips | Clip Duration | Sentences | Domain | Download Link |
| --- | --- | --- | --- | --- | --- |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M | 18s | 2.5M | open | WebVid-2M |
| HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 136M | 4s | 136M | instruction | HowTo100M |
| MERLOT: Multimodal Neural Script Knowledge Models | 6M | ~20m | ~720M | open | YT-Temporal-180M |
Commonly Used Downstream Tasks (a Recall@K scoring sketch for the retrieval benchmarks follows the table)

| Task | Paper | Download Link | Publication |
| --- | --- | --- | --- |
| Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
| Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
| Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| Open-Ended QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
| Open-Ended QA | A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering | LSMDC-FiB | CVPR 2017 |
| Open-Ended QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA, MSVD-QA | MM 2017 |
| Open-Ended QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
| Multiple-Choice QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | |
| Multiple-Choice QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
| Multiple-Choice QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
| Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |
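
For the retrieval benchmarks above (e.g. MSVD, MSR-VTT, DiDeMo), results are typically reported as Recall@K computed from a text-video similarity matrix. The sketch below assumes precomputed, index-aligned embeddings from some dual encoder; it is illustrative, not any benchmark's official evaluation script.

```python
# Minimal text-to-video Recall@K sketch; assumes row i of text_emb and video_emb
# are a matched pair and that embeddings come from a pretrained dual encoder.
import torch
import torch.nn.functional as F


def recall_at_k(text_emb, video_emb, ks=(1, 5, 10)):
    """text_emb, video_emb: (N, dim) tensors of paired query / candidate embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.t()                       # (N, N) cosine similarities
    ranking = sims.argsort(dim=-1, descending=True)       # candidate indices, best first
    ground_truth = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    rank_of_gt = (ranking == ground_truth).float().argmax(dim=-1)  # rank of the correct video
    return {f"R@{k}": (rank_of_gt < k).float().mean().item() for k in ks}


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for real model outputs.
    print(recall_at_k(torch.randn(100, 256), torch.randn(100, 256)))
```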

Tutorials

Other Curated Lists

Speech

Other Curated Lists

Robotics

Other Curated Lists

Related

Contributing

Please feel free to create a pull request or drop me an email.