Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs.


Yunlong Tang1,*, Jing Bi1,*, Siting Xu2,*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Pinxin Liu1, Mingqian Feng1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,†. (*Core Contributors, †Corresponding Authors)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong


📢 News

[07/23/2024]

📢 We've recently updated our survey: "Video Understanding with Large Language Models: A Survey"!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:
✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ Redesigned all figures and tables.

Multiple minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

Table of Contents

Why do we need Vid-LLMs?


😎 Vid-LLMs: Models


📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | project page | RSS |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |
| VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | VideoLLaMA2 | 06/2024 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

🦾 Training-free Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

Captioning and Description

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing web videos using titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans. Multimedia |
| Multi-Source Video Captioning (MSVC) | VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | Link | arXiv |

Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

Question Answering

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |

Video Instruction Tuning

Pretraining Datasets

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |

Fine-tuning Datasets

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |

Video-based Large Language Model Benchmarks

| Title | Date | Code | Venue |
|-------|------|------|-------|
| LVBench: An Extreme Long Video Understanding Benchmark | 06/2024 | code | - |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | ACL 2024 |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | 06/2024 | code | - |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | 06/2024 | code | - |

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you find. Please make sure your pull requests follow the "Title|Model|Date|Code|Venue" format. Thank you for your valuable contributions!
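For illustration, a new table row following the "Title|Model|Date|Code|Venue" format could look like this (the paper, model name, and links shown here are hypothetical placeholders, not a real entry):

```markdown
| Example-LLM: A Hypothetical Video LLM | Example-LLM | 01/2025 | [code](https://github.com/example/example-llm) | arXiv |
```

Keep the paper title, model name, date (MM/YYYY), code or project-page link, and venue in that order so the entry renders correctly in the existing tables.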

🌟 Star History


โ™ฅ๏ธ Contributors