Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs.


Yunlong Tang1,*, Jing Bi1,*, Siting Xu2,*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Pinxin Liu1, Mingqian Feng1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,†. (*Core Contributors, †Corresponding Authors)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong


📢 News

[07/23/2024]

📢 We've recently updated our survey: "Video Understanding with Large Language Models: A Survey"!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:
✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ Redesigned all figures and tables.

Multiple minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

Table of Contents

Why do we need Vid-LLMs?


😎 Vid-LLMs: Models


📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | project page | RSS |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |
| VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | VideoLLaMA2 | 06/2024 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

🦾 Training-free Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

Captioning and Description

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing web videos using titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans. Multimedia |
| Multi-Source Video Captioning (MSVC) | VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | Link | arXiv |

Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

Question Answering

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |

Video Instruction Tuning

Pretraining Datasets

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |

Fine-tuning Datasets

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |

Video-based Large Language Model Benchmarks

| Title | Date | Code | Venue |
|-------|------|------|-------|
| LVBench: An Extreme Long Video Understanding Benchmark | 06/2024 | code | - |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | ACL 2024 |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | 06/2024 | code | - |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | 06/2024 | code | - |

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you find. Please make sure your pull requests follow the "Title|Model|Date|Code|Venue" format. Thank you for your valuable contributions!
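For illustration, a new table row following the "Title|Model|Date|Code|Venue" format could look like this (the paper, model name, and links shown here are hypothetical placeholders, not a real entry):

```markdown
| Example-LLM: A Hypothetical Video LLM | Example-LLM | 01/2025 | [code](https://github.com/example/example-llm) | arXiv |
```

Keep the paper title, model name, date (MM/YYYY), code or project-page link, and venue in that order so the entry renders correctly in the existing tables.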

🌟 Star History


โ™ฅ๏ธ Contributors