Audio-visual large language model (av-LLM) related papers & projects, also extended to multimodal models involving both audio and visual information
- A Survey of Multimodal Large Language Model from A Data-centric Perspective https://arxiv.org/abs/2404.16821
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio https://arxiv.org/abs/2410.12787
- A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities https://arxiv.org/abs/2409.09678
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding https://arxiv.org/abs/2403.15377
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering https://arxiv.org/abs/2411.04933
- Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization https://arxiv.org/abs/2410.06682
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models https://arxiv.org/abs/2406.15704
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages https://arxiv.org/abs/2305.04160
- Audio-visual training for improved grounding in video-text LLMs https://arxiv.org/abs/2407.15046
- Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding (AVicuna) https://arxiv.org/abs/2403.16276
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time https://arxiv.org/abs/2407.01851
- VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset https://arxiv.org/abs/2305.18500
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset https://arxiv.org/abs/2304.08345
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions https://arxiv.org/abs/2105.04489
- Language Is Not All You Need: Aligning Perception with Language Models https://arxiv.org/abs/2302.14045
- ImageBind: One Embedding Space To Bind Them All https://arxiv.org/abs/2305.05665
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding https://arxiv.org/abs/2306.02858
- NExT-GPT: Any-to-Any Multimodal LLM https://arxiv.org/abs/2309.05519
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration https://arxiv.org/abs/2306.09093
- Audio-Visual LLM for Video Understanding https://arxiv.org/abs/2312.06720
- GPT-4V(ision) as A Social Media Analysis Engine https://arxiv.org/abs/2311.07547
- PandaGPT: One Model To Instruction-Follow Them All https://arxiv.org/abs/2305.16355
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning https://arxiv.org/abs/2309.11500