Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
This is the repo for the Video-LLaMA project, which is working on empowering large language models with video and audio understanding capability.
News
- [06.10] NOTE: we have NOT updated the HF demo yet because the whole framework (with audio branch) cannot run normally on A10-24G. The current running demo is still the previous version of Video-LLaMA. We will fix this issue soon.
- [06.08] Release the checkpoints of the audio-supported Video-LLaMA. Documentation and example outputs are also updated.
- [05.22] Interactive demo online: try our Video-LLaMA (with Vicuna-7B as language decoder) at Hugging Face and ModelScope!
- [05.22] Release Video-LLaMA v2 built with Vicuna-7B.
- [05.18] Support video-grounded chat in Chinese:
  - Video-LLaMA-BiLLA: we introduce BiLLa-7B as the language decoder and fine-tune the video-language aligned model (i.e., the stage 1 model) with machine-translated VideoChat instructions.
  - Video-LLaMA-Ziya: same as Video-LLaMA-BiLLA, but the language decoder is changed to Ziya-13B.
- [05.18] Create a Hugging Face repo to store the model weights of all the variants of our Video-LLaMA.
- [05.15] Release Video-LLaMA v2: we use the training data provided by VideoChat to further enhance the instruction-following capability of Video-LLaMA.
- [05.07] Release the initial version of Video-LLaMA, including its pre-trained and instruction-tuned checkpoints.
Introduction
- Video-LLaMA is built on top of BLIP-2 and MiniGPT-4. It is composed of two core components: (1) Vision-Language (VL) Branch and (2) Audio-Language (AL) Branch.
- VL Branch (Visual encoder: ViT-G/14 + BLIP-2 Q-Former)
- A two-layer video Q-Former and a frame embedding layer (applied to the embeddings of each frame) are introduced to compute video representations (a rough sketch of this data flow is given after this list).
- We train VL Branch on the Webvid-2M video caption dataset with a video-to-text generation task. We also add image-text pairs (~595K image captions from LLaVA) into the pre-training dataset to enhance the understanding of static visual concepts.
- After pre-training, we further fine-tune our VL Branch using the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat.
- AL Branch (Audio encoder: ImageBind-Huge)
- A two-layer audio Q-Former and an audio segment embedding layer (applied to the embedding of each audio segment) are introduced to compute audio representations.
- As the audio encoder we use (i.e., ImageBind) is already aligned across multiple modalities, we train the AL Branch on video/image instruction-caption data only, just to connect the output of ImageBind to the language decoder.
- Note that only the Video/Audio Q-Former, positional embedding layers and the linear layers are trainable during cross-modal training.
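To make the data flow concrete, here is a minimal pseudocode sketch of the VL Branch forward pass. The module and variable names (visual_encoder, frame_qformer, frame_pos_embed, video_qformer, llama_proj, etc.) are illustrative assumptions rather than the repo's exact identifiers; the AL Branch follows the same pattern with ImageBind segment features and an audio Q-Former.
import torch

def encode_video(frames, visual_encoder, frame_qformer, frame_pos_embed,
                 video_qformer, video_query_tokens, llama_proj):
    # frames: (batch, num_frames, 3, H, W) tensor of sampled video frames
    b, t = frames.shape[:2]
    # 1) The frozen ViT-G/14 + BLIP-2 Q-Former encode each frame independently.
    frame_feats = visual_encoder(frames.flatten(0, 1))        # (b*t, n_tokens, d)
    frame_queries = frame_qformer(frame_feats)                 # (b*t, n_query, d)
    # 2) A learnable frame (positional) embedding injects temporal order.
    frame_queries = frame_queries.view(b, t, -1, frame_queries.shape[-1])
    pos = frame_pos_embed(torch.arange(t, device=frames.device))   # (t, d)
    frame_queries = frame_queries + pos.view(1, t, 1, -1)
    # 3) The two-layer video Q-Former aggregates all per-frame queries into a
    #    fixed set of video-level query embeddings.
    video_queries = video_qformer(video_query_tokens, frame_queries.flatten(1, 2))
    # 4) A linear projection maps the video queries into the LLaMA embedding
    #    space, where they are prepended to the text prompt as soft tokens.
    return llama_proj(video_queries)
In this sketch only frame_pos_embed, video_qformer, video_query_tokens and llama_proj would be trainable; visual_encoder and frame_qformer stay frozen, matching the note above.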
Example Outputs
- Video with background sound
- Video without sound effects
- Static image
Pre-trained & Fine-tuned Checkpoints
The following checkpoints store learnable parameters (positional embedding layers, Video/Audio Q-former and linear projection layers) only.
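If you want to double-check what one of these files contains, you can load it with PyTorch and list its parameter names. This is only an illustrative check; whether the weights sit at the top level or under a "model" key is an assumption.
import torch

# Load a downloaded checkpoint on CPU and inspect the stored parameters.
ckpt = torch.load("finetune-vicuna7b-v2.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
# Expect only Q-Former, positional-embedding and linear-projection parameters;
# the frozen ViT, ImageBind and LLaMA/Vicuna weights are not stored here.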
Vision-Language Branch
Checkpoint | Link | Note |
---|---|---|
pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-ziya13b-zh | link | Pre-trained with Chinese LLM Ziya-13B |
finetune-ziya13b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
pretrain-billa7b-zh | link | Pre-trained with Chinese LLM BiLLA-7B |
finetune-billa7b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
Audio-Language Branch
Checkpoint | Link | Note |
---|---|---|
pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
Usage
Environment Preparation
First, install ffmpeg.
apt update
apt install ffmpeg
Then, create a conda environment:
conda env create -f environment.yml
conda activate videollama
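After activating the environment, you can quickly confirm that PyTorch sees your GPU and how much memory it offers (the full framework with the audio branch cannot run on an A10-24G, see the note in News). A minimal check:
import torch

# Sanity check: the environment should ship a CUDA-enabled PyTorch build.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB")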
Prerequisites
Before using the repository, make sure you have obtained the following checkpoints:
Pre-trained Language Decoder
- Get the original LLaMA weights in the Hugging Face format by following the instructions here.
- Download the Vicuna delta weights [7B][13B].
- Use the following command to add the delta weights to the original LLaMA weights to obtain the Vicuna weights:
python apply_delta.py \
--base /path/to/llama-13b \
--target /output/path/to/vicuna-13b --delta /path/to/vicuna-13b-delta
Pre-trained Visual Encoder in Vision-Language Branch
- Download the MiniGPT-4 model (trained linear layer) from this link.
Pre-trained Audio Encoder in Audio-Language Branch
- Download the weight of ImageBind from this link.
Download Learnable Weights
Use git-lfs to download the learnable weights of our Video-LLaMA (i.e., positional embedding layer + Q-Former + linear projection layer):
git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series
The above commands download the model weights of all the Video-LLaMA variants. Of course, you can also download only the weights you need. For example, to run the audio-supported Video-LLaMA with Vicuna-7B as the language decoder locally, the following two checkpoints are sufficient:
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth
How to Run Demo Locally
First, set the llama_model, imagebind_ckpt_path, ckpt and ckpt_2 paths in eval_configs/video_llama_eval_withaudio.yaml.
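Optionally, you can verify that every path entered in the config exists before launching the demo. The sketch below is an assumption-laden helper, not part of the repo: it requires PyYAML and simply searches the parsed config for the four fields above, wherever they are nested (note that llama_model may also be a Hugging Face model id rather than a local path).
import os
import yaml

KEYS = {"llama_model", "imagebind_ckpt_path", "ckpt", "ckpt_2"}

def find_paths(node, found):
    # Walk the parsed YAML and collect the checkpoint-path fields at any depth.
    if isinstance(node, dict):
        for key, value in node.items():
            if key in KEYS and isinstance(value, str):
                found[key] = value
            else:
                find_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            find_paths(item, found)

with open("eval_configs/video_llama_eval_withaudio.yaml") as f:
    cfg = yaml.safe_load(f)

found = {}
find_paths(cfg, found)
for key, path in found.items():
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{key}: {path} [{status}]")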
Then run the script:
python demo_audiovideo.py \
--cfg-path eval_configs/video_llama_eval_withaudio.yaml --gpu-id 0
Training
The training of each cross-modal branch (i.e., the VL Branch or the AL Branch) in Video-LLaMA consists of two stages:
- Pre-training on the Webvid-2.5M video-caption dataset and the LLaVA-CC3M image-caption dataset.
- Fine-tuning using the image-based instruction-tuning data from MiniGPT-4/LLaVA and the video-based instruction-tuning data from VideoChat.
1. Pre-training
Data Preparation
Download the metadata and videos by following the instructions in the official GitHub repo of WebVid. The folder structure of the dataset is shown below:
|webvid_train_data
|──filter_annotation
|────0.tsv
|──videos
|────000001_000050
|──────1066674784.mp4

|cc3m
|──filter_cap.json
|──image
|────GCC_train_000000000.jpg
|────...
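Once downloaded, a small script can confirm that your local copies match the layout above. The dataset root paths below are placeholders; adjust them to your setup.
from pathlib import Path

# Placeholder roots; point these at your local copies of the datasets.
webvid_root = Path("webvid_train_data")
cc3m_root = Path("cc3m")

# WebVid: .tsv annotation files plus sharded folders of .mp4 videos.
tsv_files = list((webvid_root / "filter_annotation").glob("*.tsv"))
mp4_files = list((webvid_root / "videos").rglob("*.mp4"))
print(f"WebVid: {len(tsv_files)} annotation files, {len(mp4_files)} videos")

# CC3M (LLaVA subset): one caption file plus a flat folder of images.
print("filter_cap.json present:", (cc3m_root / "filter_cap.json").exists())
images = list((cc3m_root / "image").glob("*.jpg"))
print(f"CC3M: {len(images)} images")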
Script
Configure the checkpoint and dataset paths in video_llama_stage1_pretrain.yaml, then run the script:
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage1_pretrain.yaml
2. Instruction Fine-tuning
Data
For now, the fine-tuning dataset consists of:
- 150K image-based instructions from LLaVA [link]
- 3K image-based instructions from MiniGPT-4 [link]
- 11K video-based instructions from VideoChat [link]
Script
Configure the checkpoint and dataset paths in video_llama_stage2_finetune.yaml, then run the script:
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage2_finetune.yaml
Acknowledgement
We are grateful for the following awesome projects that our Video-LLaMA builds upon:
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- ImageBind: One Embedding Space To Bind Them All
- LLaMA: Open and Efficient Foundation Language Models
- VideoChat: Chat-Centric Video Understanding
- LLaVA: Large Language and Vision Assistant
- WebVid: A Large-scale Video-Text dataset
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
The logo of Video-LLaMA is generated by Midjourney.
Citation
If you find our project useful, please star our repo and cite our paper as follows:
@article{damonlpsg2023videollama,
author = {Zhang, Hang and Li, Xin and Bing, Lidong},
title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
year = 2023,
journal = {arXiv preprint arXiv:2306.02858},
url = {https://arxiv.org/abs/2306.02858}
}