MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Online Demo

Overview

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model processes both temporal visual data and text, which makes it well suited to the complexities of video. It builds on MiniGPT-v2, which translates visual features into the LLM space for single images and achieved strong results on various image-text benchmarks, and extends that model to process a sequence of frames so that it can comprehend videos. MiniGPT4-Video considers not only visual content but also textual conversations, allowing it to answer queries that involve both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. During inference, a speech-to-text model such as Whisper generates subtitles for the video. The video and the subtitles are then passed to MiniGPT4-Video together with the instruction, and the model outputs the answer.
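The inference flow described above pairs each sampled frame with the subtitle text spoken around it. The following is a minimal sketch of that pairing, not the repository's implementation. It assumes subtitle segments carry `start`, `end`, and `text` fields (as Whisper's transcription output does), and it uses a hypothetical `<FRAME_i>` placeholder where the real visual tokens would go; the helper name and the prompt layout are illustrative assumptions.

```python
from typing import Dict, List


def interleave_frames_and_subtitles(
    frame_times: List[float], segments: List[Dict]
) -> str:
    """Build a prompt with one frame placeholder per sampled frame,
    each followed by the subtitle text spoken since the previous frame.

    Hypothetical helper: `<FRAME_i>` stands in for the visual tokens of
    frame i and is not the repository's actual prompt template.
    """
    parts = []
    prev_t = float("-inf")
    for i, t in enumerate(frame_times):
        parts.append(f"<FRAME_{i}>")
        # Collect subtitle segments that start in the window (prev_t, t].
        spoken = [
            s["text"].strip() for s in segments if prev_t < s["start"] <= t
        ]
        if spoken:
            parts.append(" ".join(spoken))
        prev_t = t
    return " ".join(parts)


if __name__ == "__main__":
    demo_segments = [
        {"start": 0.5, "end": 0.9, "text": " hello "},
        {"start": 2.5, "end": 2.9, "text": "world"},
    ]
    print(interleave_frames_and_subtitles([1.0, 2.0, 3.0], demo_segments))
```

In this sketch, segments that start after the last sampled frame are dropped; a real pipeline would need to decide how to handle them.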

🚀 Demo

1. Clone the repository

git clone https://github.com/Vision-CAIR/MiniGPT4-video.git
cd MiniGPT4-video

2. Set up the environment

conda env create -f environment.yml

3. Download the checkpoints

MiniGPT4-Video (Llama2 Chat 7B)	MiniGPT4-Video (Mistral 7B)
Download	Download

4. Run the demo

# Llama2
python minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs/llama2_test_config.yaml
# Mistral
python minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs/mistral_test_config.yaml

Inference

Follow steps 1-3 of the demo above, then replace step 4 with the following:

# Llama2
python minigpt4_video_inference.py --ckpt path_to_video_checkpoint --cfg-path test_configs/llama2_test_config.yaml --video_path path_to_video --question "Your question here" 
# Mistral
python minigpt4_video_inference.py --ckpt path_to_video_checkpoint --cfg-path test_configs/mistral_test_config.yaml --video_path path_to_video --question "Your question here"
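To ask several questions about the same video, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown above (`--ckpt`, `--cfg-path`, `--video_path`, `--question`); the helper name `build_inference_command` and the example questions are illustrative assumptions, and the paths are the same placeholders used in the commands above.

```python
import shlex
from typing import List


def build_inference_command(
    ckpt: str, cfg_path: str, video_path: str, question: str
) -> List[str]:
    """Return the argv list for one run of minigpt4_video_inference.py."""
    return [
        "python", "minigpt4_video_inference.py",
        "--ckpt", ckpt,
        "--cfg-path", cfg_path,
        "--video_path", video_path,
        "--question", question,
    ]


if __name__ == "__main__":
    questions = ["What is happening in this video?", "How many people appear?"]
    for q in questions:
        cmd = build_inference_command(
            ckpt="path_to_video_checkpoint",
            cfg_path="test_configs/llama2_test_config.yaml",
            video_path="path_to_video",
            question=q,
        )
        # Print a copy-pasteable shell command; to execute it instead,
        # pass `cmd` to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when a question contains quotes or spaces. Each invocation starts a new process, so the checkpoint is presumably reloaded for every question.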

🔥 Training

To customize MiniGPT4-Video for your own Video-text dataset

You can find the steps to customize MiniGPT4-Video for your own video-text dataset in Custom_training.md

Training datasets

After downloading the datasets below, go to the dataset configuration folder, minigpt4/configs/datasets, and set the path for each dataset there.
Image text training:
You can find the steps to download the datasets in MiniGPT4.

LAION
Conceptual Captions
SBU

Video text training:

The annotation files for the video-text datasets are available for download.

Model training:

You can edit the number of GPUs in each script.sh below.

Stage 1 (image text pretraining)

You can directly download the pretrained MiniGPT4 checkpoint aligned with Llama2.

Or train by yourself:

# pretrain
# Llama2
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_llama2_image.yaml
# Mistral
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_mistral_image.yaml

# align
# To launch the second stage alignment, first specify the path to the checkpoint file trained in pretrain stage.
# Llama2
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_llama2_image_align.yaml
# Mistral
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_mistral_image_align.yaml

You can download our trained weights for this stage: Llama2, Mistral.

Stage 2 (video captioning pretraining)

For Llama2:
Set the cfg-path in the script to train_configs/224_v2_llama2_video_stage_2.yaml
Set the model name in minigpt4/configs/datasets/cmd_video/default.yaml and minigpt4/configs/datasets/webvid/default.yaml to llama2

For Mistral:
Set the cfg-path in the script to train_configs/224_v2_mistral_video_stage_2.yaml
Set the model name in minigpt4/configs/datasets/cmd_video/default.yaml and minigpt4/configs/datasets/webvid/default.yaml to mistral

bash jobs_video/train/stage_2.sh

You can download our trained weights for this stage: Llama2, Mistral.

Stage 3 (video Instruction finetuning)

For Llama2:
Set the cfg-path in the script to train_configs/224_v2_llama2_video_stage_3.yaml
Set the model name in minigpt4/configs/datasets/video_chatgpt/default.yaml to llama2

For Mistral:
Set the cfg-path in the script to train_configs/224_v2_mistral_video_stage_3.yaml
Set the model name in minigpt4/configs/datasets/video_chatgpt/default.yaml to mistral

bash jobs_video/train/stage_3.sh

You can download our trained weights for this stage: Llama2, Mistral.

⚡ Evaluation

To reproduce the results, use the best checkpoint for each model: Llama2, Mistral.
We used the same evaluation as Video-ChatGPT.

Method	Using Subtitles	Information Correctness	Detailed Orientation	Contextual Understanding	Temporal Understanding	Consistency
LLaMA Adapter	❌	2.03	2.32	2.30	1.98	2.15
Video LLaMA	❌	1.96	2.18	2.16	1.82	1.79
Video Chat	❌	2.23	2.50	2.53	1.94	2.24
Video-ChatGPT	❌	2.40	2.52	2.62	1.98	2.37
BT-Adapter-7B	❌	2.68	2.69	3.27	2.34	2.46
LLaMA-VID-7B	❌	2.96	3.00	3.53	2.46	2.51
Ours-7B Llama2	❌	2.93	2.97	3.45	2.47	2.60
Ours-7B Llama2	✅	3.08	3.02	3.57	2.65	2.67
Ours-7B Mistral	❌	2.83	2.52	3.01	2.32	2.40
Ours-7B Mistral	✅	2.91	2.57	3.11	2.33	2.39

Method	Using Subtitles	MSVD Acc.↑	MSVD Score↑	MSRVTT Acc.↑	MSRVTT Score↑	TGIF Acc.↑	TGIF Score↑	ActivityNet Acc.↑	ActivityNet Score↑	TVQA Acc.↑
FrozenBiLM	❌	32.2	--	16.8	--	41	--	24.7	--	29.7
LLaMA Adapter	❌	54.9	3.1	43.8	2.7	--	--	34.2	2.7	--
Video LLaMA	❌	51.6	2.5	29	1.8	--	--	12.4	1.1	--
Video Chat	❌	56.3	2.8	45	2.5	34.4	2.3	26.5	2.2	--
Video-ChatGPT	❌	64.9	3.3	49.3	2.8	51.4	3.0	35.2	2.7	23.35
BT-Adapter-7B	❌	67.7	3.7	57	3.2	--	--	45.7	3.2	--
LLaMA-VID-7B	❌	69.7	3.7	57.7	3.2	--	--	47.4	3.3	--
Ours-7B Llama2	❌	72.93	3.84	58.83	3.29	67.9	3.71	45.85	3.23	36.45
Ours-7B Llama2	✅	72.93	3.84	59.73	3.3	67.9	3.71	46.3	3.4	46.94
Ours-7B Mistral	❌	73.92	4.06	58.26	3.52	72.22	4.08	44.25	3.35	33.90
Ours-7B Mistral	✅	73.92	4.06	58.68	3.53	72.22	4.08	44.38	3.36	54.21
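The gains quoted in the Overview can be recovered from the accuracy columns of the table above as absolute differences in percentage points. The sketch below is a worked check: the row pairings were chosen because they reproduce the quoted figures, which is an inference from the numbers and not something the source states. The second part shows the effect of subtitles on TVQA accuracy for both backbones.

```python
from typing import Dict

# Accuracy values (percent) copied from the table above.
ours = {
    "MSVD": 73.92,    # Ours-7B Mistral
    "MSRVTT": 58.83,  # Ours-7B Llama2, no subtitles
    "TGIF": 72.22,    # Ours-7B Mistral
    "TVQA": 36.45,    # Ours-7B Llama2, no subtitles
}
baseline = {
    "MSVD": 69.7,     # LLaMA-VID-7B
    "MSRVTT": 57.7,   # LLaMA-VID-7B
    "TGIF": 51.4,     # Video-ChatGPT
    "TVQA": 23.35,    # Video-ChatGPT
}


def point_gains(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Absolute difference a - b per benchmark, rounded to two decimals."""
    return {name: round(a[name] - b[name], 2) for name in a}


# TVQA accuracy without and with subtitles, per backbone.
tvqa_no_subs = {"Llama2": 36.45, "Mistral": 33.90}
tvqa_with_subs = {"Llama2": 46.94, "Mistral": 54.21}

if __name__ == "__main__":
    print(point_gains(ours, baseline))
    print(point_gains(tvqa_with_subs, tvqa_no_subs))
```

On TVQA, subtitles add roughly 10 points for the Llama2 model and roughly 20 points for the Mistral model, while the MSVD and TGIF rows are identical with and without subtitles.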

Download datasets for evaluation

The annotation files for the evaluation datasets are available for download.

Run evaluation script

Set the parameters in each evaluation script to include the path to the checkpoints, the dataset name, and whether to use subtitles.

# Llama2
bash jobs_video/eval/llama2_evaluation.sh
# Mistral
bash jobs_video/eval/mistral_evalualtion.sh

Then use GPT-3.5 Turbo to compare the predictions with the ground truth and generate the accuracy and scores.
Set these variables in both evaluate_benchmark.sh and evaluate_zeroshot.sh:

PRED="path_to_predictions"
OUTPUT_DIR="path_to_output_dir"
API_KEY="openAI_key"
NUM_TASKS=128

Then, to evaluate on the Video-ChatGPT benchmark, run the following script:

bash test_benchmark/quantitative_evaluation/evaluate_benchmark.sh

To evaluate open-ended questions, run the following script:

bash test_benchmark/quantitative_evaluation/evaluate_zeroshot.sh
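The accuracy and score columns reported in the tables above are aggregates over per-question judgments from the GPT-based comparison. The sketch below shows one way such judgments could be summarized. It assumes each judgment is a record with a yes/no `pred` field and a numeric `score` field, in the style of Video-ChatGPT's evaluation; the field names and the helper `summarize_judgments` are assumptions, not the scripts' confirmed output format.

```python
from typing import Dict, List, Tuple


def summarize_judgments(judgments: List[Dict]) -> Tuple[float, float]:
    """Return (accuracy in percent, mean score) over all judgments.

    Assumed record shape: {"pred": "yes" or "no", "score": number}.
    """
    if not judgments:
        raise ValueError("no judgments to summarize")
    # A prediction counts as correct when the judge answered "yes".
    correct = sum(
        1 for j in judgments if str(j["pred"]).strip().lower() == "yes"
    )
    mean_score = sum(float(j["score"]) for j in judgments) / len(judgments)
    accuracy = 100.0 * correct / len(judgments)
    return accuracy, mean_score


if __name__ == "__main__":
    demo = [
        {"pred": "yes", "score": 5},
        {"pred": "no", "score": 2},
        {"pred": "Yes", "score": 4},
        {"pred": "no", "score": 1},
    ]
    print(summarize_judgments(demo))
```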

If you're using MiniGPT4-Video in your research or applications, please cite using this BibTeX:

@article{ataallah2024minigpt4,
  title={MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens},
  author={Ataallah, Kirolos and Shen, Xiaoqian and Abdelrahman, Eslam and Sleiman, Essam and Zhu, Deyao and Ding, Jian and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2404.03413},
  year={2024}
}

Acknowledgements

MiniGPT4
Video-ChatGPT

License

This repository is under the BSD 3-Clause License. Much of the code is based on MiniGPT4.
