UMOE-Scaling-Unified-Multimodal-LLMs

The code for "Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts"


If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.

🚀 Welcome to the repo of Uni-MoE!

Uni-MoE is a MoE-based unified multimodal model and can handle diverse modalities including audio, speech, image, text, and video.

Project Page | Demo | Paper

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang

🔥 News

  • [4/28] 🔥 We have upgraded the Uni-MoE codebase to support training across multiple nodes and GPUs; explore this functionality in our revamped fine-tuning script. We have also introduced a version that integrates distributed MoE modules, which trains the model with parallelism at both the expert and modality levels for greater efficiency and scalability. For more details, please refer to the Uni_MoE_v2 documentation.
  • [3/7] 🔥 We released Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts. We propose a unified multimodal LLM (MLLM) built on the MoE framework that can process diverse modalities, including audio, image, text, and video. Check out the paper and demo.

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that comply with the license agreements of LLaMA and Vicuna. The dataset and models trained on it must not be used outside of research purposes.

🎨 Case Show

📀 Demo Video

Demo 2 shows real-time speech understanding (starting at 0:30).

demo1.mp4
demo2.mp4

🌟 Structure

The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) use pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts using cross-modal data to ensure deep understanding of each modality, preparing for a cohesive multi-expert model; 3) integrate the trained experts into the LLM and refine the unified multimodal model using the LoRA technique on mixed multimodal data.
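At the heart of this design is a routed expert layer: a lightweight router scores each token and dispatches it to an expert feed-forward network. The toy PyTorch layer below is a minimal sketch of that top-1 routing idea only; the actual Uni-MoE experts, router, and training objectives differ, and nothing here is taken from this repository.

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    # Minimal top-1 routed mixture-of-experts feed-forward block.
    # Purely illustrative; the real Uni-MoE expert/router design differs.
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)            # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, dim)
        gate = self.router(x).softmax(dim=-1)                # routing probabilities
        top1 = gate.argmax(dim=-1)                           # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():                                   # send tokens to their expert,
                out[mask] = gate[mask, e, None] * expert(x[mask])  # scaled by gate prob
        return out

tokens = torch.randn(6, 32)                                  # e.g. 6 multimodal tokens
print(ToyMoELayer()(tokens).shape)                           # torch.Size([6, 32])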

⚡️ Install

The following instructions are for Linux. We recommend the following requirements:

  • Python == 3.9.16
  • CUDA Version >= 11.7
  1. Clone this repository and navigate to the Uni_MoE folder:
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE
  2. Install Package
conda create -n unimoe python==3.9.16
conda activate unimoe
pip install -r env.txt
  3. Replace all the absolute pathnames '/path/to/' with your own path to the Uni_MoE directory (a convenience sketch for doing this in bulk follows below).
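If you prefer to patch the placeholders in bulk, a script like the one below can help. It is purely a convenience sketch (not part of the repo): it assumes the placeholders live in .sh/.py/.json files under the repository and that '/path/to/' should point at your local clone; review the result afterwards (for example with git diff), since paths to external data files may need different values.

# Convenience sketch only (not part of the repo): bulk-replace the '/path/to/'
# placeholder in script/config files with your local clone path.
from pathlib import Path

repo_root = Path("UMOE-Scaling-Unified-Multimodal-LLMs").resolve()  # your clone
local_prefix = str(repo_root) + "/"  # so '/path/to/Uni_MoE' becomes '<clone>/Uni_MoE'

for f in (repo_root / "Uni_MoE").rglob("*"):
    if f.is_file() and f.suffix in {".sh", ".py", ".json"}:
        text = f.read_text(encoding="utf-8", errors="ignore")
        if "/path/to/" in text:
            f.write_text(text.replace("/path/to/", local_prefix), encoding="utf-8")
            print("patched", f)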

⚡️ Uni-MoE Weights

To use our model, all weights should be downloaded.

After downloading all of them, organize the weights in the 'Uni_MoE/checkpoint' folder as follows:

└── checkpoint
    ├── Uni-MoE-audio-base
    ├── Uni-MoE-audio-e2
    ├── Uni-MoE-speech-base
    ├── Uni-MoE-speech-e2
    ├── Uni-MoE-speech-base-interval
    ├── Uni-MoE-speech-v1.5
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
| Model | Checkpoint |
| --- | --- |
| vision encoder | CLIP ViT-L/14 336px |
| speech encoder | whisper small |
| audio encoder | Fine-tuned BEATs_iter3+ (AS2M) |
| Uni-MoE-audio-base-model | Uni-MoE/Uni-MoE-audio-base |
| Uni-MoE-audio-fine-tuned-checkpoint | Uni-MoE/Uni-MoE-audio-e2 |
| Uni-MoE-speech-base-model | Uni-MoE/Uni-MoE-speech-base |
| Uni-MoE-speech-fine-tuned-checkpoint | Uni-MoE/Uni-MoE-speech-e2 |
| Uni-MoE-speech-base-interval | Uni-MoE/Uni-MoE-speech-base-interval |
| Uni-MoE-speech-v1.5 | Uni-MoE/Uni-MoE-speech-v1.5 |
  • Uni-MoE-speech refers to MOE-Task2 and Uni-MoE-audio refers to MOE-Task3 in our paper.
  • 'Uni-MoE-base' is the backbone containing the LLM and the trained parameters obtained from Training Stage 2: Training Modality-Specific Experts.
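Before running inference or training, it can save time to confirm that the layout above is complete. The snippet below is only a small sanity check (not a utility from this repo): it assumes the folder and file names listed in the tree above and reports anything missing.

# Quick sanity check (illustrative, not part of the repo): confirm that the
# checkpoint folder matches the layout listed above.
from pathlib import Path

ckpt = Path("Uni_MoE/checkpoint")  # adjust if your clone lives elsewhere
expected = [
    "Uni-MoE-audio-base", "Uni-MoE-audio-e2",
    "Uni-MoE-speech-base", "Uni-MoE-speech-e2",
    "Uni-MoE-speech-base-interval", "Uni-MoE-speech-v1.5",
    "clip-vit-large-patch14-336", "whisper-small",
    "BEATs_iter3_plus_AS2M.pt",
]
missing = [name for name in expected if not (ckpt / name).exists()]
print("all weights found" if not missing else f"missing: {missing}")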

🗝️ Dataset

Training Data

| DataSet | Type |
| --- | --- |
| LLaVA-Instruct-150K | image (train2014) |
| Video-Instruct-Dataset | video (from YouTube) |
| WavCaps | audio |
| AudioCaps | audio (Cap) |
| ClothoAQA | audio (QA) |
| ClothoV1 | audio (Cap) |
| MELD | audio (Music) |
| RACE | speech (TTS) |
| LibriSpeech | speech (Long) |

We use TTS to convert long text into speech in order to construct long-speech understanding data.
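The README does not tie this step to a specific TTS engine, so the snippet below uses gTTS purely as a stand-in to illustrate turning a RACE-style paragraph into a long-speech clip; the passage text and output file name are made up for illustration.

# Illustrative only: the project does not specify which TTS system was used.
# gTTS (pip install gTTS, requires internet access) stands in for any TTS engine.
from gtts import gTTS

passage = (
    "In 1848, gold was discovered in California, and within a year tens of "
    "thousands of people had moved west hoping to strike it rich."
)  # a made-up RACE-style reading passage
gTTS(text=passage, lang="en").save("race_passage_0001.mp3")  # long-speech clip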

Overall, all training tasks (16 comparative experiments covering models with single-expert and MoE configurations) are as follows:

| Training Tasks | Data Types | Data Size | Epochs | Trainable Modules | Pretraining Tasks |
| --- | --- | --- | --- | --- | --- |
| Audio-Language Pretraining | WavCaps*, AudioCaps*, MELD, ClothoV1 | 194K | 2 | Audio Q-former, Audio projection layer | - |
| Speech-Language Pretraining | Common Voice (Short Speech) | 1.7M | 2 | Speech Q-former, Speech projection layer | - |
| Single-Modality-Expert-Task1 | LLaVA-Instruction-150K (I-A) | 150K | 1 | LoRA, Speech projection layer | Speech-pretrain-task |
| Single-Modality-Expert-Task2 | LLaVA-Instruction-150K (T-I) | 150K | 1 | LoRA, Image projection layer | Speech-pretrain-task |
| Single-Modality-Expert-Task3 | LLaVA-Instruction-150K (I-A) | 150K | 1 | LoRA, Speech Q-former, Speech and Image projection layers | Speech-pretrain-task |
| Single-Modality-Expert-Task4 | LLaVA-Instruction-150K (I-A), RACE (T-A), LibriSpeech | 271K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
| Single-Modality-Expert-Task5 | LLaVA-Instruction-150K (T-I), RACE (T-A), LibriSpeech | 271K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
| Single-Modality-Expert-Task6 | LLaVA-Instruction-150K (I-A), LLaVA-Instruction-150K (T-I), RACE (T-A), LibriSpeech | 421K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
| Single-Modality-Expert-Task7 | RACE (T-A), LibriSpeech, RACE (T-A)-MC | 209K | 1 | LoRA, Speech projection layer | Speech-pretrain-task |
| Single-Modality-Expert-Task8 | WavCaps*, AudioCaps*, MELD, ClothoAQA, ClothoV1 | 203K | 1 | LoRA, Audio projection layer | Audio-pretrain-task |
| MoE-Task1 | LLaVA-Instruction-Dataset (T-I), LLaVA-Instruction-150K (I-A), RACE (T-A), LibriSpeech, RACE (T-A)-MC | 509K | 3 | LoRA, Router, Speech & Image projection layers | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
| MoE-Task1-short-speech | LLaVA-Instruction-Dataset (T-I), LLaVA-Instruction-150K (I-A) | 300K | 3 | LoRA, Router, Speech & Image projection layers | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
| MoE-Task2 | Video-Instruction-150K, LLaVA-Instruction-Dataset (T-I), RACE (T-A), LibriSpeech, RACE (T-A)-MC | 459K | 2 | LoRA, Router, Speech & Image projection layers | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
| MoE-Task3 | Video-Instruction-150K, LLaVA-Instruction-Dataset (T-I), WavCaps*, AudioCaps*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, Audio & Image projection layers | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/8 |
| Pure-MoE-Task1 | Video-Instruction-Dataset, LLaVA-Instruction-Dataset (T-I), WavCaps*, AudioCaps*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, Audio & Image projection layers | LLaVA-v1.5-LoRA |
| Pure-MoE-Task2 | Video-Instruction-Dataset, LLaVA-Instruction-Dataset (T-I), WavCaps*, AudioCaps*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, Audio & Image projection layers | - |

* indicates that we use only a subset of the dataset. MC represents the multi-choice setting. I-A denotes image-audio pairs, in which the question is converted into the corresponding speech. T-I denotes the original text-image pairs. T-A indicates that the contextual paragraph of the RACE dataset is converted into long speech. Pretraining tasks are the tasks included in the previous training stage.

Evaluation Data

| DataSet | Input Type |
| --- | --- |
| AOKVQA | Text-Image |
| OKVQA | Text-Image |
| VQAv2 | Text-Image |
| ClothoAQA | Text-Audio |
| ClothoV1 | Text-Audio |
| ClothoV2 | Text-Audio |
| POPE | Text-Image |
| TextVQA | Text-Image |
| MM-Vet | Text-Image |
| SEEDBench (Image) | Text-Image |
| MMBench | Text-Image |
| MMBench-Audio | Text-Image-Speech (Long) |
| English-High-School-Listening | Text-Speech (Long) |
| RACE | Text-Speech (Long) |
| MSVD | Text-Video-Audio |
| Activitynet-QA | Text-Video-Audio |

College Entrance English Examination Listening Part

We built a real speech understanding dataset, English-High-School-Listening, to test practical long-speech comprehension. It comprises 150 questions about long audio segments with an average length of 109 seconds, and 50 questions about short audio segments with an average length of 14 seconds.

🌈 How to run inference and deploy your demo

  1. Make sure that all the weights are downloaded and the running environment is set correctly.
  2. Run the inference scripts inference_audio.sh and inference_speech.sh with "bash inference_audio.sh" and "bash inference_speech.sh", or run the following commands directly:
# audio inference
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/inference_all.py

# speech inference
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/inference_all.py

To launch the online demo (we highly recommend launching it with Uni-MoE-speech-v1.5, which requires the base parameters of Uni-MoE-speech-base-interval), run:

cd /path/to/Uni_MoE
conda activate unimoe
python demo/demo.py
python demo/app.py

🌈 How to train and evaluate on datasets

Training:

  1. Make sure that all the weights are downloaded and the environment is set correctly, especially for the base model.
  2. Our training data can be downloaded from UMOE-Speech-453k.json and UMOE-Cap-453k.json.
  3. Relevant vision and audio files: Dataset
  4. Run the training scripts finetune_audio.sh or finetune_speech.sh with "bash finetune_audio.sh" or "bash finetune_speech.sh"; remember to modify the training set to your own preference.
  5. For multi-GPU training, run the training script finetune_speech_dp.sh with "bash finetune_speech_dp.sh"; again, modify the training set to your own preference.

Evaluation:

  1. Prepare the evaluation set following the format of samples.json.
  2. Run the evaluation scripts eval_audio.sh or eval_speech.sh with "bash eval_audio.sh" or "bash eval_speech.sh", or run the following commands:
# audio evaluation
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/eval.py \
 --data_path /path/to/clotho.json \
 --data_type clothov1 \
 --output test.json

# speech evaluation
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/eval.py \
 --data_path /path/to/vqa_eval.json \
 --data_type vqa \
 --output test.json
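To run both evaluations from a single entry point, a thin wrapper like the sketch below works; it only re-issues the two commands above (run it with the unimoe environment active; the output file names here are arbitrary placeholders).

# Convenience wrapper (illustrative, not part of the repo): run the two
# evaluation commands above from one Python script.
import subprocess

REPO = "/path/to/Uni_MoE"  # same placeholder as in the commands above
jobs = [
    ["python", "Uni_MoE_audio/eval.py", "--data_path", "/path/to/clotho.json",
     "--data_type", "clothov1", "--output", "clotho_test.json"],
    ["python", "Uni_MoE_speech/eval.py", "--data_path", "/path/to/vqa_eval.json",
     "--data_type", "vqa", "--output", "vqa_test.json"],
]
for cmd in jobs:
    subprocess.run(cmd, check=True, cwd=REPO)  # each job runs from the repo root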

We recommend a GPU with 80GB of memory to run all experiments.

Citation

If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:

@article{li2024uni,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={arXiv preprint arXiv:2405.11273},
  year={2024}
}