If our project helps you, please give us a star ⭐ and cite our paper!
- [2024/10/11] 🔥 We have released the technical report of Baichuan-Omni. See it here!
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart.
In this paper, we introduce Baichuan-Omni, the first high-performing open-source Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing image, video, audio, and text modalities, while delivering an advanced multimodal interactive experience. We propose an effective multimodal training schema that starts from a 7B model and proceeds through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities.
This approach equips the language model to handle visual and audio data effectively. Baichuan-Omni demonstrates strong performance across various omni-modal and multimodal benchmarks, and we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
The training pipeline is divided into two phases:
Phase 1: Multimodal Alignment Pretraining. The pre-training and alignment processes consist of Image-Language, Video-Language, and Audio-Language branches.
- The Image-Language branch utilizes a visual encoder to process images and undergoes training in three stages, focusing on image captioning, visual question answering tasks, and further enhancing alignment with the large language model (LLM).
- The Video-Language branch builds on the Image-Language branch, using the same visual encoder and video projector, and trains with a mix of image-text and video-text pairs.
- The Audio-Language branch incorporates the audio encoder from Whisper-large-v3 and a novel convolutional-gated MLP projector (a rough sketch follows this list), which replaces traditional pooling methods to retain more audio information.
- Finally, the Omni-Alignment stage integrates all branches, training on high-quality multimodal data pairs to enhance the model's multimodal understanding capabilities.
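The report names the convolutional-gated MLP projector but does not publish its exact layout, so the following is only a minimal sketch of what such a module could look like: a strided 1-D convolution downsamples the Whisper-large-v3 frame sequence instead of pooling it, and a gated MLP maps each remaining frame into the LLM embedding space. The class name `ConvGatedMLPProjector`, the hidden sizes, and the stride are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: dimensions, stride, and structure are assumptions,
# not the released Baichuan-Omni implementation.
import torch
import torch.nn as nn

class ConvGatedMLPProjector(nn.Module):
    """Maps Whisper encoder features into the LLM embedding space.

    A strided 1-D convolution shortens the audio frame sequence (instead of
    average pooling, which would discard more information), and a gated MLP
    projects each remaining frame into the language model's hidden size.
    """
    def __init__(self, audio_dim=1280, llm_dim=4096, stride=4):
        super().__init__()
        # Strided conv: downsample the time axis while keeping local detail.
        self.conv = nn.Conv1d(audio_dim, audio_dim, kernel_size=stride, stride=stride)
        # Gated MLP: a value path modulated by a sigmoid gate, then projected.
        self.gate = nn.Linear(audio_dim, llm_dim)
        self.value = nn.Linear(audio_dim, llm_dim)
        self.out = nn.Linear(llm_dim, llm_dim)

    def forward(self, audio_feats):                 # (batch, frames, audio_dim)
        x = self.conv(audio_feats.transpose(1, 2)).transpose(1, 2)
        x = torch.sigmoid(self.gate(x)) * self.value(x)
        return self.out(x)                          # (batch, frames/stride, llm_dim)

# Usage: 30 s of audio -> Whisper-large-v3 encoder yields ~1500 frames of dim 1280.
feats = torch.randn(1, 1500, 1280)
tokens = ConvGatedMLPProjector()(feats)             # -> (1, 375, 4096)
print(tokens.shape)
```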
Phase 2: Multimodal Supervised Fine-Tuning. The multimodal supervised fine-tuning stage is designed to enhance the model's ability to follow complex instructions across various tasks, using a dataset of over 600K pairs spanning multiple modalities.
- The text-only data encompasses a wide range of tasks, emphasizing complex, multi-step instructions.
- For image understanding, we employed the vFLAN dataset, applying a loss-based filtering method (one possible filter is sketched after this list) to improve data quality and translating the data into Chinese for better alignment.
- The video understanding component is sourced from the VideoInstruct100K dataset, where we implemented semantic deduplication and translation to diversify the instructions.
- For audio understanding, we generated audio using text-to-speech technology, ensuring quality through transcription verification (see the round-trip check sketched below) and supplementing it with human-recorded samples to capture diverse dialects.
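The loss-based filtering mentioned above is described only at a high level in the report; one common realization, shown here purely as an assumption rather than the authors' recipe, is to score each sample by a reference model's per-token loss on the target answer and drop samples at both extremes (trivially easy or likely noisy). The scoring model, helper names, and thresholds below are hypothetical.

```python
# Hypothetical sketch of loss-based data filtering; the scoring model, helper
# names, and thresholds are assumptions, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

scorer_name = "gpt2"  # stand-in scorer; any causal LM (or a VLM for image data) works
tokenizer = AutoTokenizer.from_pretrained(scorer_name)
model = AutoModelForCausalLM.from_pretrained(scorer_name).eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Mean cross-entropy of the answer tokens, conditioned on the prompt."""
    # Token-boundary alignment between prompt and prompt+answer is approximate here.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prompt_len] = -100          # ignore prompt tokens in the loss
    return model(ids, labels=labels).loss.item()

def filter_by_loss(samples, low=0.5, high=4.0):
    """Drop samples whose answer loss is suspiciously low (trivial/duplicated)
    or high (likely noisy or mislabeled)."""
    return [s for s in samples
            if low <= answer_loss(s["prompt"], s["answer"]) <= high]

# Example: samples = [{"prompt": "Describe the scene: <caption>", "answer": "..."}]
# clean = filter_by_loss(samples)
```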
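For the transcription verification of synthesized audio, one plausible round-trip check, again a sketch rather than the authors' pipeline, is to re-transcribe each TTS clip with an off-the-shelf ASR model such as Whisper and keep the clip only when the transcript closely matches the source text. The `synthesize` placeholder, the similarity metric, and the threshold are assumptions.

```python
# Hypothetical sketch of the TTS round-trip check; the TTS backend, similarity
# metric, and threshold are placeholder assumptions, not the paper's pipeline.
import difflib
import whisper  # pip install openai-whisper

asr = whisper.load_model("large-v3")

def synthesize(text: str, out_path: str) -> str:
    """Placeholder for whatever TTS system generates the training audio."""
    raise NotImplementedError("plug in your TTS backend here")

def keep_sample(text: str, wav_path: str, threshold: float = 0.9) -> bool:
    """Re-transcribe the synthesized audio and keep it only if the transcript
    closely matches the source text."""
    transcript = asr.transcribe(wav_path)["text"]
    similarity = difflib.SequenceMatcher(None, text.strip(), transcript.strip()).ratio()
    return similarity >= threshold

# Example: keep only clips whose ASR transcript matches the intended text.
# wav = synthesize("请介绍一下百川智能。", "sample.wav")
# if keep_sample("请介绍一下百川智能。", wav): ...
```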
We conducted a comprehensive evaluation across various modalities, including language, image understanding, video understanding, and audio understanding. The results indicate that Baichuan-Omni demonstrates superior capabilities in omni-modal comprehension.
Demo videos: `short-term_memory_OCR.mp4` · `math.mp4`
Coming soon & Stay tuned! 😉
@article{li2024baichuanomni,
title={Baichuan-Omni Technical Report},
author={Li, Yadong and Sun, Haoze and Lin, Mingan and Li, Tianpeng and Dong, Guosheng and Zhang, Tao and Ding, Bowen and Song, Wei and Cheng, Zhenglin and Huo, Yuqi and Chen, Song and Li, Xu and Pan, Da and Zhang, Shusen and Wu, Xin and Liang, Zheng and Liu, Jun and Zhang, Tao and Lu, Keer and Zhao, Yaqi and Shen, Yanjun and Yang, Fan and Yu, Kaicheng and Lin, Tao and Xu, Jianhua and Zhou, Zenan and Chen, Weipeng},
journal={arXiv preprint arXiv:2410.08565},
year={2024}
}