Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration 🌐🖼️📹🎵📝

¹ Chenyang Lyu, ² Bingshuai Liu, ³ Minghao Wu, ⁴ Zefeng Du,

⁵ Xinting Huang, ⁵ Zhaopeng Tu, ⁵ Shuming Shi, ⁵ Longyue Wang

¹ Dublin City University, ² Xiamen University, ³ Monash University, ⁴ University of Macau, ⁵ Tencent AI Lab

Macaw-LLM is an exploratory project in multi-modal language modeling that combines image, video, audio, and text data, building on CLIP, Whisper, and LLaMA.

Introduction 📖

In recent years, language modeling has advanced rapidly. However, integrating multiple modalities, such as images, video, audio, and text, remains challenging. Macaw-LLM addresses this by bringing together state-of-the-art models for visual, auditory, and textual information: CLIP, Whisper, and LLaMA.

Key Features 🔑

Macaw-LLM has two main features:

- Simple & Fast Alignment: Multi-modal data is integrated through a simple and fast alignment to the LLM embeddings, so the model adapts quickly to diverse data types.
- One-Stage Instruction Fine-Tuning: The model is adapted in a single stage of instruction fine-tuning, which keeps the training pipeline simple.

Architecture 🔧

Macaw-LLM is composed of three main components:

- CLIP: Responsible for encoding images and video frames.
- Whisper: Responsible for encoding audio data.
- LLM (LLaMA/Vicuna/Bloom): The language model that encodes instructions and generates responses.

The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
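To make the division of labor concrete, here is a minimal sketch of how a sample's inputs map onto the three components. The `Sample` class, its field names, and `plan_encoders` are hypothetical names used only for illustration; they are not taken from the Macaw-LLM code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """Hypothetical container for one multi-modal example."""
    instruction: str
    image_path: Optional[str] = None
    video_path: Optional[str] = None
    audio_path: Optional[str] = None


def plan_encoders(sample: Sample) -> List[Tuple[str, str]]:
    """List which component handles each input that is present."""
    plan = []
    if sample.image_path:
        plan.append(("image", "CLIP"))
    if sample.video_path:
        plan.append(("video frames", "CLIP"))
    if sample.audio_path:
        plan.append(("audio", "Whisper"))
    # The instruction text always goes to the language model.
    plan.append(("instruction", "LLM"))
    return plan


sample = Sample(
    instruction="Describe the clip.",
    video_path="clip.mp4",
    audio_path="clip.wav",
)
for modality, encoder in plan_encoders(sample):
    print(f"{modality} -> {encoder}")
```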

Alignment Strategy 📏

Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves:

- Encoding multi-modal features with CLIP and Whisper.
- Feeding the encoded features into an attention function, wherein the multi-modal features serve as the query and the embedding matrix of LLaMA as the key and value.
- Injecting the outputs into the input sequence (before instruction tokens) of LLaMA, allowing for a streamlined alignment process with minimal additional parameters.
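The steps above can be sketched in NumPy. This is an illustration under stated assumptions, not the repository's implementation: random arrays stand in for the CLIP/Whisper features and for the LLaMA embedding matrix, the projection `w_q` is a hypothetical learned matrix that maps encoder features to the embedding width, and the attention is assumed to be single-head scaled dot-product.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def align_to_text_space(features, embedding_matrix, w_q):
    """Attend from multi-modal features (query) to token embeddings (key and value).

    features:         (n, d_feat)     encoder outputs
    embedding_matrix: (vocab, d_model) LLM token embeddings
    w_q:              (d_feat, d_model) assumed query projection
    """
    queries = features @ w_q
    scores = queries @ embedding_matrix.T / np.sqrt(embedding_matrix.shape[1])
    weights = softmax(scores, axis=-1)      # one distribution over the vocabulary per feature
    return weights @ embedding_matrix       # (n, d_model), lives in the embedding space


def build_input_sequence(aligned, instruction_ids, embedding_matrix):
    """Place the aligned features before the instruction token embeddings."""
    instruction_embeds = embedding_matrix[instruction_ids]
    return np.concatenate([aligned, instruction_embeds], axis=0)


rng = np.random.default_rng(0)
vocab, d_model, d_feat = 50, 16, 8
embedding_matrix = rng.normal(size=(vocab, d_model))   # stand-in for LLaMA embeddings
features = rng.normal(size=(4, d_feat))                # stand-in for CLIP/Whisper outputs
w_q = rng.normal(size=(d_feat, d_model)) * 0.1         # hypothetical projection

aligned = align_to_text_space(features, embedding_matrix, w_q)
instruction_ids = np.array([3, 7, 11])
sequence = build_input_sequence(aligned, instruction_ids, embedding_matrix)
print(sequence.shape)  # (7, 16)
```

Because each output row is a weighted average of token embeddings, the aligned features stay within the range the language model already sees at its input, and in this sketch the only new parameters are those of the query projection.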

Installation 💻

To install Macaw-LLM, follow these steps:

```
# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git

# Change to the Macaw-LLM directory
cd Macaw-LLM

# Install required packages
pip install -r requirements.txt

# Install ffmpeg (yum is for RHEL/CentOS; use your system's package manager elsewhere)
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
```
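After installation, a quick check can confirm that the external dependencies are visible. The sketch below is not part of the repository; `check_environment` is a hypothetical helper that only uses the Python standard library to look for the `ffmpeg` executable and the `apex` package.

```python
import importlib.util
import shutil


def check_environment(tools=("ffmpeg",), modules=("apex",)):
    """Return {name: available} for command-line tools and Python modules."""
    status = {tool: shutil.which(tool) is not None for tool in tools}
    status.update(
        {mod: importlib.util.find_spec(mod) is not None for mod in modules}
    )
    return status


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```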

Usage 🚀

Downloading dataset:
- Text data: stanford_alpaca/alpaca_data.json
- Image data: COCO Dataset
- Video data: Charades and Video Dialog
Dataset preprocessing:
- Place the data for the three modalities in their respective folders: data/text/, data/image/, data/video/
- Extract frames and audio from videos:
```
python preprocess_data.py
```
- Convert the supervised data into a dataset:
```
python preprocess_data_supervised.py
```
- Convert the unsupervised data into a dataset:
```
python preprocess_data_unsupervised.py
```
Training:
- Execute the training script (you can specify the training parameters inside):
```
./train.sh
```
Inference:
- Execute the inference script (custom inputs can be specified inside the script):
```
./inference.sh
```
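For readers who want to see what the frame and audio extraction step of preprocessing might involve, here is a sketch that builds ffmpeg commands for one video. It does not reproduce `preprocess_data.py`: the output folders `data/video/frames` and `data/video/audio`, the file-naming pattern, the rate of one frame per second, and the helper names are all assumptions. The 16 kHz mono audio setting is chosen because Whisper models operate on 16 kHz audio.

```python
from pathlib import Path
from typing import List


def frame_command(video: Path, out_dir: Path, fps: int = 1) -> List[str]:
    """ffmpeg command that samples `fps` frames per second as JPEG files."""
    pattern = (out_dir / f"{video.stem}_%04d.jpg").as_posix()
    return ["ffmpeg", "-i", video.as_posix(), "-vf", f"fps={fps}", pattern]


def audio_command(video: Path, out_dir: Path, sample_rate: int = 16000) -> List[str]:
    """ffmpeg command that drops the video stream and writes mono WAV audio."""
    target = (out_dir / f"{video.stem}.wav").as_posix()
    return [
        "ffmpeg", "-i", video.as_posix(),
        "-vn",                    # no video stream in the output
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # audio sample rate in Hz
        target,
    ]


video = Path("data/video/example.mp4")                 # hypothetical input file
print(" ".join(frame_command(video, Path("data/video/frames"))))
print(" ".join(audio_command(video, Path("data/video/audio"))))
```

The commands are returned as argument lists so they can be passed directly to `subprocess.run(cmd, check=True)` once the output folders exist.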

Future Work and Contributions 🚀

While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.

We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝

ToDo 👨‍💻

- More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5. This will enable more robust and versatile processing and understanding of multi-modal data.
- Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.

Citation

@misc{Macaw-LLM,
  author = {Chenyang Lyu and Bingshuai Liu and Minghao Wu and Zefeng Du and Longyue Wang},
  title = {Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/lyuchenyang/Macaw-LLM}},
}
