Multimodal foundation models are found to be “out-of-the-box” multimodal interfaces for LLMs
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
- [12/06] We have open-sourced the weights of Muffin trained with more SFT data on Hugging Face. The model achieves 80.0 on the VQAv2 test-dev split and has strong chat ability.
- [12/04] Our recent work RLHF-V is now released. It is built upon Muffin and achieves SoTA results in preventing hallucination!
Multimodal foundation models (MFMs) are native to multiple modalities and can serve as better bridges from different modalities to LLMs. This is because MFMs naturally encode features from other modalities (such as vision and audio) into the same space as language, which in turn better activates the capabilities of LLMs. We also list some examples generated by Muffin to demonstrate its effectiveness.
Demonstration of the framework designed for constructing the UniMM-Chat dataset. UniMM-Chat combines various VL datasets to generate knowledge-intensive dialogues. Text spans highlighted in colors indicate the different pieces of knowledge from the original annotations that are required to answer the questions.
We list some representative cases to demonstrate the power of Muffin. We refer readers to our paper for more examples, and you can deploy a web-demo following the instructions.
The pre-training data used in this release are all public, including CC-3M, CC-12M, COCO, Visual Genome, and LAION-COCO.
We present the UniMM-Chat dataset, which is constructed for visual instruction tuning and is intended to improve models' ability to solve different tasks without harming their generation ability.
We use both the UniMM-Chat and the LLaVA-Instruct-150K datasets during training. To download our language-image multimodal instruction-following dataset, please run the following script:
bash ./script/download_data.sh
- Clone this repository and navigate to source folder
git clone https://github.com/thunlp/muffin
cd muffin
- Download training data and install dependencies.
bash download_data.sh
echo "Creating conda environment"
conda create -n muffin python=3.10
conda activate muffin
echo "Installing dependencies"
pip install -e .
# Install specific version of transformers to make sure you can reproduce the experimental results in our papers
git clone --recursive https://github.com/huggingface/transformers.git
cd transformers
git checkout a92e0ad2e20ef4ce28410b5e05c5d63a5a304e65
pip install .
cd ..
Install additional packages if you need to do training.
git clone --recursive https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
# Uncomment the following line if you have CUDA version <= 11.4
# git checkout ad11394
MAX_JOBS=8 python setup.py install
cd ..
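After the install steps above, it can help to confirm which packages actually ended up in the `muffin` environment. The following is a minimal sketch using only the Python standard library; it is not part of the Muffin codebase, the helper names are illustrative, and `flash-attn` is an assumed distribution name for the flash-attention build. A source checkout may report a development version string, so treat the output as a rough check rather than proof that the pinned commit is installed.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def version_report(distributions):
    """Map each distribution name to its version string or 'not installed'."""
    return {name: installed_version(name) or "not installed" for name in distributions}


if __name__ == "__main__":
    # "flash-attn" is an assumed distribution name; adjust it if pip lists another.
    for name, version in version_report(["transformers", "flash-attn"]).items():
        print(f"{name}: {version}")
```

Run it inside the activated conda environment so that it inspects the same interpreter used for training and inference.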
We release Muffin weights on Hugging Face. To load Muffin for inference:
from muffin.eval.muffin_vqa import init_muffin
model, image_processor, image_token_len, tokenizer = init_muffin('Yirany/Muffin-13B')
We also provide the pretrained Muffin weights (uploading, will be available soon) without training on instruction-following data.
python -m muffin.serve.controller --host 0.0.0.0 --port 10000
python -m muffin.serve.muffin_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Yirany/Muffin-13B --multi-modal
Wait until the process finishes loading the model and you see "Uvicorn running on ...".
If your GPU has 24GB of VRAM or less (e.g., RTX 3090, RTX 4090), you may try running the worker with multiple GPUs:
python -m muffin.serve.muffin_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Yirany/Muffin-13B --multi-modal --num-gpus 2
Wait until the process finishes loading the model and you see "Uvicorn running on ...".
python -m muffin.serve.gradio_web_server --controller http://localhost:10000
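The three serving commands above (controller, model worker, Gradio web server) share the same controller and worker addresses, so it can be convenient to build them in one place. The sketch below only assembles the command lines shown above, using the same modules, flags, ports, and model path; the helper function names are hypothetical and not part of Muffin. Each process still has to be started in its own terminal, in the order given.

```python
import shlex

# Addresses taken from the commands above.
CONTROLLER_URL = "http://localhost:10000"
WORKER_URL = "http://localhost:40000"


def controller_cmd():
    """Command line for the controller process."""
    return ["python", "-m", "muffin.serve.controller",
            "--host", "0.0.0.0", "--port", "10000"]


def worker_cmd(model_path="Yirany/Muffin-13B", num_gpus=1):
    """Command line for the model worker; adds --num-gpus only when more than one GPU is requested."""
    cmd = ["python", "-m", "muffin.serve.muffin_worker",
           "--host", "0.0.0.0",
           "--controller", CONTROLLER_URL,
           "--port", "40000",
           "--worker", WORKER_URL,
           "--model-path", model_path,
           "--multi-modal"]
    if num_gpus > 1:
        cmd += ["--num-gpus", str(num_gpus)]
    return cmd


def web_server_cmd():
    """Command line for the Gradio web server."""
    return ["python", "-m", "muffin.serve.gradio_web_server",
            "--controller", CONTROLLER_URL]


if __name__ == "__main__":
    # Print the commands in launch order so they can be copied into separate terminals.
    for cmd in (controller_cmd(), worker_cmd(num_gpus=2), web_server_cmd()):
        print(shlex.join(cmd))
```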
Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
- Generate responses
bash ./script/eval/eval_muffin_qa.sh your_checkpoint_dir
- Evaluate the generated responses.
bash ./script/eval/batch_gpt4_review.sh your_checkpoint_dir
- Summarize the evaluation results
python ./eval/summarize_gpt_llava_review.py your_checkpoint_dir
python ./eval/summarize_gpt_unimm-bench_review.py your_checkpoint_dir
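The evaluation steps above run in a fixed order on the same checkpoint directory, so they can be chained from a small driver. This is a sketch under the assumption that it is launched from the repository root; the function names and the `dry_run` switch are illustrative and not part of the Muffin scripts. It defaults to printing the commands without executing them.

```python
import shlex
import subprocess


def eval_pipeline(checkpoint_dir):
    """Return the evaluation commands from the steps above, in order."""
    return [
        ["bash", "./script/eval/eval_muffin_qa.sh", checkpoint_dir],
        ["bash", "./script/eval/batch_gpt4_review.sh", checkpoint_dir],
        ["python", "./eval/summarize_gpt_llava_review.py", checkpoint_dir],
        ["python", "./eval/summarize_gpt_unimm-bench_review.py", checkpoint_dir],
    ]


def run(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print("+", shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(eval_pipeline("your_checkpoint_dir"), dry_run=True)
```

Stopping at the first failure avoids summarizing reviews from an incomplete generation or review step.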
bash ./script/train/run_unimm-chat.sh ./output/checkpoints master finetune_muffin ./data/coco_images
ref_model=./RLHF-V_SFT_weight
bash ./script/train/run_RLHFV.sh \
./RLHFV_checkpoints/dpo_exp \
master \
RLHFV \
1.1 \
$ref_model \
./RLHF-V-Dataset \
RLHFV_SFT \
2160 \
360 \
0.1 \
False \
True
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and ChatGPT. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
- LLaVA: the codebase we built upon, and our base model Vicuna-13B with its impressive language capabilities!
If you find Muffin useful for your research and applications, please cite using this BibTeX:
@misc{yu2023muffin,
title={Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants},
author={Tianyu Yu and Jinyi Hu and Yuan Yao and Haoye Zhang and Yue Zhao and Chongyi Wang and Shan Wang and Yinxv Pan and Jiao Xue and Dahai Li and Zhiyuan Liu and Hai-Tao Zheng and Maosong Sun},
publisher={arXiv:2310.00653},
year={2023},
}