🥞 Muffin

Multimodal foundation models are found to be “out-of-the-box” multimodal interfaces for LLMs

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

News

  • [12/06] We have open-sourced the weights of Muffin trained with more SFT data on Hugging Face. The model achieves 80.0 on the VQAv2 test-dev split and shows strong chat ability.
  • [12/04] Our recent work RLHF-V, which is built upon Muffin and achieves SoTA results in preventing hallucination, is now released!

Models

Multimodal foundation models (MFMs) are native to multiple modalities and can serve as better bridges from different modalities to LLMs. This is because MFMs naturally encode features from other modalities (such as vision and audio) into the same space as language, which in turn better activates the capabilities of LLMs. We also list some examples generated by Muffin to demonstrate its effectiveness.

Datasets

Demonstration of the framework designed for constructing the UniMM-Chat dataset. UniMM-Chat combines various VL datasets to generate knowledge-intensive dialogues. Text spans highlighted in colors indicate the different pieces of knowledge from the original annotations that are required to answer the questions.

Reformulating-Datasets

Benchmarks

Benchmark-Performance

Examples

We list some representative cases to demonstrate the power of Muffin. We refer readers to our paper for more examples, and you can deploy a web demo by following the Web UI instructions under Serving.

Outside Knowledge Visual QA

case

Culture

case

Helpful

case

Data

Pre-training Data

The pre-training data used in this release are all public, including CC-3M, CC-12M, COCO, Visual Genome, and LAION-COCO.

Instruction Following Data

We present the UniMM-Chat dataset, which is constructed for visual instruction tuning and is intended to improve models' ability to solve different tasks without harming their generation ability.

During training, we use both the UniMM-Chat and the LLaVA-Instruct-150K datasets. To download our language-image multimodal instruction-following dataset, please run the following script:

bash ./script/download_data.sh

Install

  1. Clone this repository and navigate to source folder
git clone https://github.com/thunlp/muffin
cd muffin
  2. Download training data and install dependencies.
bash download_data.sh

echo "Creating conda environment"
conda create -n muffin python=3.10
conda activate muffin

echo "Installing dependencies"
pip install -e .

# Install a specific version of transformers so that you can reproduce the experimental results in our papers
git clone --recursive git@github.com:huggingface/transformers.git
cd transformers
git checkout a92e0ad2e20ef4ce28410b5e05c5d63a5a304e65
pip install .
cd ..

Training

Install the following additional packages if you plan to train the model.

git clone --recursive https://github.com/Dao-AILab/flash-attention.git
cd flash-attention

# Uncomment the following line if you have CUDA version <= 11.4
# git checkout ad11394

MAX_JOBS=8 python setup.py install
cd ..
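
The CUDA-version condition in the comment above can also be checked programmatically. The sketch below is only illustrative: the helper names `parse_version` and `needs_legacy_flash_attention` are hypothetical, and it assumes the CUDA version is available as a dotted string such as `11.4`.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Turn a dotted version string such as '11.4' or '11.4.152' into (major, minor)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def needs_legacy_flash_attention(cuda_version: str) -> bool:
    """Return True when CUDA is 11.4 or older, the case in which the
    instructions above suggest `git checkout ad11394` before building."""
    return parse_version(cuda_version) <= (11, 4)


if __name__ == "__main__":
    for version in ("11.3", "11.4", "11.8", "12.1"):
        print(version, needs_legacy_flash_attention(version))
```

Comparing `(major, minor)` tuples avoids the pitfall of comparing version strings lexically, where `"11.10"` would sort before `"11.4"`.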

Muffin Weights

We release Muffin weights on Hugging Face. To load Muffin for inference:

from muffin.eval.muffin_vqa import init_muffin

model, image_processor, image_token_len, tokenizer = init_muffin('Yirany/Muffin-13B')
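
Since `init_muffin` returns four values, it can be convenient to keep them together. The following is a minimal sketch, not part of the Muffin codebase: `MuffinBundle` and `load_bundle` are hypothetical names, and the `loader` parameter exists only so the sketch can run with a stub instead of downloading weights.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class MuffinBundle:
    """Hypothetical container for the four values returned by init_muffin."""
    model: Any
    image_processor: Any
    image_token_len: Any
    tokenizer: Any


def load_bundle(
    model_path: str,
    loader: Optional[Callable[[str], Tuple[Any, Any, Any, Any]]] = None,
) -> MuffinBundle:
    """Call the loader (init_muffin by default) and wrap its 4-tuple."""
    if loader is None:
        # Deferred import, so this file can be imported without muffin installed.
        from muffin.eval.muffin_vqa import init_muffin
        loader = init_muffin
    model, image_processor, image_token_len, tokenizer = loader(model_path)
    return MuffinBundle(model, image_processor, image_token_len, tokenizer)


if __name__ == "__main__":
    # Stub loader with placeholder values; it stands in for init_muffin here.
    def fake_loader(path: str):
        return ("model:" + path, "processor", 0, "tokenizer")

    bundle = load_bundle("Yirany/Muffin-13B", loader=fake_loader)
    print(bundle.model)
```

With the package installed, calling `load_bundle('Yirany/Muffin-13B')` without a `loader` falls back to `init_muffin` as shown above.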

Muffin pretrained weights

We also provide the pretrained Muffin weights (uploading, will be available soon), which have not been trained on instruction-following data.

Serving

Web UI

Launch a controller

python -m muffin.serve.controller --host 0.0.0.0 --port 10000

Launch a model worker

python -m muffin.serve.muffin_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Yirany/Muffin-13B --multi-modal

Wait until the process finishes loading the model and you see "Uvicorn running on ...".

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

If the VRAM of your GPU is 24GB or less (e.g., RTX 3090, RTX 4090), you may try running it with multiple GPUs.

python -m muffin.serve.muffin_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Yirany/Muffin-13B --multi-modal --num-gpus 2

Wait until the process finishes loading the model and you see "Uvicorn running on ...".
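
As a rough illustration of why a single 24GB card may not be enough, the sketch below estimates the memory needed just to hold the weights. It assumes about 13 billion parameters (read off the model name; the vision encoder adds more) stored as 16-bit values at 2 bytes each. Activations and the KV cache are ignored, so real usage is higher. The function names are hypothetical.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, vram_gb: float, bytes_per_param: int = 2) -> int:
    """Smallest number of equally sized GPUs whose combined VRAM holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / vram_gb)


if __name__ == "__main__":
    params = 13e9  # assumption: about 13B parameters, from the name Muffin-13B
    print(weight_memory_gb(params))          # 26.0
    print(min_gpus_for_weights(params, 24))  # 2
```

Under these assumptions the weights alone take about 26 GB, which is consistent with passing `--num-gpus 2` on 24GB cards.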

Launch a Gradio web server

python -m muffin.serve.gradio_web_server --controller http://localhost:10000

You can open your browser and chat with a model now.
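
The three launch commands above share the controller address and ports, so they can be generated from one place. This is a sketch only: the function names are hypothetical, and the modules and flags are copied from the commands above rather than from the Muffin source. It builds argument lists without starting any process.

```python
import shlex
from typing import List


def controller_cmd(port: int = 10000) -> List[str]:
    """Command for the controller process."""
    return ["python", "-m", "muffin.serve.controller",
            "--host", "0.0.0.0", "--port", str(port)]


def worker_cmd(model_path: str, controller_port: int = 10000,
               worker_port: int = 40000, num_gpus: int = 1) -> List[str]:
    """Command for a model worker; --num-gpus is added only for multi-GPU runs."""
    cmd = ["python", "-m", "muffin.serve.muffin_worker",
           "--host", "0.0.0.0",
           "--controller", f"http://localhost:{controller_port}",
           "--port", str(worker_port),
           "--worker", f"http://localhost:{worker_port}",
           "--model-path", model_path,
           "--multi-modal"]
    if num_gpus > 1:
        cmd += ["--num-gpus", str(num_gpus)]
    return cmd


def web_server_cmd(controller_port: int = 10000) -> List[str]:
    """Command for the Gradio web server."""
    return ["python", "-m", "muffin.serve.gradio_web_server",
            "--controller", f"http://localhost:{controller_port}"]


if __name__ == "__main__":
    # Print the commands in launch order; run each in its own terminal.
    for cmd in (controller_cmd(),
                worker_cmd("Yirany/Muffin-13B", num_gpus=2),
                web_server_cmd()):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.Popen` if you prefer to start the services from Python instead of separate shells.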

Evaluation

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

  1. Generate responses
bash ./script/eval/eval_muffin_qa.sh your_checkpoint_dir
  2. Evaluate the generated responses
bash ./script/eval/batch_gpt4_review.sh your_checkpoint_dir
  3. Summarize the evaluation results
python ./eval/summarize_gpt_llava_review.py your_checkpoint_dir
python ./eval/summarize_gpt_unimm-bench_review.py your_checkpoint_dir
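
The steps above can be chained in a small driver script. This is a minimal sketch under stated assumptions: `evaluation_steps` and `run_evaluation` are hypothetical helpers, the script paths are copied from the commands above, and the default dry-run mode only prints the commands. It assumes it is run from the repository root.

```python
import subprocess
from typing import List


def evaluation_steps(checkpoint_dir: str) -> List[List[str]]:
    """The evaluation commands, in the order listed in this README."""
    return [
        ["bash", "./script/eval/eval_muffin_qa.sh", checkpoint_dir],
        ["bash", "./script/eval/batch_gpt4_review.sh", checkpoint_dir],
        ["python", "./eval/summarize_gpt_llava_review.py", checkpoint_dir],
        ["python", "./eval/summarize_gpt_unimm-bench_review.py", checkpoint_dir],
    ]


def run_evaluation(checkpoint_dir: str, dry_run: bool = True) -> List[List[str]]:
    """Run each step in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    steps = evaluation_steps(checkpoint_dir)
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)
    return steps


if __name__ == "__main__":
    run_evaluation("your_checkpoint_dir", dry_run=True)
```

Stopping at the first failing step avoids summarizing stale or missing review files from an earlier run.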

Fine-tuning

bash ./script/train/run_unimm-chat.sh ./output/checkpoints master finetune_muffin ./data/coco_images

RLHF

ref_model=./RLHF-V_SFT_weight

bash ./script/train/run_RLHFV.sh \
    ./RLHFV_checkpoints/dpo_exp \
    master \
    RLHFV \
    1.1 \
    $ref_model \
    ./RLHF-V-Dataset \
    RLHFV_SFT \
    2160 \
    360 \
    0.1 \
    False \
    True

Licenses

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and ChatGPT. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Acknowledgement

  • LLaVA: the codebase we built upon, and Vicuna-13B, our base model, which has amazing language capabilities!

If you find Muffin useful for your research and applications, please cite using this BibTeX:

@misc{yu2023muffin,
      title={Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants},
      author={Tianyu Yu and Jinyi Hu and Yuan Yao and Haoye Zhang and Yue Zhao and Chongyi Wang and Shan Wang and Yinxv Pan and Jiao Xue and Dahai Li and Zhiyuan Liu and Hai-Tao Zheng and Maosong Sun},
      publisher={arXiv:2310.00653},
      year={2023},
}