Empowering Open-Source Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [Paper] [HF Model]
📣 Note: Our new dataset is complementary to existing training sources, add it to your train set and boost your multimodal LLMs with Set-of-Mark prompting and improved general capacity! No cost at inference time!
- [04/26] Thanks AK and HF daily papers for featuring our work!
- [04/25] Our paper is on arxiv! [Paper]
- [04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]
Method | LLM | POPE | MME | SEED-I | LLaVA-Wild | MM-VET |
BLIP-2 | Vicuna-13B | 85.3 | 1293.8 | 49.7 | 38.1 | 22.4 |
LLaVA-1.5 | Vicuna-13B | 85.9 | 1531.3 | 68.2 | 70.7 | 35.4 |
SoM-LLaVA-1.5 | Vicuna-13B | 86.6 | 1563.1 | 69.6 | 75.3 | 35.9 |
SoM-LLaVA-1.5 w/ tags | Vicuna-13B | 87.0 | 1572.8 | 69.5 | 73.3 | 37.2 |
📣 Note: We get 1% to 6% relative improvements on all benchmarks, by simply adding 30k SoM data to the visual instruction tuning stage of LLaVA. SoM-LLaVA-1.5 w/ tags is to feed the model with tagged images, but you can enjoy the performance gain even without the extra visual prompts at test time!
som_llava_mix695k.json: Full SFT data with llava-665k + SoM-30k
som_listing_coco10k.json: listing all items with SoM images.
som_qa_coco20k.json: QA with SoM images. (Note: QA used the same 10k images from listing, with another batch of 10k added.)
som_train2017.zip: A subset of 20k coco images that is annotated with SoM, used in our data construction.
We release our main model, SoM-LLaVA trained with LLaVA-665k and SoM-style Listing + QA data.
[SoM-LLaVA-v1.5-13B] (model weights in original LLaVA format, load and eval with LLaVA)
[SoM-LLaVA-v1.5-13B-HF] (model weights converted into HF format, see usage below)
Two additional models for ablation study:
We adopt the training code of LLaVA. Please set up environments following the instructions. Currently our data is used in the Visual Instruction Tuning stage.
- Prepare data
Please download the annotation of the final mixture of our instruction tuning data som_llava_mix695k.json , which is a mixture of llava_mix665k and 30k SoM data, and download the images from the following datasets:
- COCO: train2017
- COCO: som_train2017
- GQA: images
- OCR-VQA: download script, we save all files as
.jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in your data folder.
├── coco
│ ├── train2017
│ └── som_train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
- Training
After downloading our data (or preparing your own SoM data), train SoM-LLaVA via command line:
bash scripts/v1_5/finetune.sh
Note: Our implementation is improved over the original SoM repo, by removing overlapping regions for each mask (otherwise there will be confilicts/overlaps for tag positions).
- Init virtual envs
# create env. Note: must use 3.10, 3.11 will cause package conflicts.
conda create -n som python=3.10 -y
conda activate som
- Install libgeos if there is error installing SEEM
sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev
- Install segmentation packages
# download repo and navigate to SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd ~/SoM-LLaVA/SoM/
# install PyTorch
pip3 install torch torchvision torchaudio
# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..
# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
# install additional packages
pip install datasets
- Download the pretrained models
sh download_ckpt.sh
- Annotate COCO images with SoM
python annotate_coco.py
If you would like to load our model in huggingface, here is an example script:
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_path = "zzxslp/som-llava-v1.5-13b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)
prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print (output)
Note: to reproduce the results reported in the paper, we recommend using the official LLaVA repo with our LLaVA-format model.
If you find our data or model useful for your research and applications, please cite our paper:
@article{yan2024list,
title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
journal={arXiv preprint arXiv:2404.16375},
year={2024}
}
This project is a collaborative work between UC San Diego and Microsoft GenAI, built on top of LLaVA and SoM. Thank the authors for their contributions to the community!