📝 List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Empowering Open-Source Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.

📣 Note: Our new dataset is complementary to existing training sources. Add it to your training set to boost your multimodal LLMs with Set-of-Mark prompting and improved general capability, at no extra cost at inference time!

🔥 News

[04/26] Thanks to AK and HF Daily Papers for featuring our work!
[04/25] Our paper is on arXiv! [Paper]
[04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]

📊 Results

Method	LLM	POPE	MME	SEED-I	LLaVA-Wild	MM-VET
BLIP-2	Vicuna-13B	85.3	1293.8	49.7	38.1	22.4
LLaVA-1.5	Vicuna-13B	85.9	1531.3	68.2	70.7	35.4
SoM-LLaVA-1.5	Vicuna-13B	86.6	1563.1	69.6	75.3	35.9
SoM-LLaVA-1.5 w/ tags	Vicuna-13B	87.0	1572.8	69.5	73.3	37.2

📣 Note: We get 1% to 6% relative improvements on all benchmarks simply by adding 30k SoM data to the visual instruction tuning stage of LLaVA. The "SoM-LLaVA-1.5 w/ tags" row feeds the model tagged images at test time, but you can enjoy the performance gain even without the extra visual prompts!
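The relative gains quoted above can be recomputed from the table. The snippet below is a minimal sketch that copies the LLaVA-1.5 and SoM-LLaVA-1.5 rows from the results table and reports the percentage improvement per benchmark; the function names are illustrative and not part of this repo.

```python
# Scores copied from the results table (Vicuna-13B rows).
BENCHMARKS = ["POPE", "MME", "SEED-I", "LLaVA-Wild", "MM-VET"]
SCORES = {
    "LLaVA-1.5": [85.9, 1531.3, 68.2, 70.7, 35.4],
    "SoM-LLaVA-1.5": [86.6, 1563.1, 69.6, 75.3, 35.9],
    "SoM-LLaVA-1.5 w/ tags": [87.0, 1572.8, 69.5, 73.3, 37.2],
}


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def gains_over_baseline(method, baseline="LLaVA-1.5"):
    """Map each benchmark name to the relative gain of `method`, rounded to 0.1%."""
    return {
        name: round(relative_gain(new, base), 1)
        for name, new, base in zip(BENCHMARKS, SCORES[method], SCORES[baseline])
    }


if __name__ == "__main__":
    for method in ("SoM-LLaVA-1.5", "SoM-LLaVA-1.5 w/ tags"):
        print(method, gains_over_baseline(method))
```

The largest gain without tags is on LLaVA-Wild, and the smallest is on POPE, where the baseline is already high.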

🌱 SoM Dataset

[Training data for SoM-LLaVA]

som_llava_mix695k.json: Full SFT data with llava-665k + SoM-30k

som_listing_coco10k.json: Listing all items with SoM images.

som_qa_coco20k.json: QA with SoM images. (Note: QA used the same 10k images from listing, with another batch of 10k added.)

som_train2017.zip: A subset of 20k COCO images annotated with SoM, used in our data construction.

🍰 Model Checkpoints

We release our main model, SoM-LLaVA, trained with LLaVA-665k and SoM-style Listing + QA data.

[SoM-LLaVA-v1.5-13B] (model weights in original LLaVA format, load and eval with LLaVA)

[SoM-LLaVA-v1.5-13B-HF] (model weights converted into HF format, see usage below)

Two additional models for ablation study:

[SoM-LLaVA-v1.5-13B-listing]

[SoM-LLaVA-v1.5-13B-qa]

🍡 Showcases

🍄 Training

We adopt the training code of LLaVA. Please set up the environment following the LLaVA instructions. Currently, our data is used in the visual instruction tuning stage.

Prepare data

Please download the annotation file for our final instruction tuning mixture, som_llava_mix695k.json, which combines llava_mix665k with 30k SoM data, and download the images from the following datasets:

COCO: train2017
COCO: som_train2017
GQA: images
OCR-VQA: download script, we save all files as .jpg
TextVQA: train_val_images
VisualGenome: part1, part2

After downloading all of them, organize the data as follows in your data folder.

├── coco
│   ├── train2017
│   └── som_train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
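Before launching training, it can help to confirm that every image referenced by the annotation file is present in this layout. The sketch below assumes the annotation JSON follows the LLaVA instruction-tuning format, i.e. a list of records where image samples carry an "image" key holding a path relative to the data folder (for example coco/train2017/...). The helper name find_missing_images is hypothetical and not part of this repo.

```python
import json
import sys
from pathlib import Path


def find_missing_images(annotation_file, data_root):
    """Return sorted relative image paths referenced in the annotations but absent on disk.

    Assumes each record is a dict with an optional "image" key whose value is a
    path relative to `data_root` (the LLaVA instruction-tuning convention).
    """
    with open(annotation_file, "r", encoding="utf-8") as f:
        samples = json.load(f)
    root = Path(data_root)
    # Text-only samples have no "image" key, so they are skipped.
    referenced = {s["image"] for s in samples if "image" in s}
    return sorted(p for p in referenced if not (root / p).is_file())


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_data.py som_llava_mix695k.json /path/to/data
    missing = find_missing_images(sys.argv[1], sys.argv[2])
    print(f"{len(missing)} referenced images are missing")
    for path in missing[:20]:
        print("  ", path)
```

Running a check like this first tends to surface a mistyped folder name or an incomplete download before the training job fails partway through.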

Training

After downloading our data (or preparing your own SoM data), train SoM-LLaVA via command line:

bash scripts/v1_5/finetune.sh

❄️ Using SoM

Note: Our implementation improves on the original SoM repo by removing overlapping regions between masks; otherwise, overlapping masks lead to conflicting tag positions.
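To illustrate the idea of overlap removal, here is a minimal sketch and not the code used in this repo. It assumes masks are given as sets of (row, col) pixel coordinates and that larger masks are given priority; both choices are assumptions made for the example. In practice masks are usually boolean arrays, and the actual priority rule may differ.

```python
def remove_overlaps(masks):
    """Make masks mutually exclusive.

    `masks` is a list of sets of (row, col) pixels. Masks are visited from
    largest to smallest (an assumption of this sketch), and each mask keeps
    only the pixels that no earlier mask has claimed. The trimmed masks are
    returned in the original order; a fully covered mask becomes an empty set.
    """
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]), reverse=True)
    claimed = set()
    result = [set() for _ in masks]
    for i in order:
        result[i] = masks[i] - claimed
        claimed |= result[i]
    return result


if __name__ == "__main__":
    large = {(r, c) for r in range(2) for c in range(3)}  # 2x3 block, 6 pixels
    small = {(1, 1), (1, 2), (2, 2)}                      # overlaps `large` on 2 pixels
    trimmed = remove_overlaps([small, large])
    print([len(m) for m in trimmed])
```

Once every pixel belongs to at most one mask, a tag placed inside one region cannot land on a pixel that another region also covers.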

Init virtual envs

# create the env. Note: use Python 3.10; 3.11 causes package conflicts.
conda create -n som python=3.10 -y
conda activate som

Install libgeos if you hit an error when installing SEEM

sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev

Install segmentation packages

# download repo and navigate to SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd SoM-LLaVA/SoM/

# install PyTorch
pip3 install torch torchvision torchaudio

# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..

# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'

# install additional packages
pip install datasets

Download the pretrained models

sh download_ckpt.sh

Annotate COCO images with SoM

python annotate_coco.py

😊 Using LLaVA in HF

If you would like to load our model with Hugging Face Transformers, here is an example script:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

Note: to reproduce the results reported in the paper, we recommend using the official LLaVA repo with our LLaVA-format model.

🐱 Citation

If you find our data or model useful for your research and applications, please cite our paper:

@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}

🍻 Acknowledgments

This project is a collaboration between UC San Diego and Microsoft GenAI, built on top of LLaVA and SoM. We thank the authors for their contributions to the community!