CoLLaVO: Crayon Large Language and Vision mOdel [arxiv]
- CoLLaVO is now available in 🤗Huggingface Space.
- CoLLaVO is featured by Huggingface Daily Papers.
- A new model, MoAI, is released: [Paper]/[Github]/[Huggingface].
- Code is public (Only Inference Supported).
- CoLLaVO-7B is available for download on Huggingface.
- Huggingface README.md for simple running
- Short running code for an image example is available.
- Uploading GPT-Aided Evaluation
Official PyTorch implementation of Crayon Large Language and Vision mOdel (CoLLaVO), which improves performance on numerous zero-shot vision language tasks. This code builds on two baseline codebases: XDecoder: Generalized Decoding for Pixel, Image, and Language (accepted at CVPR 2023) and InternLM (Technical Paper).
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
Figure. Zero-shot performance of CoLLaVO-7B on challenging VL datasets compared with closed-source VLMs: GPT-4V, Gemini-Pro, Qwen-VL-Plus. Note: The scores of MME are rescaled by 1/20 to match the scales with the accuracies of others.
Figure. Demonstrating the efficiency and effectiveness of CoLLaVO compared with those of other VLMs. Note that accuracy is measured on SEED-IMG.
Table. Measuring four metrics: Accuracy, Precision, Recall, F1-score on three types of question answering to evaluate hallucination of vision language models: Adversarial, Random, and Popular in POPE.
@article{lee2024collavo,
title={CoLLaVO: Crayon Large Language and Vision mOdel},
author={Lee, Byung-Kwan and Park, Beomchan and Kim, Chae Won and Ro, Yong Man},
journal={arXiv preprint arXiv:2402.11248},
year={2024}
}
Model | GQA | SQA-IMG | TextVQA | POPE | MME-P | MME-C | MM-Bench | MMB-CN | MM-Vet | Q-Bench |
---|---|---|---|---|---|---|---|---|---|---|
BLIP2-13B | 42.4 | 61.0 | 42.5 | 85.3 | 1293.8 | 290.0 | - | - | 22.4 | - |
InstructBLIP-7B | 49.5 | 49.2 | 60.5 | 50.1 | - | - | 36.0 | 23.7 | 25.6 | 56.7 |
Qwen-VL-Chat-7B | 57.5 | 68.2 | 61.5 | - | 1487.5 | 360.7 | 60.6 | 56.7 | - | - |
LLaVA1.5-7B | 62.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 293.8 | 64.3 | 58.3 | 30.5 | 58.7 |
CoLLaVO-7B | 61.4 | 80.7 | 64.2 | 87.2 | 1689.7 | 525.0 | 83.0 | 82.1 | 40.3 | 67.6 |
.
├── asset # Required package lists (Important)
├── trainer # Training CoLLaVO and initializing optimizer (Not Supported Now)
├── utils # Miscellaneous util files (Not important)
├── collavo # CoLLaVO architecture & loading collavo (Important)
├── pipeline # Evaluating zero-shot vision language tasks (Important)
│
├── datasets # Important
│ ├── dataset_mappers # data parsing including augmentation for loader
│ ├── evaluation # measuring evaluation for each dataset
│ └── registration # register dataset
│
├── configs
│ ├── accel # Accelerate Config files (Support DDP)
│ └── collavo_eval.yaml # Config of evaluating collavo
│
├── modeling # Not Important
│ ├── architectures # training the prototype of collavo (Not Supported Now)
│ ├── utils # utils for modeling (Not important)
│ └── BaseModel # loading and saving model
│
├── lbk_entry.py # main code of control tower (Important)
├── run # bash file for running the evaluation (Important)
│
├── install # install required packages (Important)
└── README.md
First, run the following lines from the bash file install.
conda create -n collavo python=3.9
conda activate collavo
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation
In addition, you should set the following environment variables to set the dataset path.
export DETECTRON2_DATASETS=/path/to/dataset
export DATASET=/path/to/dataset
export DATASET2=/path/to/dataset
export VLDATASET=/path/to/dataset
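Before launching anything, it can help to confirm that the four variables above are actually set in the current shell. The following is a minimal sketch and not part of the repository: the helper name `missing_dataset_vars` is hypothetical, and it only checks that each variable is defined and non-empty, not that the path is valid.

```python
import os

# Variable names copied from the export lines above.
REQUIRED_VARS = ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET")


def missing_dataset_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_dataset_vars()
    if missing:
        print("Unset dataset variables:", ", ".join(missing))
    else:
        print("All dataset variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.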
Download the CoLLaVO-7B model, and then you can run demo.py:
"""
CoLLaVO-7B
Simple Six Steps
"""
# [1] Loading Image
from PIL import Image
from torchvision.transforms import Resize
from torchvision.transforms.functional import pil_to_tensor
image_path = "figures/crayon_image.jpg"
image = Resize(size=(490, 490), antialias=False)(pil_to_tensor(Image.open(image_path)))
# [2] Instruction Prompt
prompt = "Describe this image in detail"
# [3] Loading CoLLaVO
from collavo.load_collavo import prepare_collavo
collavo_model, collavo_processor, seg_model, seg_processor = prepare_collavo(collavo_path='BK-Lee/CoLLaVO-7B', bits=4, dtype='fp16')
# [4] Pre-processing for CoLLaVO
collavo_inputs = collavo_model.demo_process(image=image,
                                            prompt=prompt,
                                            processor=collavo_processor,
                                            seg_model=seg_model,
                                            seg_processor=seg_processor,
                                            device='cuda:0')
# [5] Generate
import torch
with torch.inference_mode():
    generate_ids = collavo_model.generate(**collavo_inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=256, use_cache=True)
# [6] Decoding
answer = collavo_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)
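Step [6] above keeps only the text before the first `'[U'` in the decoded string. The sketch below isolates that post-processing as a small function so it can be tried without loading the model. The function name `trim_answer` and the sample string are hypothetical, and the extra `.strip()` is an addition that the demo does not perform.

```python
def trim_answer(decoded: str, marker: str = "[U") -> str:
    """Keep only the text before the first occurrence of `marker`.

    Mirrors the `.split('[U')[0]` call in step [6]; the trailing
    `.strip()` is an addition not present in the demo.
    """
    return decoded.split(marker)[0].strip()


# Hypothetical decoded output, for illustration only.
example = "A box of crayons on a desk. [UNKNOWN_TAIL] leftover text"
print(trim_answer(example))
```

If the marker never appears, `split` returns the whole string as its first element, so the answer passes through unchanged apart from whitespace trimming.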
If you want to validate zero-shot performance on numerous datasets, run the bash file 'run'.
# CoLLaVO-Experiment
GPU_DEVICE="0,1,2,3,4,5"
length=${#GPU_DEVICE}
n_gpu=$(((length+1)/2))
main_port=10000
test_batch=1
CUDA_VISIBLE_DEVICES=$GPU_DEVICE \
accelerate launch --config_file configs/accel/ddp_accel.yaml \
--num_processes=$n_gpu \
--main_process_port=$main_port \
lbk_entry.py eval \
--conf_files configs/collavo_eval.yaml \
--overrides \
WANDB False \
DATASETS.TEST mme \
PIPELINE MMEPipeline \
MME.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
SCIENCEQA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
POPE.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
MMBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
MMVET.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
AI2D.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
HALLUSIONBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
MATHVISTA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
QBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
SEED.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
SAVE_DIR /path/to/CoLLaVO_DIR \
WEIGHT True \
RESUME_FROM /path/to/CoLLaVO_WEIGHT
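The script above derives `n_gpu` from the character length of `GPU_DEVICE`, which matches the device count only when every ID is a single digit. The following sketch, written in Python purely for illustration, reproduces that arithmetic next to a comma-split count. The function names are hypothetical and nothing here is part of the repository.

```python
def n_gpu_by_length(gpu_device: str) -> int:
    """Mirror the script's arithmetic: (string length + 1) / 2, integer division."""
    return (len(gpu_device) + 1) // 2


def n_gpu_by_split(gpu_device: str) -> int:
    """Count comma-separated, non-empty device IDs."""
    return len([d for d in gpu_device.split(",") if d.strip()])


def total_batch_size(gpu_device: str, test_batch: int = 1) -> int:
    """Equivalent of $((n_gpu * test_batch)), using the split-based count."""
    return n_gpu_by_split(gpu_device) * test_batch


if __name__ == "__main__":
    for devices in ("0,1,2,3,4,5", "8,9,10,11"):
        print(devices, n_gpu_by_length(devices), n_gpu_by_split(devices))
```

With `"0,1,2,3,4,5"` both counts agree. With two-digit IDs such as `"8,9,10,11"` the length-based formula overcounts, so on machines with ten or more GPUs the device count may need to be set by hand.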
Note that you must change two of the overrides, DATASETS.TEST and PIPELINE, to evaluate the dataset you want. (This is very important!!)
DATASETS.TEST
- GQA: gqa_testdev_balanced
- SQA-IMG: scienceqa_test
- TextVQA: textvqa_val
- POPE: pope_test
- MME: mme
- MM-Bench: mmbench or mmbench_cn
- MM-Vet: mm-vet
- Q-Bench: qbench_dev
- MATHVISTA: mathvista_testmini
- AI2D: ai2d
- SEED-IMG: seed
- HallusionBench: hallusionbench
PIPELINE
- GQA: GQAPipeline
- SQA-IMG: SQAPipeline
- TextVQA: TextVQAPipeline
- POPE: POPEPipeline
- MME: MMEPipeline
- MM-Bench: MMBenchPipeline
- MM-Vet: MMVetPipeline
- Q-Bench: QBenchPipeline
- MATHVISTA: MathVistaPipeline
- AI2D: AI2DPipeline
- SEED-IMG: SEEDPipeline
- HallusionBench: HallusionPipeline
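Because the two values must always be changed together, one way to avoid mismatches is to keep them in a single lookup table. The sketch below is not part of the repository: the dictionary and the helper `override_args` are hypothetical, the values are copied from the two lists above, and pairing `mmbench_cn` with `MMBenchPipeline` is an assumption, since the lists give only one pipeline for MM-Bench.

```python
# (DATASETS.TEST value, PIPELINE value) per benchmark, copied from the lists above.
BENCHMARKS = {
    "GQA": ("gqa_testdev_balanced", "GQAPipeline"),
    "SQA-IMG": ("scienceqa_test", "SQAPipeline"),
    "TextVQA": ("textvqa_val", "TextVQAPipeline"),
    "POPE": ("pope_test", "POPEPipeline"),
    "MME": ("mme", "MMEPipeline"),
    "MM-Bench": ("mmbench", "MMBenchPipeline"),
    # Assumption: the Chinese split reuses the MM-Bench pipeline.
    "MMB-CN": ("mmbench_cn", "MMBenchPipeline"),
    "MM-Vet": ("mm-vet", "MMVetPipeline"),
    "Q-Bench": ("qbench_dev", "QBenchPipeline"),
    "MATHVISTA": ("mathvista_testmini", "MathVistaPipeline"),
    "AI2D": ("ai2d", "AI2DPipeline"),
    "SEED-IMG": ("seed", "SEEDPipeline"),
    "HallusionBench": ("hallusionbench", "HallusionPipeline"),
}


def override_args(benchmark: str) -> list:
    """Return the two override pairs for one benchmark (hypothetical helper)."""
    dataset, pipeline = BENCHMARKS[benchmark]
    return ["DATASETS.TEST", dataset, "PIPELINE", pipeline]


if __name__ == "__main__":
    print(" ".join(override_args("POPE")))
```

The printed tokens are in the same order as the `DATASETS.TEST` and `PIPELINE` lines of the run script, so they can be pasted in place of those two lines.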
GPT-4 Aided Evaluation for AI2D, MM-Vet, SEED-IMG
This code will be made public soon!
.
├── GQA # GQA
├── ScienceQA # SQA-IMG
├── TextVQA # TextVQA
├── POPE # POPE
├── MME_Benchmark_release_version # MME
├── MMBench # MM-Bench
├── mm-vet # MM-Vet
├── LLVisionQA-QBench # Q-Bench
├── MathVista # MathVista
├── SEED-Bench # SEED-IMG
├── ai2d # AI2D
└── HallusionBench # HallusionBench
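The layout above can be checked with a few lines of Python before starting an evaluation. This is a minimal sketch under stated assumptions: the helper `missing_dataset_dirs` is hypothetical, the directory names are copied from the tree above, and which environment variable points at this root is not stated here, so the root is passed in explicitly.

```python
from pathlib import Path

# Directory names copied from the tree above.
EXPECTED_DIRS = (
    "GQA",
    "ScienceQA",
    "TextVQA",
    "POPE",
    "MME_Benchmark_release_version",
    "MMBench",
    "mm-vet",
    "LLVisionQA-QBench",
    "MathVista",
    "SEED-Bench",
    "ai2d",
    "HallusionBench",
)


def missing_dataset_dirs(root) -> list:
    """Return the expected dataset directories not found under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "/path/to/dataset" is the placeholder used earlier in this README.
    print(missing_dataset_dirs("/path/to/dataset"))
```

An empty list means every benchmark directory is present. Only the benchmarks you plan to evaluate need to exist, so a non-empty result is not necessarily an error.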