[CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Primary language: Python. License: Apache-2.0.

✌️ VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Jitesh Jain, Jianwei Yang, Humphrey Shi

[Project Page] [COST Dataset] [arXiv] [pdf] [Video] [BibTeX]

This repo contains the code for our paper VCoder: Versatile Vision Encoders for Multimodal Large Language Models.

Contents

  1. Installation Instructions
  2. Demo
  3. Dataset Preparation
  4. Getting Started
  5. Results
  6. Citation

News

  • [December 29, 2023]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support! 🤗
  • [December 21, 2023]: Project Page, Dataset, ArXiv Preprint and GitHub Repo are public! 🚀
    • 🎯 VCoder is an adapter for improving MLLMs at object-level perception tasks with the aid of auxiliary perception modalities as control inputs.
    • 🎁 We also release the COST dataset to train and evaluate MLLMs at object-level perception tasks!
    • 🥁 VCoder LLaVA-1.5 and VCoder-DS LLaVA-1.5 checkpoints are available on HuggingFace Hub!
    • 👨🏻‍💻 [COMING SOON] VCoder (IT) LLaVA-1.5 trained on a mix of instruction-tuning data and COST dataset!

Installation Instructions

We use Python 3.10 and PyTorch 2.0.1 (CUDA 11.7 build) on Ubuntu 20.04.3 LTS.

  • Clone this repository.

    git clone https://github.com/SHI-Labs/VCoder
    cd VCoder
  • Set up the conda environment.

    conda create -n vcoder python=3.10 -y
    conda activate vcoder
    pip install --upgrade pip
    conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
    conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
    pip install -e .
    pip install ninja
    pip install flash-attn --no-build-isolation
  • Install additional packages for evaluation.

    python -m spacy download en_core_web_sm
    pip install --user -U nltk
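
After installation, a quick check that the main packages can be located may save a failed run later. The snippet below is a minimal sketch and not part of the VCoder codebase: the helper name `missing_modules` is hypothetical, and the module list assumes the usual import names of the packages installed above (`torch`, `torchvision`, `flash_attn`, `spacy`, `nltk`).

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages installed above.
REQUIRED_MODULES = ["torch", "torchvision", "flash_attn", "spacy", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates modules without importing them, so it does not confirm that CUDA or flash-attn work at runtime.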

Demo

You can use either the CLI or the Gradio interface to interact with VCoder LLaVA-1.5 locally.

Note: You can obtain the segmentation map from the OneFormer Demo and the depth map from DINOv2.

Gradio Interface

Run the following command:

CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.gradio_app --model-path shi-labs/vcoder_ds_llava-v1.5-13b

CLI Inference

Run the following command:

# --seg-image-file is optional, but required when --depth-image-file is given.
# --depth-image-file is optional.
# --load-4bit is optional; you may also use --load-8bit.
CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.cli \
    --model-path shi-labs/vcoder_ds_llava-v1.5-13b \
    --image-file "vcoder_llava/serve/examples/suits.jpg" \
    --seg-image-file "vcoder_llava/serve/examples/suits_pan.png" \
    --depth-image-file "vcoder_llava/serve/examples/suits_depth.jpeg" \
    --load-4bit
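
When the CLI is launched from a script, the rule that a depth map needs an accompanying segmentation map is easy to overlook. The following is a minimal sketch of a wrapper that assembles the argument list and enforces that rule before anything is launched. The function name `build_cli_command` and its `quantization` parameter are hypothetical; only the module path and flags shown in the command above are used.

```python
def build_cli_command(model_path, image_file, seg_image_file=None,
                      depth_image_file=None, quantization=None):
    """Assemble the argument list for vcoder_llava.serve.cli.

    quantization is a hypothetical convenience switch: None, "4bit", or "8bit".
    """
    # The depth input is only accepted together with a segmentation input.
    if depth_image_file is not None and seg_image_file is None:
        raise ValueError("--depth-image-file requires --seg-image-file")
    if quantization not in (None, "4bit", "8bit"):
        raise ValueError("quantization must be None, '4bit', or '8bit'")

    cmd = ["python", "-m", "vcoder_llava.serve.cli",
           "--model-path", model_path,
           "--image-file", image_file]
    if seg_image_file is not None:
        cmd += ["--seg-image-file", seg_image_file]
    if depth_image_file is not None:
        cmd += ["--depth-image-file", depth_image_file]
    if quantization is not None:
        cmd.append(f"--load-{quantization}")
    return cmd
```

The returned list can be passed to `subprocess.run`, with `CUDA_VISIBLE_DEVICES` set in the environment to select the GPU.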

Getting Started

Please see Getting Started with VCoder for training and evaluation commands.

Results

Note that we do not fine-tune any parameters in the original LLaVA-1.5 models, so VCoder's performance on general question answering benchmarks is the same as LLaVA-1.5.

Benchmarking on COST

| Model | Semantic CS(↑)/HS(↓) | Instance CS(↑)/HS(↓) | Panoptic CS(↑)/HS(↓) | Depth DS(↓) | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| VCoder LLaVA-1.5-7b | 88.6/10.4 | 71.1/26.9 | 86.0/12.8 | - | HF Hub |
| VCoder LLaVA-1.5-13b | 89.0/10.0 | 73.3/25.0 | 87.2/11.6 | - | HF Hub |
| VCoder-DS LLaVA-1.5-7b | 87.8/11.5 | 69.9/28.5 | 86.8/12.4 | 65.9 | HF Hub |
| VCoder-DS LLaVA-1.5-13b | 88.5/10.9 | 71.7/26.3 | 88.5/10.8 | 63.3 | HF Hub |

We release the model responses used for benchmarking here.
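
For readers comparing checkpoints programmatically, the COST numbers above can be held in a small dictionary. This is only an illustrative sketch: the names `RESULTS`, `best_by_cs`, and `best_by_ds` are hypothetical, and the arrows in the table header are read as "higher is better" for CS and "lower is better" for HS and DS.

```python
# (CS, HS) per task; "depth" holds DS, or None where the table shows "-".
RESULTS = {
    "VCoder LLaVA-1.5-7b": {
        "semantic": (88.6, 10.4), "instance": (71.1, 26.9),
        "panoptic": (86.0, 12.8), "depth": None},
    "VCoder LLaVA-1.5-13b": {
        "semantic": (89.0, 10.0), "instance": (73.3, 25.0),
        "panoptic": (87.2, 11.6), "depth": None},
    "VCoder-DS LLaVA-1.5-7b": {
        "semantic": (87.8, 11.5), "instance": (69.9, 28.5),
        "panoptic": (86.8, 12.4), "depth": 65.9},
    "VCoder-DS LLaVA-1.5-13b": {
        "semantic": (88.5, 10.9), "instance": (71.7, 26.3),
        "panoptic": (88.5, 10.8), "depth": 63.3},
}


def best_by_cs(task):
    """Return the model with the highest CS on the given task."""
    return max(RESULTS, key=lambda model: RESULTS[model][task][0])


def best_by_ds():
    """Return the model with the lowest DS among those that report it."""
    scored = {m: r["depth"] for m, r in RESULTS.items() if r["depth"] is not None}
    return min(scored, key=scored.get)


if __name__ == "__main__":
    for task in ("semantic", "instance", "panoptic"):
        print(task, "->", best_by_cs(task))
    print("depth ->", best_by_ds())
```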

Citation

If you find VCoder useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

@article{jain2023vcoder,
    title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
    author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
    journal={arXiv},
    year={2023}
}

Acknowledgement

We thank the authors of LLaVA, OneFormer, and DINOv2 for open-sourcing their codebase and checkpoints. We are also grateful to the authors of CHAIR for releasing their synonym word mapping.