
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model





Multi-modal, multi-task LLM
Documentation | Chinese Documentation

Paper · Report Bug · Request Feature

🎉 News

Table of Contents
  1. About The Project
  2. Results
  3. Getting Started
  4. License
  5. Citation
  6. Acknowledgments

About The Project

Structure:

Examples


Demo is coming soon.

Features

Code

  • Epoch Quantitative Evaluation
    • Compute metrics
  • Mixed Datasets
    • Dataset scale specification (portion)
    • Text, Image-Text, Video-Text
  • DeepSpeed
  • LoRA

Task

  • Visual Understanding
    • Image Captioning
    • Video Captioning
    • Visual Question Answering (VQA)
  • Visual Segmentation
    • Referring Expression Segmentation (RES)
    • Salient Object Segmentation
    • Semantic Segmentation
  • Visual Grounding
    • Referring Expression Comprehension (REC)


Model Release

| Models | Images/Videos |
| --- | --- |
| u-LLaVA | uLLaVA Stage 2 |

Results

RES

REC

SALIENT

General MLLM

| Fine-tune | ScienceQA | MM-Bench | Seed-Bench |
| --- | --- | --- | --- |
| u-LLaVA-7B | 87.74 | soon | soon |

Video QA

| Zero-shot | Accuracy (Type 3) |
| --- | --- |
| Activity-QA | 51.70% |

Getting Started

Requirements

Run the following commands in terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

Why are these steps needed?

  1. Install the requirements: pip install -r ./shells/requirements.txt
  2. Build the custom CUDA ops for GroundingDINO: cd ./models/GroundingDINO && ./install.sh && cd ../..
     If this step is skipped, GroundingDINO may fall back to CPU mode and emit the warning: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!

Datasets

Annotation download links: u-LLaVA modified annotations, LLaVA pre-training annotations, and LLaVA fine-tuning annotations.

Image storage layout (download links can be found in the tables below):

image_root
├─ade20k
│  ├─annotations
│  └─images
├─coco2014
│  ├─test2014
│  ├─train2014
│  └─val2014
├─coco2017
│  ├─annotations
│  ├─train2017
│  └─val2017
├─cocostuff
│  ├─train2017
│  └─val2017
├─LLaVA-CC3M-Pretrain-595K
│  └─images
├─saiapr_tc-12
│  ├─00
│  └─01
└─vlpart
    ├─paco
    │  └─annotations
    └─pascal-part
        ├─Annotations_Part
        ├─examples
        └─VOCdevkit

Here, ade20k is extracted from ADEChallengeData2016.zip and cocostuff is extracted from stuffthingmaps_trainval2017.zip.
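Before training, it can help to confirm that the image root matches the layout above. The following is a minimal sketch using only the standard library; the helper name `find_missing_dirs` is hypothetical and not part of the u-LLaVA code base, and the directory list is copied from the tree shown above.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to image_root.
EXPECTED_DIRS = [
    "ade20k/annotations",
    "ade20k/images",
    "coco2014/test2014",
    "coco2014/train2014",
    "coco2014/val2014",
    "coco2017/annotations",
    "coco2017/train2017",
    "coco2017/val2017",
    "cocostuff/train2017",
    "cocostuff/val2017",
    "LLaVA-CC3M-Pretrain-595K/images",
    "saiapr_tc-12/00",
    "saiapr_tc-12/01",
    "vlpart/paco/annotations",
    "vlpart/pascal-part/Annotations_Part",
    "vlpart/pascal-part/examples",
    "vlpart/pascal-part/VOCdevkit",
]


def find_missing_dirs(image_root):
    """Return the expected sub-directories that do not exist under image_root."""
    root = Path(image_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

Calling `find_missing_dirs('/path_to_image_root')` returns an empty list when every directory from the tree is present; otherwise it lists the relative paths that still need to be downloaded or extracted.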

Stage I: Pre-training

| Dataset | Images/Videos | Annotations |
| --- | --- | --- |
| LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
| TGIF | TGIF - Quark Drive | tgif.json |

Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.

Stage II: Fine-tuning

| Dataset | Images | Annotations |
| --- | --- | --- |
| LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
| RefCOCO | coco2014 | refcoco_train.json |
| RefCOCOg | coco2014 | refcocog_train.json |
| RefCOCO+ | coco2014 | refcoco+_train.json |
| RefCLEF | saiapr_tc-12 | refclef_train.json |
| ADE20K | ade20k | ade20k.json |
| COCO Stuff | cocostuff | cocostuff.json |
| VOC2010 | voc2010 | pascal_part.json |
| PACO LVIS | paco | paco_lvis.json |
| Salient 15K | msra | ullava_salinet_15k.json |

Note: Please download the MSRA-10K and MSRA-B images from the official site. Thanks to the authors for sharing them.

Dataset config example

dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
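The `portion` field sets the dataset scale when several datasets are mixed. The sketch below illustrates one way such a setting could work, assuming `portion` is a fraction in (0, 1] applied by random subsampling. This is an assumption for illustration, not the repository's actual loader; `subsample_indices` and the dataset sizes are hypothetical.

```python
import random


def subsample_indices(num_samples, portion, seed=0):
    """Pick about num_samples * portion distinct sample indices, sorted."""
    if not 0.0 < portion <= 1.0:
        raise ValueError("portion must be in (0, 1]")
    # Keep at least one sample for a non-empty dataset, never more than exist.
    keep = min(num_samples, max(1, round(num_samples * portion)))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(num_samples), keep))


# Hypothetical dataset sizes and portions, keyed like the config above.
datasets = {
    "llava": {"num_samples": 1000, "portion": 1.0},
    "refcoco+": {"num_samples": 400, "portion": 0.5},
}

# Indices kept per dataset, and the size of the resulting mixture.
mixed = {
    name: subsample_indices(info["num_samples"], info["portion"])
    for name, info in datasets.items()
}
total = sum(len(indices) for indices in mixed.values())
```

With these made-up sizes, the mixture keeps all 1000 `llava` samples and 200 of the 400 `refcoco+` samples, so lowering `portion` for a large dataset is a simple way to rebalance the mix.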

Note:

  1. We reorganized most of the dataset annotations to simplify training, but all users must still comply with the terms of the original datasets.

Training

Stage I: Pre-training

  1. Prepare Open-Source LLaMA models
| Foundation model | Version | Path |
| --- | --- | --- |
| Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
| LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
| SAM | ViT-H | sam_vit_h_4b8939.pth |
| GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |

Note:

- LLaMA2 was trained with bf16, so Stage I training with fp16 may fail to converge.

- The default tokenizer.legacy of Llama-2 is False, which may raise a tokenization mismatch error with some conversation templates.

- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. Sorry about the mistake.

  2. Prepare datasets
  3. Set config in
configs/train/ullava_core_stage1.yaml

Note: set all dataset paths and the output path according to your experiments.

  4. Train Stage I with multiple GPUs

./shells/pretrain.sh

For a single GPU, run python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml' instead.

With bf16 on 4 A100 80G GPUs, Stage I takes about 6 hours per epoch. The trained model is saved to output_dir, for example './exp/ullava_core_7b'.
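The bf16 note above may be easier to appreciate with numbers. One commonly cited difference between the two formats is dynamic range: fp16 cannot represent finite values above 65504, while bf16 keeps the 8-bit exponent of fp32 at the cost of mantissa precision. Whether this is the cause of the convergence problem mentioned above is an assumption; the README does not say. The sketch uses only the standard library, and `to_bf16` and `fits_fp16` are hypothetical helpers written for this illustration.

```python
import struct


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    This truncates the mantissa; real hardware typically rounds to nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def fits_fp16(x):
    """Return True if x can be packed as a finite IEEE half-precision value."""
    try:
        struct.pack(">e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


value = 70000.0  # slightly above the largest finite fp16 value, 65504
print(fits_fp16(value))  # False: the value overflows fp16
print(to_bf16(value))    # 69632.0: coarser, but still finite in bf16
```

The same value that overflows in fp16 survives in bf16 with reduced precision, which is consistent with the recommendation to keep bf16 when starting from LLaMA2.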

Stage II: Fine-tuning

After Stage I training has finished, proceed to fine-tuning.

  1. Prepare datasets
  2. Set config in
configs/train/ullava_stage2_lora.yaml (with LoRA)
configs/train/ullava_stage2.yaml (without LoRA)
  3. Train Stage II with multiple GPUs
./shells/finetune.sh

For a single GPU, run python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml' instead.

Common Questions

Q1: Which conv_type is used in training?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'.

Q2: When is LoRA used?

A2: Stage I: LoRA is not used. Stage II: optional, depending on your devices.


Evaluation

Batch evaluation

  1. Set config
configs/eval/eval_res.yaml (for the RES task)
configs/eval/eval_rec.yaml (for the REC task)
configs/eval/eval_salient.yaml (for the salient segmentation task)
  2. Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for salient segmentation)
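To run all three evaluations in sequence, the commands above can be assembled programmatically. This is a minimal sketch, not a script shipped with the repository: the `EVAL_JOBS` table and `build_eval_command` are hypothetical names, while the script and config paths are copied from the commands listed above.

```python
import shlex

# Script and config per task, copied from the commands listed above.
EVAL_JOBS = {
    "res": ("evaluation/eval_ullava.py", "./configs/eval/eval_res.yaml"),
    "rec": ("evaluation/eval_ullava_grounding.py", "./configs/eval/eval_rec.yaml"),
    "salient": ("evaluation/eval_ullava.py", "./configs/eval/eval_salient.yaml"),
}


def build_eval_command(task):
    """Return the argv list for one evaluation task."""
    script, cfg_path = EVAL_JOBS[task]
    return ["python", script, "--cfg_path", cfg_path]


# Print each command as a shell-safe string for inspection.
for task in EVAL_JOBS:
    print(shlex.join(build_eval_command(task)))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the evaluations one after another.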


Qualitative inference

Modify the parser in evaluation/inference_ullava_core.py and evaluation/inference_ullava.py for Stage I and Stage II, respectively.

python evaluation/eval_ullava.py
python evaluation/eval_ullava_grounding.py 


License

Distributed under the Apache License 2.0. See LICENSE for more information.


Citation

@inproceedings{xu2024ullava,
  title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
  author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
  booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
  year={2024}
}


TODO

  • Visual Segmentation
    • Instance Segmentation


Acknowledgments

We sincerely thank the open-source community for their contributions. This work is sponsored by the Shanghai Pujiang Program (23PJ1421800).


See the open issues for a full list of proposed features (and known issues).
