
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model




Multi-modal, multi-task LLM


Table of Contents
  1. About The Project
  2. Getting Started
  3. License
  4. Citation
  5. Acknowledgments

About The Project




Demo is coming soon.



  • Epoch Quantitative Evaluation
    • Compute metrics
  • Mixed Datasets
    • Dataset scale specification (portion)
    • Text, Image-Text, Video-Text
  • DeepSpeed
  • LoRA


  • Visual Understanding
    • Image Captioning
    • Video Captioning
    • Visual Question Answering (VQA)
  • Visual Segmentation
    • Referring Expression Segmentation (RES)
    • Salient Object Segmentation
    • Semantic Segmentation
  • Visual Grounding
    • Referring Expression Comprehension (REC)


Model Release

| Models | Images/Videos |
|---|---|
| u-LLaVA | uLLaVA Stage 2 |

Getting Started


Run the following commands in a terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

Why these steps?

  1. Install the requirements: pip install -r ./shells/requirements.txt
  2. Build the CUDA ops for GroundingDINO: cd ./models/GroundingDINO && ./install.sh && cd ../.. If this step is skipped, GroundingDINO may emit the following warning and fall back to CPU: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
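
To check whether the build step took effect before launching training, a small probe along the following lines may help. It is a minimal sketch: the module name `groundingdino._C` is an assumption based on the upstream GroundingDINO package layout and may differ for the copy vendored under ./models/GroundingDINO, and `custom_ops_available` is a hypothetical helper, not part of this repository.

```python
import importlib


def custom_ops_available(module_name: str = "groundingdino._C") -> bool:
    """Return True if the compiled extension module can be imported.

    The default module name is an assumption taken from upstream
    GroundingDINO; adjust it to match the vendored package if needed.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing package and an extension that was never built.
        return False
    return True


if __name__ == "__main__":
    if custom_ops_available():
        print("Custom C++/CUDA ops found.")
    else:
        print("Custom ops missing: rerun ./install.sh in ./models/GroundingDINO.")
```

If the probe reports the ops as missing, rerunning the install script with a CUDA toolkit visible to the compiler is the first thing to try.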


Annotation download links: ullava modified annotations, LLaVA pretrain annotations, and LLaVA finetuning annotations

Image storage layout (download links are listed in the dataset tables below):

│  ├─annotations
│  └─images
│  ├─test2014
│  ├─train2014
│  └─val2014
│  ├─annotations
│  ├─train2017
│  └─val2017
│  ├─train2017
│  └─val2017
│  └─images
│  ├─00
│  └─01
    │  └─annotations

where ade20k is extracted from ADEChallengeData2016.zip and cocostuff is extracted from stuffthingmaps_trainval2017.zip, respectively.
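
Before training, it can be useful to confirm that the expected image folders are in place. The sketch below is an illustration only: the folder names are taken from the "Images" column of the Stage II dataset table in this README, their placement directly under a single image root is an assumption, and `missing_image_dirs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List

# Folder names from the "Images" column of the Stage II dataset table.
# Their placement directly under image_root is an assumption.
EXPECTED_DIRS = ("ade20k", "coco2014", "coco2017", "cocostuff", "saiapr_tc-12")


def missing_image_dirs(image_root: str,
                       expected: Iterable[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected folder names that do not exist under image_root."""
    root = Path(image_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # '/path_to_image_root' is the placeholder used in the config example.
    for name in missing_image_dirs("/path_to_image_root"):
        print(f"missing image folder: {name}")
```

An empty result means every listed folder was found; it does not verify the contents of each folder.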

Stage I: Pre-training

| Dataset | Images/Videos | Annotations |
|---|---|---|
| LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
| TGIF | TGIF - Quark Drive | tgif.json |

Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.

Stage II: Fine-tuning

| Dataset | Images | Annotations |
|---|---|---|
| LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
| RefCOCO | coco2014 | refcoco_train.json |
| RefCOCOg | coco2014 | refcocog_train.json |
| RefCOCO+ | coco2014 | refcoco+_train.json |
| RefCLEF | saiapr_tc-12 | refclef_train.json |
| ADE20K | ade20k | ade20k.json |
| COCO Stuff | cocostuff | cocostuff.json |
| VOC2010 | voc2010 | pascal_part.json |
| PACO LVIS | paco | paco_lvis.json |
| Salient 15K | coming soon | coming soon |

Dataset config example

    data_type: 'image'
    image_token_len: 256
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

    data_type: 'image'
    image_token_len: 256
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
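
The `portion` field in the config reads as the fraction of a dataset to use when mixing sources. How the repository applies it is not shown here; the following is only a sketch of one plausible interpretation (random subsampling with a fixed seed), and `take_portion` is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_portion(samples: Sequence[T], portion: float, seed: int = 0) -> List[T]:
    """Return a reproducible random subset holding `portion` of the samples.

    This is an assumed interpretation of `portion`, not the project's code.
    """
    if not 0.0 < portion <= 1.0:
        raise ValueError("portion must be in (0, 1]")
    # Keep at least one sample for non-empty inputs.
    keep = max(1, int(len(samples) * portion)) if samples else 0
    rng = random.Random(seed)
    return rng.sample(list(samples), keep)


if __name__ == "__main__":
    annotations = list(range(1000))
    subset = take_portion(annotations, portion=0.1)
    print(len(subset), "of", len(annotations), "samples kept")
```

Under this reading, `portion: 1.0` keeps every sample, while a smaller value down-weights a large dataset relative to the others in the mix.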


  1. We have reorganized most of the dataset annotations for easier training, but everyone must still follow the terms of the original datasets.


Stage I: Pre-training

  1. Prepare Open-Source LLaMA models
| Foundation model | Version | Path |
|---|---|---|
| Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
| LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
| SAM | ViT-H | sam_vit_h_4b8939.pth |
| GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |


- LLaMA2 is trained with bf16; convergence errors may occur if Stage I is trained with fp16.

- The default tokenizer.legacy of Llama-2 is False, which may raise a tokenization mismatch error with some conversation templates.

- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. We apologize for the mistake.

  2. Prepare datasets
  3. Set config in

Note: set all dataset paths and the output path according to your experiments.

  4. Train Stage I with multiple GPUs


or python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml' for 1 GPU.

With bf16 on 4 A100 80G GPUs, Stage I takes about 6 hours per epoch. The trained model is then saved to the output_dir, for example './exp/ullava_core_7b'.
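
On the bf16 note above: one commonly cited reason fp16 can be less stable than bf16 is its narrower dynamic range (the largest finite fp16 value is 65504, while bf16 keeps the 8-bit exponent of fp32). Whether this is the cause of the convergence issue reported here is an assumption. The sketch below emulates both formats with the standard library only; `to_fp16` and `to_bf16` are hypothetical helpers, and `to_bf16` truncates rather than rounds, which is a simplification.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the fp16 range.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (1.0, 65504.0, 70000.0):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

Values just above 65504 already overflow in the fp16 emulation, whereas the bf16 emulation keeps them finite at reduced precision.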

Stage II: Fine-tuning

After Stage I training has finished, proceed to the next step, fine-tuning.

  1. Prepare datasets
  2. Set config in
configs/train/ullava_stage2_lora.yaml (for LoRA)
configs/train/ullava_stage2.yaml (for non-LoRA)
  3. Train Stage II with multiple GPUs

or python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml' for 1 GPU.

Common Questions

Q1: Which conv_type is used in training?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'.

Q2: When is LoRA used?

A2: Stage I: LoRA is not used. Stage II: it is optional, depending on your devices.



Batch evaluation

  1. Set config
configs/eval/eval_res.yaml (for the RES task)
configs/eval/eval_rec.yaml (for the REC task)
configs/eval/eval_salient.yaml (for the salient segmentation task)
  2. Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for salient segmentation)
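
To run the three evaluations in sequence, the commands above can be wrapped in a small driver. This is a sketch, not part of the repository: `EVAL_JOBS`, `build_eval_command`, and `run_all` are hypothetical names, and the script and config pairs are copied from the commands listed above.

```python
import subprocess
from typing import List

# Script and config pairs copied from the commands in this section.
EVAL_JOBS = {
    "res": ("evaluation/eval_ullava.py", "./configs/eval/eval_res.yaml"),
    "rec": ("evaluation/eval_ullava_grounding.py", "./configs/eval/eval_rec.yaml"),
    "salient": ("evaluation/eval_ullava.py", "./configs/eval/eval_salient.yaml"),
}


def build_eval_command(task: str) -> List[str]:
    """Return the argument list for one evaluation task."""
    script, cfg = EVAL_JOBS[task]
    return ["python", script, "--cfg_path", cfg]


def run_all(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for task in EVAL_JOBS:
        cmd = build_eval_command(task)
        print(task, "->", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the driver only prints the commands, which makes it easy to confirm the config paths before committing GPU time.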


Qualitative inference

Modify the argument parser in evaluation/inference_ullava_core.py (Stage I) or evaluation/inference_ullava.py (Stage II), then run the corresponding script.

python evaluation/inference_ullava_core.py
python evaluation/inference_ullava.py



Distributed under the Apache License 2.0. See LICENSE for more information.



  • Visual Segmentation
    • Instance Segmentation



We sincerely thank the open source community for their contributions.
