Obsidian is a joint work between Nous Research and Virtual Interactive. Special thanks to LDJ and qnguyen3 for making this work possible.
Easiest way to try it out: Colab. After opening the Gradio link, give the model about 2 minutes to load, then refresh the Gradio page.
- Install Obsidian
- Clone this project and navigate to the Obsidian folder
git clone https://github.com/NousResearch/Obsidian.git
cd Obsidian
- Download the multimodal projector from Huggingface
sh script/download_mm_projector.sh
- Install packages
conda create -n obsidian python=3.10 -y
conda activate obsidian
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases (required)
pip install ninja
pip install flash-attn --no-build-isolation
- Install transformers, pinned to version 4.34.0
pip install --upgrade transformers==4.34.0
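Since the steps above pin transformers to 4.34.0, it can help to confirm what actually ended up in the obsidian environment. The following is a minimal sketch and not part of the repository: the helper name find_mismatches and the REQUIRED table are illustrative, and only the transformers pin is taken from the instructions above.

```python
from importlib import metadata

# Only this pin comes from the install steps; extend the table at your own discretion.
REQUIRED = {"transformers": "4.34.0"}


def find_mismatches(required, installed_lookup):
    """Return {package: (wanted, found)} for packages whose version differs.

    installed_lookup is any callable that maps a package name to a version
    string, or raises metadata.PackageNotFoundError when it is not installed.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = installed_lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(REQUIRED, metadata.version)
    if not problems:
        print("environment matches the pinned versions")
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run it inside the activated conda environment; a mismatch usually means a later pip install pulled in a different transformers release.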
- Run the Demo UI
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.
Next, launch a model worker. This is the process that actually performs inference on the GPU. Each worker is responsible for a single model, specified with --model-path.
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path NousResearch/Obsidian-3B-V0.5
Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
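The three processes above are tied together by the controller URL and the worker port. As a rough illustration, the sketch below assembles the same three commands from a single place so the addresses stay consistent. build_launch_commands is a hypothetical helper, not something shipped with the repository, and it only prints the commands rather than starting them.

```python
import shlex


def build_launch_commands(model_path, controller_port=10000,
                          worker_port=40000, host="0.0.0.0"):
    """Return the controller, web server, and worker commands as argv lists.

    The flags mirror the commands shown above; the defaults are the ports
    used in this README.
    """
    controller_url = f"http://localhost:{controller_port}"
    worker_url = f"http://localhost:{worker_port}"
    controller = ["python", "-m", "llava.serve.controller",
                  "--host", host, "--port", str(controller_port)]
    web_server = ["python", "-m", "llava.serve.gradio_web_server",
                  "--controller", controller_url,
                  "--model-list-mode", "reload"]
    worker = ["python", "-m", "llava.serve.model_worker",
              "--host", host, "--controller", controller_url,
              "--port", str(worker_port), "--worker", worker_url,
              "--model-path", model_path]
    # Order matters: the controller must be up before the other two connect.
    return [controller, web_server, worker]


if __name__ == "__main__":
    for argv in build_launch_commands("NousResearch/Obsidian-3B-V0.5"):
        print(shlex.join(argv))
```

Each command is meant to run in its own terminal (or background process), started in the order printed.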
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.
Pretraining takes around 2.5 hours for Obsidian-3B-V0.5 on 4x A100 (80G), at 336px for the vision module.
Training script with DeepSpeed ZeRO-2: pretrain.sh.
- --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
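To make the two options above more concrete, here is a small dependency-free sketch of what a "two-layer MLP with GELU" connector computes for one patch feature, plus the patch-grid arithmetic for a 336px image with 14px patches. It is an illustration only: the real projector is a PyTorch module inside the training code, and the function names, toy dimensions, and random weights below are assumptions made for this example.

```python
import math
import random


def num_patches(image_size, patch_size):
    """Number of square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Dense layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def mlp2x_gelu(vec, w1, b1, w2, b2):
    """Linear -> GELU -> Linear, applied to a single feature vector."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


def random_layer(rng, out_dim, in_dim):
    """Toy initialisation, only so the example has something to run."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


if __name__ == "__main__":
    print(num_patches(336, 14))  # 24 * 24 = 576

    rng = random.Random(0)
    vision_dim, llm_dim = 8, 4  # toy sizes, far smaller than the real model
    w1, b1 = random_layer(rng, llm_dim, vision_dim)
    w2, b2 = random_layer(rng, llm_dim, llm_dim)
    patch_feature = [rng.uniform(-1.0, 1.0) for _ in range(vision_dim)]
    projected = mlp2x_gelu(patch_feature, w1, b1, w2, b2)
    print(len(projected))  # same width as the (toy) language model
```

The point of the connector is the change of width: each vision feature enters at the vision tower's dimension and leaves at the language model's dimension, so the patch features can be consumed alongside text embeddings.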
Please download the annotation file for the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
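Before starting a long training run, it may be worth checking that the layout above is actually in place. The sketch below is one way to do that. The directory names are copied from the tree above, while the helper name missing_dirs is made up for this example and does not exist in the repository.

```python
from pathlib import Path

# Sub-directories expected under ./playground/data, taken from the tree above.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("./playground/data")
    if missing:
        print("missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("data layout looks complete")
```

This only checks that the folders exist; it does not verify that every image referenced by llava_v1_5_mix665k.json was downloaded.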
Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
- Generate LLaVA responses
python model_vqa.py \
--model-path ./checkpoints/LLaVA-13B-v0 \
--question-file playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
--image-folder /path/to/coco2014_val \
--answers-file /path/to/answer-file-our.jsonl
- Evaluate the generated responses. In our case, answer-file-ref.jsonl is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_gpt_review_visual.py \
--question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
--context llava/eval/table/caps_boxes_coco2014_val_80.jsonl \
--answer-list \
/path/to/answer-file-ref.jsonl \
/path/to/answer-file-our.jsonl \
--rule llava/eval/table/rule.json \
--output /path/to/review.json
- Summarize the evaluation results
python summarize_gpt_review.py
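To give a sense of what the summary step reports, the sketch below averages paired scores per category and expresses the model's score relative to the reference. It is not the repository's summarize_gpt_review.py. It assumes the review file holds one JSON object per line, each with a "category" string and a "tuple" of two scores (reference first, model second). Those field names, the "uncategorized" fallback, and the sample records are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def summarize_reviews(lines):
    """Average (reference, model) score pairs per category.

    Assumes each non-empty line is a JSON object with a "tuple" field holding
    [reference_score, model_score] and an optional "category" field.
    """
    scores = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        pair = review["tuple"]
        scores["all"].append(pair)
        scores[review.get("category", "uncategorized")].append(pair)

    summary = {}
    for category, pairs in scores.items():
        ref_mean = sum(p[0] for p in pairs) / len(pairs)
        model_mean = sum(p[1] for p in pairs) / len(pairs)
        # Relative score: model average as a percentage of the reference average.
        relative = round(model_mean / ref_mean * 100, 1) if ref_mean else None
        summary[category] = {
            "reference": round(ref_mean, 2),
            "model": round(model_mean, 2),
            "relative_percent": relative,
        }
    return summary


if __name__ == "__main__":
    # Made-up records, only to show the expected shape of the input.
    sample = [
        '{"category": "conv", "tuple": [8, 6]}',
        '{"category": "conv", "tuple": [10, 9]}',
        '{"category": "detail", "tuple": [5, 5]}',
    ]
    for category, stats in sorted(summarize_reviews(sample).items()):
        print(category, stats)
```

If the real review file uses different field names, only the two dictionary lookups inside the loop need to change.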
Please check out the documentation here.
- ORIGINAL PAPER and LINKS: Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
[Project Page] [Demo] [Data] [Model Zoo]
Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)