This is the official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".
We are working hard to update this repo and paper, stay tuned!
We use LAION-2M for pretraining. The finetuning dataset is based on Bunny_695k. Please download the images in Bunny_695k first, and then download the SpatialQA high-level images (available soon).
The pretraining data JSON file can be found in LAION-2M. The SpatialQA finetuning JSON file will be available soon.
We recommend using depth information from sensors whenever possible. Otherwise, follow the instructions to prepare estimated depth for your own RGB images.
SpatialBot is a multi-image version of Bunny.
If you've installed Bunny, just replace its code with ours and reinstall the bunny package.
You can start from a docker or configure local environments.
We provide a ready-to-run environment. Just update it with our code:
# 1. download docker image
docker pull russellrobin/bunny:latest
# 2. run container
# docker run -itd ...
# docker exec -it ...
# 3. upgrade transformers and the bunny package
cd SpatialBot && pip install --upgrade transformers && pip uninstall -y bunny && pip install -e .
Follow the instructions here, but use the code from this repo.
Download the base LLM and vision tower weights first. To pretrain the model:
sh script/train/pretrain.sh
To finetune SpatialBot with LoRA:
sh script/train/finetune_lora.sh
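Before launching, the training variables are typically edited to point at your own models and output directory. A hypothetical configuration might look like the following; all paths here are illustrative examples, not defaults shipped with the repo:

```shell
# Hypothetical example values for the finetuning variables.
# None of these paths are shipped defaults -- adjust them to your setup.
MODEL_TYPE=phi-2                                      # base LLM type
PRETRAIN_DIR=./checkpoints/spatialbot-phi-2-pretrain  # output of the pretraining step
OUTPUT_DIR=./checkpoints/spatialbot-phi-2-lora        # where the finetuned model is saved
echo "finetune: $MODEL_TYPE ($PRETRAIN_DIR -> $OUTPUT_DIR)"
```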
Parameters:
- `MODEL_TYPE`: base LLM type; we support phi-2, phi-3, qwen1.5-0.5b, qwen1.5-1.8b (4B), and llama3-8b.
- `PRETRAIN_DIR`: path to the pretrained model.
- `OUTPUT_DIR`: path to save the model.
- `--model_name_or_path`: path to the base LLM.
- `--vision_tower`: path to the vision encoder. We support CLIP, SigLIP, and EVA-CLIP.
- `--version`: use `bunny` for Phi-2 and QWen. For Phi-3/Llama3, please use `phi3`/`llama`.
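Since the `--version` value follows directly from `MODEL_TYPE`, the rule above can be sketched as a small shell mapping (illustrative only, not a script shipped with the repo):

```shell
# Sketch: derive the --version flag from MODEL_TYPE, per the rule above.
MODEL_TYPE=phi-2   # example; any supported base LLM type
case "$MODEL_TYPE" in
  phi-2|qwen1.5-*) VERSION=bunny ;;   # Phi-2 and QWen use the bunny template
  phi-3)           VERSION=phi3  ;;
  llama3-8b)       VERSION=llama ;;
esac
echo "--version $VERSION"
```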
Our pretrained model can be found here. The finetuned SpatialBot will be available soon!
Follow our instructions to prepare data and evaluate SpatialBot on SpatialBench and general VLM benchmarks.
If you find this repository helpful, please cite our paper.
@inproceedings{Cai2024SpatialBotPS,
title={SpatialBot: Precise Spatial Understanding with Vision Language Models},
author={Wenxiao Cai and Yaroslav Ponomarenko and Jianhao Yuan and Xiaoqi Li and Wankou Yang and Hao Dong and Bo Zhao},
year={2024},
url={https://api.semanticscholar.org/CorpusID:270619467}
}
The project employs specific datasets and checkpoints that are governed by their original licenses. Users must adhere to all terms and conditions outlined in these licenses. The checkpoints are restricted to uses that comply with the license agreements of Bunny, LLaMA 3, Phi-2, Phi-3, QWen-1.5, and GPT-4. The dataset is provided under the CC-BY-4.0 license.
- The training of this work is built upon Bunny: a family of lightweight multimodal models.
- This work utilizes LLMs from Phi-2, Phi-3, QWen-1.5-0.5B, QWen-1.5-4B, and Llama-3-8B.