
SpatialBot

This is the official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".

We are working hard to update this repo and paper; stay tuned!

📊 SpatialQA Dataset

We use LAION-2M for pretraining. The finetuning dataset is based on Bunny_695k. Please download the images in Bunny_695k first, and then download the SpatialQA high-level images (available soon).

Data JSON

The pretraining data JSON file can be found in LAION-2M. The SpatialQA finetuning JSON file will be available soon.
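
Since Bunny_695k follows the LLaVA-style conversation format, a finetuning record presumably looks like the sketch below. This is an assumption for illustration only; the image path and question-answer text are made up, so check the released JSON for the exact schema.

# Hypothetical finetuning record, assuming the Bunny/LLaVA conversation
# schema; all field values below are illustrative, not from the dataset.
sample = {
    "id": "000000001",
    "image": "path/to/rgb_image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nHow far is the mug from the camera?"},
        {"from": "gpt", "value": "The mug is about 0.8 m from the camera."},
    ],
}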

Prepare your own RGB-D data

We recommend using depth information from sensors whenever possible. Otherwise, follow the instructions to prepare estimated depth maps for your own RGB images, as in the sketch below.
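
As a minimal sketch of the estimation route, the snippet below runs a monocular depth model through the Hugging Face transformers pipeline. The model choice and output format here are assumptions, not the repo's exact recipe; follow the official instructions for the format SpatialBot expects.

# Estimate a depth map for an RGB image with a monocular depth model.
# The checkpoint below is one public option, not necessarily the one
# used by SpatialBot; the file names are placeholders.
from PIL import Image
from transformers import pipeline

estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

image = Image.open("my_rgb_image.jpg")
depth = estimator(image)["depth"]     # PIL image with per-pixel depth
depth.save("my_rgb_image_depth.png")  # keep it next to the RGB image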

🤖 SpatialBot Installation

SpatialBot is a multi-image version of Bunny. If you have already installed Bunny, just replace its code with ours and reinstall the bunny package. You can start from a Docker image or configure a local environment.

Start from Docker

We provide a ready-to-run environment. Just update it with our code:

# 1. download docker image
docker pull russellrobin/bunny:latest

# 2. run container
# docker run -itd ...
# docker exec -it ...

# 3. upgrade transformers and bunny package
cd SpatialBot && pip install --upgrade transformers && pip uninstall -y bunny && pip install -e .

Local Installation

Follow the instructions here, but use the code from this repo.

๐Ÿ‹ SpatialBot Training

Download the base LLM and vision tower weights first. To pretrain the model:

sh script/train/pretrain.sh

To finetune SpatialBot with LoRA:

sh script/train/finetune_lora.sh

Parameters:

MODEL_TYPE: base LLM type. We support phi-2, phi-3, qwen1.5-0.5b, qwen1.5-1.8b (4B), and llama3-8b.

PRETRAIN_DIR: path to a pretrained model.

OUTPUT_DIR: path to save the model.

--model_name_or_path: path to the base LLM.

--vision_tower: path to the vision encoder. We support CLIP, SigLIP, and EVA-CLIP.

--version: for Phi-2 and QWen, use bunny; for Phi-3 and Llama3, use phi3 and llama respectively.

Our pretrained model can be found here. The finetuned SpatialBot will be available soon!

📃 SpatialBot Evaluation

Follow our instructions to prepare data and evaluate SpatialBot on SpatialBench and general VLM benchmarks.
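
Before running the full benchmarks, you may want to smoke-test a checkpoint. The sketch below assumes SpatialBot keeps Bunny's Hugging Face quickstart interface (a trust_remote_code model exposing process_images, with -200 as the image placeholder token id); the checkpoint path, prompt template, and question are placeholders, and the multi-image RGB-D prompt format may differ, so treat this as illustrative only.

# Minimal smoke test, assuming a Bunny-style Hugging Face interface.
# "path/to/spatialbot-checkpoint" is a placeholder, not a released model id.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "path/to/spatialbot-checkpoint",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/spatialbot-checkpoint", trust_remote_code=True
)

prompt = "How far is the chair from the camera?"
text = f"USER: <image>\n{prompt} ASSISTANT:"  # simplified prompt template
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image>")]
# -200 marks where image features are spliced in (Bunny convention, assumed).
input_ids = torch.tensor(
    chunks[0] + [-200] + chunks[1], dtype=torch.long
).unsqueeze(0).to(model.device)

image = Image.open("example_rgb.jpg")
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=64)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True))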

🔗 Usage

If you find this repository helpful, please cite our paper.

@inproceedings{Cai2024SpatialBotPS,
  title={SpatialBot: Precise Spatial Understanding with Vision Language Models},
  author={Wenxiao Cai and Yaroslav Ponomarenko and Jianhao Yuan and Xiaoqi Li and Wankou Yang and Hao Dong and Bo Zhao},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:270619467}
}

🧾 License

The project employs specific datasets and checkpoints that are governed by their original licenses. Users must adhere to all terms and conditions outlined in these licenses. The checkpoints are restricted to uses that comply with the license agreements of Bunny, LLaMA 3, Phi-2, Phi-3, QWen-1.5, and GPT-4. The dataset is provided under the CC-BY-4.0 license.

📫 Acknowledgement