News | Introduction | Preparation | Training | Demo | Acknowledgement | Statement
- [Jul 15 2024]: We updated our paper on arXiv.
- [Jul 12 2024]: We posted the part missing from our paper (some observations, considerations, and lessons) on Zhihu (in Chinese; please contact us if you need an English version).
- [Jul 09 2024]: We have released our evaluation benchmark LHRS-Bench.
- [Jul 02 2024]: Our paper has been accepted by ECCV 2024! We have open-sourced our training scripts and training data. Please follow the training instructions and data preparation guide below.
- [Feb 21 2024]: We have updated our evaluation code. Any advice is welcome!
- [Feb 7 2024]: Model weights are now available on both Google Drive and Baidu Disk.
- [Feb 6 2024]: Our paper is now available on arXiv.
- [Feb 2 2024]: We are excited to announce the release of our code and model checkpoint! Our dataset and training recipe will be updated soon!
We are excited to introduce LHRS-Bot, a multimodal large language model (MLLM) that leverages globally available volunteer geographic information (VGI) and remote sensing (RS) images. LHRS-Bot demonstrates a deep understanding of RS imagery and possesses the capability for sophisticated reasoning within the RS domain. In this repository, we will release our code, training framework, model weights, and dataset!
- Clone this repository.

  ```bash
  git clone git@github.com:NJU-LHRS/LHRS-Bot.git
  cd LHRS-Bot
  ```

- Create a new virtual environment.

  ```bash
  conda create -n lhrs python=3.10
  conda activate lhrs
  ```

- Install dependencies and our package.

  ```bash
  pip install -e .
  ```
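After installation, a quick sanity check can confirm the environment is usable. This is a minimal sketch, assuming PyTorch is among the installed dependencies and that a CUDA GPU is the intended accelerator:

```bash
# Optional sanity check: confirm the environment activates and a GPU is visible
# (assumes PyTorch is installed as one of the dependencies)
conda activate lhrs
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```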
- LLaMA2-7B-Chat

  - Automatically download:

    Our framework is designed to automatically download the checkpoint when you initiate training or run a demo. However, there are a few preparatory steps you need to complete:

    1. Request the LLaMA2-7B-Chat models from Hugging Face.

    2. After your request has been processed, log in to Hugging Face using your personal access token:

       `huggingface-cli login` (then paste your access token and press Enter)

    3. Done!
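    If you want to confirm that the login succeeded before launching training or a demo, a small check like the following should print your username (assuming a reasonably recent `huggingface_hub` CLI):

    ```bash
    # Should print your Hugging Face username if the token was saved correctly
    huggingface-cli whoami
    ```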
  - Manually download:

    1. Download all the files from HuggingFace (a command-line sketch for this step follows below).

    2. Change the following line of each file to your downloaded directory:

       - /Config/multi_modal_stage{1, 2, 3}.yaml

         ```yaml
         ...
         text:
           ...
           path: ""  # TODO: Direct to your directory
         ...
         ```

       - /Config/multi_modal_eval.yaml

         ```yaml
         ...
         text:
           ...
           path: ""  # TODO: Direct to your directory
         ...
         ```
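    As a rough sketch of the manual route (the repo id `meta-llama/Llama-2-7b-chat-hf` and the target directory are assumptions; it also requires an approved access request and a recent `huggingface_hub`):

    ```bash
    # Fetch the gated LLaMA2-7B-Chat weights to a local folder
    # (repo id and target directory are assumptions; requires an approved access request)
    huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./Llama-2-7b-chat-hf
    ```

    Then point `path` in the YAML files above to that directory.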
- LHRS-Bot Checkpoints:

  | Stage1 | Stage2 | Stage3 |
  | :---: | :---: | :---: |
  | Baidu Disk, Google Drive | Baidu Disk, Google Drive | Baidu Disk, Google Drive |
  ⚠️ Ensure that the `TextLoRA` folder is located in the same directory as `FINAL.pt`. The name `TextLoRA` should remain unchanged. Our framework will automatically detect the perceiver checkpoint version and, if possible, load and merge the LoRA module.
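  For reference, a sketch of the expected layout, assuming the checkpoint and the LoRA folder were downloaded into one directory (`${CheckpointDir}` is a placeholder):

  ```bash
  # Expected layout: TextLoRA sits next to FINAL.pt inside the same directory
  ls ${CheckpointDir}
  # FINAL.pt  TextLoRA
  ```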
- Development Checkpoint:

  We will continually update our model with advanced techniques. If you're interested, feel free to download it and have fun :)

  | Development |
  | :---: |
  | Baidu Disk, Google Drive |
- Prepare and reformat your data following the instructions from here.
- Stage1

  1. Fill the `OUTPUT_DIR` and `DATA_DIR` of the script (see the sketch after this list).
  2. `cd Script; bash train_stage1.sh`
- Stage2

  1. Fill the `OUTPUT_DIR` and `DATA_DIR` of the script.
  2. Fill the `MODEL_PATH` for loading the stage1 checkpoint.
  3. `cd Script; bash train_stage2.sh`
- Stage3 is the same as Stage2 except for a different folder and script (here).
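As a minimal sketch of what filling in these variables might look like (the variable names are taken from the steps above; the paths are placeholders, and the exact layout of `Script/train_stage*.sh` may differ):

```bash
# Placeholder values; replace with your own paths before launching training
OUTPUT_DIR=/path/to/output_dir          # where logs and checkpoints are written
DATA_DIR=/path/to/reformatted_data      # the prepared training data for this stage
MODEL_PATH=/path/to/stage1/FINAL.pt     # stage2/3 only: checkpoint from the previous stage
```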
- Online Web UI demo with gradio:

  ```bash
  python lhrs_webui.py \
      -c Config/multi_modal_eval.yaml \            # config file
      --checkpoint-path ${PathToCheckpoint}.pt \   # path to checkpoint end with .pt
      --server-port 8000 \                         # change if you need
      --server-name 127.0.0.1 \                    # change if you need
      --share                                      # if you want to share with other
  ```
- Command line demo:

  ```bash
  python cli_qa.py \
      -c Config/multi_modal_eval.yaml \            # config file
      --model-path ${PathToCheckpoint}.pt \        # path to checkpoint end with .pt
      --image-file ${TheImagePathYouWantToChat} \  # path to image file (Only Single Image File is supported)
      --accelerator "gpu" \                        # change if you need ["mps", "cpu", "gpu"]
      --temperature 0.4 \
      --max-new-tokens 512
  ```
- Inference:

  - Classification

    ```bash
    python main_cls.py \
        -c Config/multi_modal_eval.yaml \        # config file
        --model-path ${PathToCheckpoint}.pt \    # path to checkpoint end with .pt
        --data-path ${ImageFolder} \             # path to classification image folder
        --accelerator "gpu" \                    # change if you need ["mps", "cpu", "gpu"]
        --workers 4 \
        --enabl-amp True \
        --output ${YourOutputDir} \              # Path to output (result, metric etc.)
        --batch-size 8
    ```
  - Visual Grounding

    ```bash
    python main_vg.py \
        -c Config/multi_modal_eval.yaml \        # config file
        --model-path ${PathToCheckpoint}.pt \    # path to checkpoint end with .pt
        --data-path ${ImageFolder} \             # path to image folder
        --accelerator "gpu" \                    # change if you need ["mps", "cpu", "gpu"]
        --workers 2 \
        --enabl-amp True \
        --output ${YourOutputDir} \              # Path to output (result, metric etc.)
        --batch-size 1 \                         # It's better to use batchsize 1, since we find batch inference
        --data-target ${ParsedLabelJsonPath}     # is not stable.
    ```
  - Visual Question Answering

    ```bash
    python main_vqa.py \
        -c Config/multi_modal_eval.yaml \        # config file
        --model-path ${PathToCheckpoint}.pt \    # path to checkpoint end with .pt
        --data-path ${Image} \                   # path to image folder
        --accelerator "gpu" \                    # change if you need ["mps", "cpu", "gpu"]
        --workers 2 \
        --enabl-amp True \
        --output ${YourOutputDir} \              # Path to output (result, metric etc.)
        --batch-size 1 \                         # It's better to use batchsize 1, since we find batch inference
        --data-target ${ParsedLabelJsonPath} \   # is not stable.
        --data-type "HR"                         # choose from ["HR", "LR"]
    ```
If you find our work useful, please give us a 🌟 on GitHub and consider citing our paper:
```bibtex
@misc{2402.02544,
    Author = {Dilxat Muhtar and Zhenshi Li and Feng Gu and Xueliang Zhang and Pengfeng Xiao},
    Title = {LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model},
    Year = {2024},
    Eprint = {arXiv:2402.02544},
}
```
Licence: Apache