LHRS-Bot

VGI-Enhanced multimodal large language model for remote sensing images.


LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Dilxat Muhtar*, Zhenshi Li*, Feng Gu, Xueliang Zhang, and Pengfeng Xiao
(*Equal Contribution)

News | Introduction | Preparation | Demo | Acknowledgement | Statement

News

  • [Feb 21 2024]: We have updated our evaluation code. Any advice is welcome!
  • [Feb 7 2024]: Model weights are now available on both Google Drive and Baidu Disk.
  • [Feb 6 2024]: Our paper is now available on arXiv.
  • [Feb 2 2024]: We are excited to announce the release of our code and model checkpoint! Our dataset and training recipe will be updated soon!

Introduction

We are excited to introduce LHRS-Bot, a multimodal large language model (MLLM) that leverages globally available volunteered geographic information (VGI) and remote sensing (RS) images. LHRS-Bot demonstrates a deep understanding of RS imagery and is capable of sophisticated reasoning within the RS domain. In this repository, we will release our code, training framework, model weights, and dataset!

Preparation

Installation

  1. Clone this repository.

    git clone git@github.com:NJU-LHRS/LHRS-Bot.git
    cd LHRS-Bot
  2. Create a new virtual environment

    conda create -n lhrs python=3.10
    conda activate lhrs
  3. Install the dependencies and our package

    pip install -e .

Checkpoints

  • LLaMA2-7B-Chat

    • Automatic download:

      Our framework is designed to automatically download the checkpoint when you initiate training or run a demo. However, there are a few preparatory steps you need to complete:

      1. Request access to the LLaMA2-7B model from the Meta website.


      2. After your request has been processed, log in to Hugging Face with your personal access token:

        huggingface-cli login
        (Then paste your access token and press Enter)
      3. Done!
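      The steps above can also be scripted. Here is a minimal sketch (not part of our framework), assuming the `huggingface_hub` package is installed and your access request has been approved; `meta-llama/Llama-2-7b-chat-hf` is the gated Hugging Face repo for the chat model:

      ```python
      # Sketch: pre-fetch the LLaMA2-7B-Chat weights so the framework finds them locally.
      REPO_ID = "meta-llama/Llama-2-7b-chat-hf"

      def prefetch(local_dir: str) -> str:
          # Imported lazily so the constant above is usable without the package.
          from huggingface_hub import snapshot_download
          # Reuses the token stored by `huggingface-cli login`.
          return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

      if __name__ == "__main__":
          prefetch("./llama2-7b-chat")
      ```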

    • Manual download:

      • Download all the files from HuggingFace.

      • Change the following line in each of these files to point to your download directory:

        • /Config/multi_modal_stage{1, 2, 3}.yaml

          ...
          text:
          	...
            path: ""  # TODO: Direct to your directory
          ...
        • /Config/multi_modal_eval.yaml

          ...
          text:
          	...
            path: ""  # TODO: Direct to your directory
          ...
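      If you prefer not to edit the YAML files by hand, a small hypothetical helper (not part of the repo; assumes the PyYAML package) can set the `text.path` field programmatically:

      ```python
      # Sketch: point the `text.path` field of a config file at your local
      # LLaMA-2 download, instead of editing the YAML by hand.
      from pathlib import Path
      import yaml

      def set_text_path(cfg_file: str, ckpt_dir: str) -> None:
          cfg = yaml.safe_load(Path(cfg_file).read_text())
          cfg["text"]["path"] = ckpt_dir  # the same field as the manual edit
          Path(cfg_file).write_text(yaml.safe_dump(cfg))

      if __name__ == "__main__":
          for stage in (1, 2, 3):
              set_text_path(f"Config/multi_modal_stage{stage}.yaml", "/path/to/llama2-7b-chat")
          set_text_path("Config/multi_modal_eval.yaml", "/path/to/llama2-7b-chat")
      ```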
  • LHRS-Bot Checkpoints:

    Stage1: Baidu Disk, Google Drive
    Stage2: Baidu Disk, Google Drive
    Stage3: Baidu Disk, Google Drive
    • ⚠️ Ensure that the TextLoRA folder is located in the same directory as FINAL.pt. The name TextLoRA should remain unchanged. Our framework will automatically detect the perceiver checkpoint version and, if possible, load and merge the LoRA module.
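    A quick sanity check for this layout (a hypothetical helper, not part of the repo):

    ```python
    # Sketch: verify that TextLoRA sits next to the .pt checkpoint, as the
    # framework expects when it merges the LoRA module.
    from pathlib import Path

    def checkpoint_layout_ok(ckpt_path: str) -> bool:
        """True if the checkpoint is a .pt file with a sibling TextLoRA folder."""
        ckpt = Path(ckpt_path)
        return ckpt.suffix == ".pt" and (ckpt.parent / "TextLoRA").is_dir()
    ```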

    • Development Checkpoint:

      We will continually update our model with advanced techniques. If you're interested, feel free to download it and have fun :)

      Development: Baidu Disk, Google Drive

Demo

  • Online Web UI demo with gradio:

    python lhrs_webui.py \
         -c Config/multi_modal_eval.yaml \           # config file
         --checkpoint-path ${PathToCheckpoint}.pt \  # path to checkpoint ending with .pt
         --server-port 8000 \                        # change if needed
         --server-name 127.0.0.1 \                   # change if needed
         --share                                     # share a public link with others
  • Command line demo:

    python cli_qa.py \
         -c Config/multi_modal_eval.yaml \                 # config file
         --model-path ${PathToCheckpoint}.pt \             # path to checkpoint ending with .pt
         --image-file ${TheImagePathYouWantToChat} \       # path to image file (only a single image file is supported)
         --accelerator "gpu" \                             # choose from ["mps", "cpu", "gpu"]
         --temperature 0.4 \
         --max-new-tokens 512
  • Inference:

    • Classification

      python main_cls.py \
           -c Config/multi_modal_eval.yaml \                 # config file
           --model-path ${PathToCheckpoint}.pt \             # path to checkpoint ending with .pt
           --data-path ${ImageFolder} \                      # path to classification image folder
           --accelerator "gpu" \                             # choose from ["mps", "cpu", "gpu"]
           --workers 4 \
           --enable-amp True \
           --batch-size 8 \
           --output ${YourOutputDir}                         # path to output (results, metrics, etc.)
    • Visual Grounding

      python main_vg.py \
           -c Config/multi_modal_eval.yaml \                 # config file
           --model-path ${PathToCheckpoint}.pt \             # path to checkpoint ending with .pt
           --data-path ${ImageFolder} \                      # path to image folder
           --accelerator "gpu" \                             # choose from ["mps", "cpu", "gpu"]
           --workers 2 \
           --enable-amp True \
           --batch-size 1 \                                  # batch size 1 is recommended; we find batch inference unstable
           --data-target ${ParsedLabelJsonPath} \            # path to the parsed label JSON file
           --output ${YourOutputDir}                         # path to output (results, metrics, etc.)
    • Visual Question Answering

      python main_vqa.py \
           -c Config/multi_modal_eval.yaml \                 # config file
           --model-path ${PathToCheckpoint}.pt \             # path to checkpoint ending with .pt
           --data-path ${Image} \                            # path to image folder
           --accelerator "gpu" \                             # choose from ["mps", "cpu", "gpu"]
           --workers 2 \
           --enable-amp True \
           --batch-size 1 \                                  # batch size 1 is recommended; we find batch inference unstable
           --data-target ${ParsedLabelJsonPath} \            # path to the parsed label JSON file
           --data-type "HR" \                                # choose from ["HR", "LR"]
           --output ${YourOutputDir}                         # path to output (results, metrics, etc.)

Acknowledgement

Statement

  • If you find our work useful, please give us a 🌟 on GitHub and consider citing our paper:

    @misc{2402.02544,
      author = {Dilxat Muhtar and Zhenshi Li and Feng Gu and Xueliang Zhang and Pengfeng Xiao},
      title = {LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model},
      year = {2024},
      eprint = {arXiv:2402.02544},
    }
  • License: Apache-2.0