PixelNav: Bridging Zero-Shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill, ICRA 2024
This is the official implementation of the paper. Please refer to the paper and website for more technical details.
Our project is based on habitat-sim and habitat-lab. Please follow their guides to install them in your Python environment; you can directly install the latest versions of habitat-lab and habitat-sim. Make sure you have properly downloaded the navigation scenes (HM3D, MP3D) and the episode dataset for object navigation. Besides, make sure the following dependencies are installed in your Python environment:
numpy, opencv-python, tqdm, openai, torch
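Before moving on, you may want a quick sanity check that the simulator stack is importable. The minimal snippet below is only an illustration of such a check (the printed versions will differ per installation):

```python
# Quick sanity check that habitat-sim and habitat-lab are importable.
import habitat
import habitat_sim

print("habitat-sim:", habitat_sim.__version__)
print("habitat-lab:", habitat.__version__)
```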
First, clone our repo:
git clone https://github.com/wzcai99/Pixel-Navigator.git
cd Pixel-Navigator
Our method depends on an open-vocabulary detection module, GroundingDINO, and a segmentation module, Segment-Anything. You can either follow the installation guides on their websites or install them locally from our third_party directory:
cd third_party/GroundingDINO
pip install -e .
cd ../Segment-Anything/
pip install -e .
To emphasize our contribution on the pixel navigation skill, in this repo we replace the original complicated high-level planning process with GPT-4V. You should prepare your own API key and API endpoint; check ./llm_utils/gpt_request.py for more details.
export OPENAI_API_KEY=<YOUR KEYS>
export OPENAI_API_ENDPOINT=<YOUR_ENDPOINT>
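For reference, a GPT-4V call driven by these two environment variables can look like the sketch below. This is only an illustration assuming the openai>=1.0 Python client; the model name and message format are assumptions, and the actual prompt construction used by PixelNav lives in ./llm_utils/gpt_request.py.

```python
# Minimal sketch of a GPT-4V request using the two environment variables above.
# The model name and message layout are illustrative assumptions, not the exact
# code in ./llm_utils/gpt_request.py.
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_ENDPOINT"],
)

def ask_gpt4v(image_path: str, question: str) -> str:
    # Encode the observation image as base64 so it can be sent inline.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=256,
    )
    return response.choices[0].message.content
```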
| Module | Approach | Weight | Config |
| --- | --- | --- | --- |
| Object Detection | GroundingDINO | groundingdino_swinb_cogcoor.pth | GroundingDINO_SwinB_cfg.py |
| Object Segmentation | SAM | sam_vit_h_4b8939.pth | vit-h |
| Navigation Skill | PixelNav | Checkpoint_A, Checkpoint_B, Checkpoint_C | ---- |
We provide several different checkpoints for the pixel navigation skill, each trained on different data (scale, scenes, etc.). Choose the one that fits your project. A sketch for loading the detection and segmentation weights follows below.
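To verify that the GroundingDINO and SAM checkpoints from the table load correctly, a minimal sketch such as the following can be used. The checkpoint paths and test image are illustrative placeholders; the paths actually used by the benchmark scripts are set in constants.py.

```python
# Sanity-check loading of the detection/segmentation checkpoints from the table.
# All file paths below are illustrative; adjust them to your local setup.
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# GroundingDINO (SwinB) with the config/weight listed above.
dino = load_model(
    "third_party/GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py",
    "checkpoints/groundingdino_swinb_cogcoor.pth",
)
image_source, image = load_image("example.png")  # example.png is a placeholder
boxes, logits, phrases = predict(
    model=dino, image=image, caption="chair",
    box_threshold=0.35, text_threshold=0.25,
)

# SAM (ViT-H) with the weight listed above.
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)  # load_image returns the RGB array first

# Prompt SAM with the first detected box (normalized cxcywh -> pixel xyxy).
if len(boxes) > 0:
    h, w = image_source.shape[:2]
    cx, cy, bw, bh = (boxes[0].cpu().numpy() * np.array([w, h, w, h])).tolist()
    box_xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
```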
Open the constants.py file and make sure all the input directories and checkpoint paths are set. Then run the following command:
# prefix selects hm3d or mp3d, difficulty controls the pixelnav goal distance
# sensor_height sets the camera height and image_hfov sets the camera hfov
python evaluate_policy.py --prefix=hm3d --difficulty=easy --sensor_height=0.88 --image_hfov=79
The script will automatically record the navigation process into ./PREFIX_eval_trajectory/. The left side of the mp4 file shows the first-frame image with the pixel goal marked as a blue dot. The right side shows the video stream of the navigation process together with the estimated pixel goal and temporal distance. Examples are shown below:
fps_hm3d.mp4
fps_mp3d.mp4
Open the constants.py file and make sure all the input directories and checkpoint paths are set. Then run the following command:
python objnav_benchmark.py --checkpoint=<PIXELNAV_CHECKPOINT_PATH>
If everything goes well, you will see a new /tmp directory recording the navigation process. Examples are shown below:
objnav_fps.mp4
objnav_metric.mp4
Please cite our paper if you find it helpful :)
@inproceedings{cai2024bridging,
title={Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill},
author={Cai, Wenzhe and Huang, Siyuan and Cheng, Guangran and Long, Yuxing and Gao, Peng and Sun, Changyin and Dong, Hao},
booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages={5228--5234},
year={2024},
organization={IEEE}
}