Trained on an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit significant reasoning abilities that emerge with model scaling. This trend underscores the potential of training LLMs on unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs, reasons about the agent's current status, and makes a decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to the navigation task, identifying landmarks in observed scenes, tracking navigation progress, and adapting to exceptions by adjusting its plan. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from the observations and actions along a path, as well as drawing accurate top-down metric trajectories from the agent's navigation history. Although the zero-shot performance of NavGPT on R2R still falls short of trained models, we suggest adapting multi-modality inputs so that LLMs can serve as visual navigation agents, and applying the explicit reasoning of LLMs to benefit learning-based models.
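As a rough illustration of the per-step loop described above, the Python sketch below shows how observations, history, and candidate directions could be assembled into a prompt for the LLM. Every helper name here (describe_observation, get_navigable_directions, build_prompt, query_llm, parse_action) is a hypothetical placeholder, not the actual NavGPT API.

# Minimal sketch of the per-step decision loop described above.
# All helpers are hypothetical placeholders, not the repo's actual API.
def navigate(instruction, env, max_steps=10):
    history = []                          # textual summaries of previous steps
    obs = env.reset()                     # initial panoramic observation
    for step in range(max_steps):
        obs_text = describe_observation(obs)          # visual observation -> text
        candidates = get_navigable_directions(obs)    # future explorable directions
        # Ask the LLM to reason over the instruction, history, observation and
        # candidates, then return its reasoning and the chosen action.
        prompt = build_prompt(instruction, history, obs_text, candidates)
        reasoning, action = parse_action(query_llm(prompt))
        history.append(f"Step {step}: {reasoning} -> chose {action}")
        if action == "stop":              # the agent decides it has reached the goal
            break
        obs = env.step(action)            # move to the selected viewpoint
    return history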
- Release 🎇NavGPT code.
- Data preprocessing code.
- Customized LLM inference guidance.
Create a conda environment and install all dependencies:
conda create --name NavGPT python=3.9
conda activate NavGPT
pip install -r requirements.txt
Download the R2R data from Dropbox and put it in the datasets directory.
The related data preprocessing code can be found in nav_src/scripts.
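The preprocessing turns each view of the panorama into a textual description that the LLM can read. Below is a minimal, hypothetical sketch of that idea using an off-the-shelf captioner (BLIP-2 via Hugging Face transformers); the actual scripts in nav_src/scripts may use a different model, prompts, and output format.

# Hypothetical sketch: caption one view image so it can be fed to the LLM as text.
# The real preprocessing in nav_src/scripts may differ.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example_view.jpg")            # one discretized view of the panorama
inputs = processor(images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(generated[0], skip_special_tokens=True).strip()
print(caption)                                    # e.g. "a hallway with a wooden door"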
Get an OpenAI API Key and add to your environment variables:
# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}
# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}
Alternatively, you can set the key in your code:
import os
os.environ["OPENAI_API_KEY"] = "{Your_Private_Openai_Key}"
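Optionally, you can sanity-check the key with a one-off API call. The snippet below assumes the pre-1.0 openai Python SDK (openai.ChatCompletion); with openai>=1.0 the equivalent call is openai.OpenAI().chat.completions.create.

# Quick sanity check that the key works (assumes the openai<1.0 SDK).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say 'ready' if you can read this."}],
    max_tokens=5,
)
print(resp["choices"][0]["message"]["content"])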
To replicate the performance reported in our paper, use GPT-4 and run validation with the following configuration:
cd nav_src
python NavGPT.py --llm_model_name gpt-4 \
--output_dir ../datasets/R2R/exprs/gpt-4-val-unseen \
--val_env_name R2R_val_unseen_instr
Results will be saved in the datasets/R2R/exprs/gpt-4-val-unseen directory.
The default --llm_model_name is gpt-3.5-turbo.
A more economical way to try 🎇NavGPT is to use GPT-3.5 and run validation on the first 10 samples with the following configuration:
cd nav_src
python NavGPT.py --llm_model_name gpt-3.5-turbo \
--output_dir ../datasets/R2R/exprs/gpt-3.5-turbo-test \
--val_env_name R2R_val_unseen_instr \
--iters 10
Add your own model repo as a submodule under nav_src/LLMs/:
cd nav_src/LLMs
git submodule add {Your_Model_Repo}
Or just copy your local inference code under nav_src/LLMs/.
Follow the instructions to set up your own LLMs for 🎇NavGPT.
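As a rough illustration, a custom model can be wrapped behind a simple text-in/text-out interface before wiring it into NavGPT.py. The class name, method signature, and model id below are hypothetical placeholders; check nav_src for the actual LLM abstraction NavGPT expects.

# Hypothetical wrapper sketch for a local Hugging Face model; the interface
# NavGPT.py actually expects may differ -- adapt the names accordingly.
from transformers import AutoModelForCausalLM, AutoTokenizer

class CustomLLM:
    def __init__(self, model_name="meta-llama/Llama-2-7b-chat-hf"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    def __call__(self, prompt: str, max_new_tokens: int = 256) -> str:
        # Text in, text out: the only contract the agent loop needs.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Return only the newly generated continuation, not the echoed prompt.
        return self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )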
Run 🎇NavGPT with your custom LLM:
cd nav_src
python NavGPT.py --llm_model_name your_custom_llm \
--output_dir ../datasets/R2R/exprs/your_custom_llm-test
If 🎇NavGPT has been helpful in your research, please cite our work using the following format:
@article{zhou2023navgpt,
title={NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models},
author={Zhou, Gengze and Hong, Yicong and Wu, Qi},
journal={arXiv preprint arXiv:2305.16986},
year={2023}
}