

MO-VLN

Benchmark Page · Download Datasets · Download Simulator

This repository is the official PyTorch implementation of MO-VLN.

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation
Xiwen Liang*, Liang Ma*, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, Xiaodan Liang†

*Equal contribution †Corresponding author


Update

🚀🚀 [8/17/2023] v0.2.0: More assets! 2 new scenes, 50 new walkers, 954 new objects, 1k+ new instructions

We have released version 0.2.0 of the MO-VLN benchmark simulator.

  • Support for grabbing and navigation tasks.
  • Added many different walker states, including 50 unique walkers across gender, skin color, and age groups, with smooth walking or running motions.
  • Added a walker control interface (a hedged usage sketch follows this list). The interface supports:
    • Selecting the walker type to generate
    • Specifying where walkers are generated
    • Setting whether they move freely
    • Controlling the speed of their movement
  • Added 1k+ instructions to our four tasks.
  • We modeled an additional 954 classes of models to construct the indoor scenes.
  • Two new scenes have been added, bringing the total to five:
    • Café
    • Restaurant
    • Nursing Room
    • Separate tables -- multiple tables enable large-scale, efficient parallel training for grasping
    • Home scene -- a home suite consisting of a living room, kitchen, dining room, and multiple bedrooms
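
Below is a minimal sketch of how the walker control interface might be driven from Python. The client object, method names, and parameters (spawn_walker, free_roam, speed, and so on) are hypothetical placeholders for the capabilities listed above, not the released API.

# Hypothetical walker-control sketch; all names are placeholders, not the released API.
from dataclasses import dataclass

@dataclass
class WalkerSpec:
    walker_type: int         # which of the 50 walker types to generate
    position: tuple          # (x, y, z) spawn location in the scene
    free_roam: bool = True   # whether the walker moves freely
    speed: float = 1.0       # movement speed (walking vs. running)

def spawn_walkers(client, specs):
    # Ask the (hypothetical) simulator client to create each requested walker.
    for spec in specs:
        client.spawn_walker(
            walker_type=spec.walker_type,
            location=spec.position,
            free_roam=spec.free_roam,
            speed=spec.speed,
        )

# Example (commented out because `sim_client` is not defined in this sketch):
# spawn_walkers(sim_client, [
#     WalkerSpec(walker_type=3, position=(1.0, 2.0, 0.0), speed=0.8),
#     WalkerSpec(walker_type=17, position=(4.5, 0.5, 0.0), speed=2.5),
# ])
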
[6/18/2023] v0.1.0: 3 scenes, 2,165 objects, realistic lighting and shadows, and support for four instruction tasks

We have released version 0.1.0 of the MO-VLN benchmark simulator.

  • Built on UE5.
  • 3 scene types:
    • Café -- modeled at a 1:1 scale to a real café
    • Restaurant -- modeled at a 1:1 scale to a real restaurant
    • Nursing Room -- modeled at a 1:1 scale to a real nursing room
  • We handcrafted 2,165 classes of models at a 1:1 scale to their real-life counterparts; the three scenes were ultimately constructed from a total of 4,230 models.
  • We selected 129 representative classes from these models to support navigation testing. Among them, 54 classes are fixed within the environment, while 73 classes can be customized by users.
  • Realistic lighting and shadow effects.
  • Supports four instruction tasks:
    • goal-conditioned navigation given a specific object category (e.g., "fork");
    • goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball");
    • step-by-step instruction following;
    • finding abstract objects based on high-level instructions (e.g., "I am thirsty").

Overview

MO-VLN provides four tasks: 1) goal-conditioned navigation given a specific object category (e.g., "fork"); 2) goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball"); 3) step-by-step instruction following; 4) finding abstract objects based on high-level instructions (e.g., "I am thirsty"). The earlier version of our simulator covers three high-quality scenes: café, restaurant, and nursing room.
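
To make the task formats concrete, the snippet below shows illustrative episode records for the four tasks. The field names and the step-by-step instruction text are assumptions for illustration only; the actual schema is defined by the downloadable datasets.

# Illustrative episode records for the four MO-VLN tasks.
# Field names and the step-by-step example are assumptions, not the dataset schema.
episodes = [
    # 1) goal-conditioned navigation given an object category
    {"task": "objectnav", "goal_category": "fork"},
    # 2) goal-conditioned navigation given a simple instruction
    {"task": "simple_instruction",
     "instruction": "Search for and move towards a tennis ball"},
    # 3) step-by-step instruction following (instruction text is made up here)
    {"task": "step_by_step",
     "instruction": "Walk past the counter, turn left, and stop at the window table"},
    # 4) finding an abstract object from a high-level instruction
    {"task": "abstract_goal", "instruction": "I am thirsty"},
]

for ep in episodes:
    # Print either the goal category or the instruction for each episode.
    print(ep["task"], "->", ep.get("goal_category") or ep["instruction"])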

Figure: scenes

Figure: tasks

Installing Dependencies

Setup

Clone the repository and install other requirements:

git clone https://github.com/liangcici/MO-VLN.git
cd MO-VLN/
pip install -r requirements.txt

Setting up the dataset

  • Download the original datasets from here.

  • Generate data for ObjectNav (goal-conditioned navigation given a specific object category).

python data_preprocess/gen_objectnav.py --map_id 3

map_id selects the scene: {3: Starbucks; 4: TG; 5: NursingRoom}.
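
To prepare ObjectNav data for all three scenes in one pass, a small wrapper like the one below can be used; it simply repeats the documented command for each map_id.

# Generate ObjectNav data for every scene listed above (3: Starbucks, 4: TG, 5: NursingRoom).
import subprocess

for map_id in (3, 4, 5):
    subprocess.run(
        ["python", "data_preprocess/gen_objectnav.py", "--map_id", str(map_id)],
        check=True,  # stop early if generation fails for a scene
    )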

Usage

The implementation is based on frontier-based exploration (FBE). Exploration with commonsense knowledge, as in our paper, is based on ESC, which we are not permitted to release. The files dataset/objectnav/*.npy contain knowledge extracted from LLMs and can be used to reproduce exploration with commonsense knowledge.
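
If you want to inspect the LLM-derived knowledge before plugging it into exploration, a quick look like the sketch below works; the internal structure of these .npy files is not documented here, so allow_pickle=True and the printed summary are assumptions.

# Peek at the LLM knowledge files under dataset/objectnav/.
import glob

import numpy as np

for path in sorted(glob.glob("dataset/objectnav/*.npy")):
    data = np.load(path, allow_pickle=True)  # contents/dtype are an assumption
    print(path, type(data), getattr(data, "shape", None))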

Run models with FBE:

  • For ObjectNav:
python zero_shot_eval.py --sem_seg_model_type glip --map_id 3
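
To evaluate on all three scenes, the documented command can be repeated per map_id; this sketch assumes the simulator scene for each map_id is available locally.

# Run the FBE + GLIP baseline on each scene in turn.
import subprocess

for map_id in (3, 4, 5):
    subprocess.run(
        ["python", "zero_shot_eval.py",
         "--sem_seg_model_type", "glip",
         "--map_id", str(map_id)],
        check=True,
    )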

Related Projects

  • The Semantic Mapping module is based on SemExp.

To-Do List

  • Added more walker states (completed in v0.2.0).
  • Added a walker control interface (completed in v0.2.0).
  • Provide more classes of generative objects.
  • Construct complex tasks involving combined navigation and grasping.
  • Release the 10+ scenes currently under construction; they will be added successively.
  • Generate high-quality instruction-ground-truth pairs for the newly constructed scenes.
  • Continue to improve the simulator's physics engine to achieve more realistic dexterous hand-grabbing.
  • Add more interactive properties to objects in the environment, such as a coffee machine that can be controlled to make coffee.

Citation

@article{liang2023mo,
  title={MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation},
  author={Liang, Xiwen and Ma, Liang and Guo, Shanshan and Han, Jianhua and Xu, Hang and Ma, Shikui and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2306.10322},
  year={2023}
}