

MO-VLN

Benchmark Page · Download Datasets · Download Simulator

This repository is the official PyTorch implementation of MO-VLN.

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation
Xiwen Liang*, Liang Ma*, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, Xiaodan Liang†

*Equal contribution †Corresponding author


Update

🚀🚀 [8/17/2023] v0.2.0: More assets! 2 new scenes, 50 new walkers, 954 new objects, 1k+ new instructions

We have released version 0.2.0 of the MO-VLN benchmark simulator.

  • Support for grabbing and navigation tasks.
  • Added many different walker states, including 50 unique walkers across gender, skin color, and age groups, with smooth walking or running motions.
  • Added a walker control interface (a hedged usage sketch follows this list). The interface supports:
    • Selecting the walker type to generate
    • Specifying where walkers are generated
    • Setting whether they move freely
    • Controlling the speed of their movement
  • Added 1k+ instructions to our four tasks.
  • We modeled an additional 954 classes of models to construct the indoor scenes.
  • Two new scenes have been added, bringing the total to five:
    • Café
    • Restaurant
    • Nursing Room
    • Separate tables -- multiple tables enable large-scale, efficient parallel training for grasping
    • Home scene -- a home suite consisting of a living room, kitchen, dining room, and multiple bedrooms
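
Below is a minimal sketch of how the walker control interface might be driven from Python. The client object, method names, and parameters (spawn_walker, free_roam, speed, and so on) are hypothetical placeholders for the capabilities listed above, not the released API.

# Hypothetical walker-control sketch; all names are placeholders, not the released API.
from dataclasses import dataclass

@dataclass
class WalkerSpec:
    walker_type: int         # which of the 50 walker types to generate
    position: tuple          # (x, y, z) spawn location in the scene
    free_roam: bool = True   # whether the walker moves freely
    speed: float = 1.0       # movement speed (walking vs. running)

def spawn_walkers(client, specs):
    # Ask the (hypothetical) simulator client to create each requested walker.
    for spec in specs:
        client.spawn_walker(
            walker_type=spec.walker_type,
            location=spec.position,
            free_roam=spec.free_roam,
            speed=spec.speed,
        )

# Example (commented out because `sim_client` is not defined in this sketch):
# spawn_walkers(sim_client, [
#     WalkerSpec(walker_type=3, position=(1.0, 2.0, 0.0), speed=0.8),
#     WalkerSpec(walker_type=17, position=(4.5, 0.5, 0.0), speed=2.5),
# ])
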
[6/18/2023] v0.1.0: 3 scenes, 2,165 objects, realistic lighting and shadows, and support for four instruction tasks

We have released version 0.1.0 of the MO-VLN benchmark simulator.

  • Built on UE5.
  • 3 scene types:
    • Café -- modeled at a 1:1 scale to a real café
    • Restaurant -- modeled at a 1:1 scale to a real restaurant
    • Nursing Room -- modeled at a 1:1 scale to a real nursing room
  • We handcrafted 2,165 classes of models at a 1:1 scale to their real-life counterparts; the three scenes were ultimately constructed from a total of 4,230 models.
  • We selected 129 representative classes from these models to support navigation testing. Among them, 54 classes are fixed within the environment, while 73 classes can be customized by users.
  • Realistic lighting and shadow effects.
  • Supports four instruction tasks:
    • goal-conditioned navigation given a specific object category (e.g., "fork");
    • goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball");
    • step-by-step instruction following;
    • finding abstract objects based on high-level instructions (e.g., "I am thirsty").

Overview

MO-VLN provides four tasks: 1) goal-conditioned navigation given a specific object category (e.g., "fork"); 2) goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball"); 3) step-by-step instruction following; 4) finding abstract objects based on high-level instructions (e.g., "I am thirsty"). The earlier version of our simulator covers three high-quality scenes: café, restaurant, and nursing room.
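
To make the task formats concrete, the snippet below shows illustrative episode records for the four tasks. The field names and the step-by-step instruction text are assumptions for illustration only; the actual schema is defined by the downloadable datasets.

# Illustrative episode records for the four MO-VLN tasks.
# Field names and the step-by-step example are assumptions, not the dataset schema.
episodes = [
    # 1) goal-conditioned navigation given an object category
    {"task": "objectnav", "goal_category": "fork"},
    # 2) goal-conditioned navigation given a simple instruction
    {"task": "simple_instruction",
     "instruction": "Search for and move towards a tennis ball"},
    # 3) step-by-step instruction following (instruction text is made up here)
    {"task": "step_by_step",
     "instruction": "Walk past the counter, turn left, and stop at the window table"},
    # 4) finding an abstract object from a high-level instruction
    {"task": "abstract_goal", "instruction": "I am thirsty"},
]

for ep in episodes:
    # Print either the goal category or the instruction for each episode.
    print(ep["task"], "->", ep.get("goal_category") or ep["instruction"])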

Figure: scenes

Figure: tasks

Installing Dependencies

Setup

Clone the repository and install other requirements:

git clone https://github.com/liangcici/MO-VLN.git
cd MO-VLN/
pip install -r requirements.txt

Setting up the dataset

  • Download the original datasets from here.

  • Generate data for ObjectNav (goal-conditioned navigation given a specific object category).

python data_preprocess/gen_objectnav.py --map_id 3

map_id selects the scene: {3: Starbucks; 4: TG; 5: NursingRoom}.
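
To prepare ObjectNav data for all three scenes in one pass, a small wrapper like the one below can be used; it simply repeats the documented command for each map_id.

# Generate ObjectNav data for every scene listed above (3: Starbucks, 4: TG, 5: NursingRoom).
import subprocess

for map_id in (3, 4, 5):
    subprocess.run(
        ["python", "data_preprocess/gen_objectnav.py", "--map_id", str(map_id)],
        check=True,  # stop early if generation fails for a scene
    )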

Usage

The implementation is based on frontier-based exploration (FBE). Exploration with commonsense knowledge, as in our paper, is based on ESC, which we are not permitted to release. The files dataset/objectnav/*.npy contain knowledge extracted from LLMs and can be used to reproduce exploration with commonsense knowledge.
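
If you want to inspect the LLM-derived knowledge before plugging it into exploration, a quick look like the sketch below works; the internal structure of these .npy files is not documented here, so allow_pickle=True and the printed summary are assumptions.

# Peek at the LLM knowledge files under dataset/objectnav/.
import glob

import numpy as np

for path in sorted(glob.glob("dataset/objectnav/*.npy")):
    data = np.load(path, allow_pickle=True)  # contents/dtype are an assumption
    print(path, type(data), getattr(data, "shape", None))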

Run models with FBE:

  • For ObjectNav:
python zero_shot_eval.py --sem_seg_model_type glip --map_id 3
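
To evaluate on all three scenes, the documented command can be repeated per map_id; this sketch assumes the simulator scene for each map_id is available locally.

# Run the FBE + GLIP baseline on each scene in turn.
import subprocess

for map_id in (3, 4, 5):
    subprocess.run(
        ["python", "zero_shot_eval.py",
         "--sem_seg_model_type", "glip",
         "--map_id", str(map_id)],
        check=True,
    )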

Related Projects

  • The Semantic Mapping module is based on SemExp.

To-Do List

  • Added more walker states (completed in v0.2.0).
  • Added a walker control interface (completed in v0.2.0).
  • Provide more classes of generative objects.
  • Construct complex tasks involving combined navigation and grasping.
  • Release the 10+ scenes currently under construction; they will be added successively.
  • Generate high-quality instruction-ground-truth pairs for the newly constructed scenes.
  • Continue to improve the simulator's physics engine to achieve more realistic dexterous hand-grabbing.
  • Add more interactive properties to objects in the environment, such as a coffee machine that can be controlled to make coffee.

Citation

@article{liang2023mo,
  title={MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation},
  author={Liang, Xiwen and Ma, Liang and Guo, Shanshan and Han, Jianhua and Xu, Hang and Ma, Shikui and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2306.10322},
  year={2023}
}