NvEM

Code for the paper: Neighbor-view Enhanced Model for Vision and Language Navigation (ACM MM 2021 oral)
Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, Tieniu Tan

[Paper] [GitHub]

Motivation

Most existing works represent a navigable candidate by the feature of the single view in which the candidate lies. However, an instruction may refer to landmarks outside that view, which can cause textual-visual matching to fail in existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) that adaptively incorporates visual contexts from neighbor views for better textual-visual matching.

Prerequisites

Installation

Install the Matterport3D Simulator. Please note that the code is based on Simulator-v2.

The package versions of our environment are listed in requirements.txt. In particular, we use:

Python 3.6.9
NumPy 1.19.1
OpenCV 4.1.0.25
PyTorch 0.4.0
Torchvision 0.1.8
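
To compare an existing environment against the versions listed above, a small check along the following lines may help. It is only a sketch: the import names `cv2` and `torch` are the usual module names for OpenCV and PyTorch, and the helpers `same_release`, `installed_versions`, and `mismatches` are hypothetical names introduced here, not part of this repository. Because installed packages often report a shorter or suffixed version string than the pinned one, the comparison only looks at the leading components the two strings share.

```python
import sys

# Versions copied from the list above; keys are Python import names.
PINNED = {
    "python": "3.6.9",
    "numpy": "1.19.1",
    "cv2": "4.1.0.25",
    "torch": "0.4.0",
    "torchvision": "0.1.8",
}


def same_release(pinned, found):
    """Compare two version strings on the leading components they share."""
    if found is None:
        return False
    a = pinned.split("+")[0].split(".")
    b = found.split("+")[0].split(".")
    n = min(len(a), len(b))
    return a[:n] == b[:n]


def installed_versions():
    """Return {name: version string or None} for the pinned packages."""
    found = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in PINNED:
        if name == "python":
            continue
        try:
            module = __import__(name)
            found[name] = getattr(module, "__version__", None)
        except Exception:  # package missing or failing to import
            found[name] = None
    return found


def mismatches(found, pinned=PINNED):
    """List (name, pinned, found) for every package whose version differs."""
    return [
        (name, want, found.get(name))
        for name, want in pinned.items()
        if not same_release(want, found.get(name))
    ]

# Example use:
#   for name, want, have in mismatches(installed_versions()):
#       print("{}: pinned {}, found {}".format(name, want, have))
```

An empty list from `mismatches` means every package matched on the shared version components; it does not guarantee the environment is otherwise identical to the one in requirements.txt.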

Data Preparation

Please follow the instructions below to prepare the data in the following directories:

connectivity
- Download the connectivity maps [23.8MB].
data
- Download the R2R data [5.8MB].
- Download the vocabulary and the augmented data from EnvDrop [79.5MB].
img_features
- Download the Scene features [4.2GB] (ResNet-152-Places365).
- Download the pre-processed Object features and vocabulary [1.3GB] (Caffe Faster-RCNN).
GT for CLS score
- Download id_paths.json [1.4MB] and put it in tasks/R2R/data/.
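
Once the downloads are in place, the layout listed above can be sanity-checked with a short script such as the one below. This is a sketch under one assumption: that all listed paths are relative to a single repository root. The helper name `missing_paths` is hypothetical, and the check only tests that each path exists, not that the files inside are complete.

```python
import os

# Paths named in the list above. Treating them all as relative to one
# repository root is an assumption of this sketch.
EXPECTED = [
    "connectivity",
    "data",
    "img_features",
    os.path.join("tasks", "R2R", "data", "id_paths.json"),
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]

# Example use from the repository root:
#   for path in missing_paths("."):
#       print("missing:", path)
```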

Trained Network Weights

snap
- Download the trained network weights [116.0MB]

R2R Navigation

Please read Peter Anderson's VLN paper for the R2R Navigation task.

Our code is based on the code structure of EnvDrop and Recurrent VLN-Bert.

Reproduce Testing Results

To replicate the performance reported in our paper, load the trained network weights and run validation:

bash run/valid.bash 0

Here is the full log:

Loaded the listener model at iter 119600 from snap/NvEM_bt/state_dict/best_val_unseen
Env name: val_seen, nav_error: 3.4389, oracle_error: 2.1848, steps: 5.5749, lengths: 11.2468, success_rate: 0.6866, oracle_rate: 0.7640, spl: 0.6456
Env name: val_unseen, nav_error: 4.2603, oracle_error: 2.8130, steps: 6.3585, lengths: 12.4147, success_rate: 0.6011, oracle_rate: 0.6790, spl: 0.5497
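
To compare your own run against the numbers above, the `Env name: ...` lines can be turned into dictionaries. The following is a minimal sketch that assumes the log keeps the comma-separated `key: value` layout shown above; `parse_eval_line` is a hypothetical helper, not a function from this repository.

```python
import re


def parse_eval_line(line):
    """Parse 'Env name: <split>, key: value, ...' into (split, {key: float}).

    Returns None for lines that do not start with 'Env name:'.
    """
    match = re.match(r"Env name:\s*(\w+),\s*(.*)", line.strip())
    if match is None:
        return None
    split, rest = match.groups()
    metrics = {}
    for field in rest.split(","):
        key, _, value = field.partition(":")
        if value.strip():
            metrics[key.strip()] = float(value)
    return split, metrics


# The val_unseen line from the log above.
SAMPLE = (
    "Env name: val_unseen, nav_error: 4.2603, oracle_error: 2.8130, "
    "steps: 6.3585, lengths: 12.4147, success_rate: 0.6011, "
    "oracle_rate: 0.6790, spl: 0.5497"
)

split, metrics = parse_eval_line(SAMPLE)
print(split, metrics["success_rate"], metrics["spl"])
```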

Training

Navigator

To train the network from scratch, first train a Navigator on the R2R training split:

bash run/follower.bash 0

The trained Navigator will be saved under snap/.

Speaker

You also need to train a Speaker for augmented training:

bash run/speaker.bash 0

The trained Speaker will be saved under snap/.

Augmented Navigator

Finally, continue training the Navigator on a mixture of the original and augmented data:

bash run/follower_bt.bash 0
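
The three training stages above can also be chained from Python. This is a sketch only: the script paths and the trailing argument `0` are copied from the commands above, the argument is passed through unchanged without assuming what it means, and `training_commands` and `run_all` are hypothetical helper names. By default the sketch just prints the commands; pass `dry_run=False` to actually launch them from the repository root.

```python
import subprocess

# Stage names and scripts, in the order the sections above give them.
STAGES = [
    ("navigator", "run/follower.bash"),
    ("speaker", "run/speaker.bash"),
    ("augmented navigator", "run/follower_bt.bash"),
]


def training_commands(script_arg="0"):
    """Build the three bash commands, passing script_arg through unchanged."""
    return [["bash", script, str(script_arg)] for _, script in STAGES]


def run_all(script_arg="0", dry_run=True):
    """Print each command and, unless dry_run is set, run it in sequence."""
    for cmd in training_commands(script_arg):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a stage exits with an error.
            subprocess.run(cmd, check=True)
```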

Citation

If you use or discuss our Neighbor-view Enhanced Model, please cite our paper:

@misc{an2021neighborview,
      title={Neighbor-view Enhanced Model for Vision and Language Navigation}, 
      author={Dong An and Yuankai Qi and Yan Huang and Qi Wu and Liang Wang and Tieniu Tan},
      year={2021},
      eprint={2107.07201},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
