Layout-aware Dreamer for Embodied Referring Expression Grounding

Mingxiao Li*, Zehao Wang*, Tinne Tuytelaars, Marie-Francine Moens

AAAI 2023 main conference

Paper / BibTeX

Environment

Please set up the Matterport3DSimulator Docker environment following link

For missing packages, please check the corresponding version in requirements.txt

Data preparation

Data preparation consists of two steps: preprocessing for image generation and token id extraction

Downloads

Follow the instructions in vln-duet, or download the data from Dropbox, including processed annotations and features. Unzip the REVERIE and R2R folders into datasets
Since we mainly use CLIP as our visual feature encoder, please follow the instructions in link and make sure to load ViT-L-14-336px.pt during training. We recommend putting it in ckpts/ViT-L-14-336px.pt
Make sure to install GLIDE for image generation
Download the Matterport3D dataset from link
Additional data from LAD is released at link
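
Before starting a long training run, it can help to confirm that the CLIP checkpoint sits where the list above recommends. The following is a minimal sketch, not part of the repository: `find_clip_checkpoint` and `load_clip` are hypothetical helper names, and `load_clip` assumes the OpenAI `clip` package is the one installed via the linked instructions.

```python
from pathlib import Path

# Location recommended in the list above, relative to the repository root.
CLIP_CKPT = Path("ckpts/ViT-L-14-336px.pt")


def find_clip_checkpoint(root="."):
    """Return the checkpoint path under `root`, or raise if it is missing."""
    ckpt = Path(root) / CLIP_CKPT
    if not ckpt.is_file():
        raise FileNotFoundError(f"CLIP checkpoint not found: {ckpt}")
    return ckpt


def load_clip(root=".", device="cpu"):
    """Load CLIP from the local checkpoint (assumes the OpenAI `clip` package)."""
    import clip  # imported lazily so the path check works without it installed

    # clip.load accepts either a model name or a path to a checkpoint file.
    return clip.load(str(find_clip_checkpoint(root)), device=device)
```

Calling `find_clip_checkpoint()` from the repository root fails early with a clear message if the file was saved elsewhere.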

Preprocessing

Generate imagined image of goal position

python preprocess/ge_ins2img_feats.py --encoder clip --dataset reverie \
--input_dir datasets/REVERIE/annotations/REVERIE_{split}_enc.json \
--clip_save_dir datasets/REVERIE/features/reverie_ins2img_clip.h5 \
--collect_clip

Put the generated data in the directory datasets/REVERIE/features

The room type codebook room_type_feats.h5 is provided in the root directory

Generate CLIP features for the Matterport3D environment

Set up the output path and the Matterport3D connectivity path in preprocess/get_all_imgs_fts.py
Run the command below to get the tsv file.
```
python preprocess/get_all_imgs_fts.py
```
Download the ViT features following VLN-DUET and put them in the directory datasets/REVERIE/features
Set up the paths in preprocess/convert_tsv2h5.py
Run the command below to get the .h5 file and put it in the directory datasets/REVERIE/features
```
python preprocess/convert_tsv2h5.py
```

Data arrangement

Make sure the datasets folder is under the root lad_src
Link the Matterport3D dataset to mp3d under the lad_src folder. The two dataset folders should be organized as follows:

lad_src
├──  datasets
│    ├── REVERIE
│    │    ├── annotations
│    │    └── features
│    │        ├── obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5
│    │        └── full_reverie_ins2img_clip.h5
│    └── R2R
└──  mp3d
     └── v1
          └── scans
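
The linking step and the layout above can be sketched in a few lines of Python. This is an illustrative helper rather than code from the repository: `link_mp3d` and `missing_paths` are hypothetical names, and the sketch assumes the Matterport3D download already contains a `v1/scans` subfolder.

```python
import os
from pathlib import Path

# Entries taken from the tree above, relative to lad_src.
EXPECTED = [
    "datasets/REVERIE/annotations",
    "datasets/REVERIE/features/obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5",
    "datasets/REVERIE/features/full_reverie_ins2img_clip.h5",
    "datasets/R2R",
    "mp3d/v1/scans",
]


def link_mp3d(root, mp3d_source):
    """Symlink an existing Matterport3D download to <root>/mp3d if absent."""
    target = Path(root) / "mp3d"
    # lexists also detects a dangling symlink, which exists() would miss.
    if not os.path.lexists(target):
        os.symlink(Path(mp3d_source).resolve(), target, target_is_directory=True)
    return target


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```

Running `missing_paths("lad_src")` after data preparation should return an empty list; any entries it does return point to files or folders that still need to be moved or linked.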

Running scripts

Since ins2img consumes too much disk space in our setup, we do not include the goal dreamer in warmup training on the phase1 augmentation data

Warmup stage - phase1 training with augmentation data for single-action prediction

cd warmup_src
sh scripts/final_frt_gd_phase1.sh

Warmup stage - phase2 training with training data for single-action prediction

cd warmup_src
sh scripts/final_frt_gd_phase2.sh # replace phase_ckpt in this script with the best phase1 result

Training stage

cd training_src
sh scripts/final_frt_gd_finetuning_stable.sh # replace phase_ckpt in this script with the best phase1 result

Evaluation script

cd training_src
sh scripts/eval.sh # replace resumedir in this script with the best training result obtained above

NOTE: The checkpoints of the LAD model after warmup phase2 and of the final LAD model trained on the REVERIE dataset can be found here
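
The stages in this section run in a fixed order, and each one after the first needs a checkpoint path edited into its script by hand. The sketch below records that order and runs one stage at a time. It is an assumption-laden convenience wrapper, not part of the repository: `STAGES` and `run_stage` are hypothetical names, and the wrapper is meant to be called from the directory that contains `warmup_src` and `training_src`.

```python
import subprocess

# (working directory, script) pairs in the order given in this section.
STAGES = [
    ("warmup_src", "scripts/final_frt_gd_phase1.sh"),
    ("warmup_src", "scripts/final_frt_gd_phase2.sh"),
    ("training_src", "scripts/final_frt_gd_finetuning_stable.sh"),
    ("training_src", "scripts/eval.sh"),
]


def run_stage(index, dry_run=True):
    """Run one stage by position; with dry_run, only return the command."""
    workdir, script = STAGES[index]
    cmd = ["sh", script]
    if not dry_run:
        # check=True raises on a non-zero exit so a failed stage is noticed.
        subprocess.run(cmd, cwd=workdir, check=True)
    return workdir, cmd
```

Running stages individually, for example `run_stage(1, dry_run=False)` once `phase_ckpt` has been updated, keeps the manual checkpoint edits between stages explicit.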

Acknowledgement

Credits to Shizhe Chen for the great baseline work VLN-DUET:

@InProceedings{Chen_2022_DUET,
    author    = {Chen, Shizhe and Guhur, Pierre-Louis and Tapaswi, Makarand and Schmid, Cordelia and Laptev, Ivan},
    title     = {Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation},
    booktitle = {CVPR},
    year      = {2022}
}

License and Citation

@InProceedings{VLN_LAD_2023,
    author    = {Li, Mingxiao and Wang, Zehao and Tuytelaars, Tinne and Moens, Marie-Francine},
    title     = {Layout-aware Dreamer for Embodied Referring Expression Grounding},
    booktitle = {AAAI},
    year      = {2023}
}

zehao-wang/LAD