This repository contains the codebase for Airbert and some pre-trained models. It is built on top of the VLN-BERT codebase.
You need a recent version of Python (newer than 3.6). Install the dependencies with:
pip install -r requirements.txt
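Optionally, a virtual environment keeps the dependencies isolated; this is a sketch using only standard Python tooling, not something the repo requires:

```bash
# Optional: install inside a virtual environment (standard Python tooling)
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```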
First, download the BnB dataset and prepare an LMDB file containing the visual features, along with the BnB dataset files. The full procedure is described in our BnB dataset repository.
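Once the LMDB file is built, you can peek at a few records to check that it reads back correctly. This is only a sketch: the data/img_features path is an assumption, so substitute the location your BnB preparation actually produced.

```bash
# Sketch: list the first few records of the visual-features LMDB
# (path is hypothetical; keys and values are raw bytes)
python - <<'EOF'
import lmdb

env = lmdb.open("data/img_features", readonly=True, lock=False)
with env.begin() as txn:
    for i, (key, value) in enumerate(txn.cursor()):
        print(key, len(value), "bytes")
        if i >= 4:  # peek at the first five entries only
            break
env.close()
EOF
```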
Download a checkpoint of ViLBERT pre-trained on Conceptual Captions.
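A quick, optional way to confirm the download is intact (a sketch; vilbert.bin is the filename used by the training commands below):

```bash
# Sketch: check that the ViLBERT checkpoint deserializes with PyTorch
python - <<'EOF'
import torch

checkpoint = torch.load("vilbert.bin", map_location="cpu")
print(f"checkpoint loaded: {len(checkpoint)} top-level entries")
EOF
```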
Fine-tune the checkpoint on the BnB dataset using one of the following path-instruction methods.
To speed up training, a SLURM script for running on 64 GPUs is provided. You can pass extra arguments depending on the path-instruction method.
For example:
export name=pretraining-with-captionless-insertion
echo $name
sbatch --job-name $name \
--export=name=$name,pretrained=vilbert.bin,args=" --masked_vision --masked_language --min_captioned 2 --separators",prefix=2capt+ \
train-bnb-8.slurm
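The exported name, pretrained, args and prefix variables are how a method's settings reach the SLURM script. As a further illustration, the image-merging method described below would plausibly be launched as follows (the args string mirrors the flags of its train_bnb.py command; the exact mapping is an assumption):

```bash
export name=pretraining-with-image-merging
sbatch --job-name $name \
--export=name=$name,pretrained=vilbert.bin,args=" --masked_vision --masked_language --min_captioned 7 --separators",prefix=merge+ \
train-bnb-8.slurm
```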
For the concatenation method, make sure you have the following dataset files:
- data/bnb/bnb_train.json
- data/bnb/bnb_test.json
- data/bnb/testset.json
Then, launch training:
python train_bnb.py \
--from_pretrained vilbert.bin \
--save_name concatenation \
--separators \
--min_captioned 7 \
--masked_vision \
--masked_language
For the image-merging method, make sure you have the following dataset files:
- data/bnb/merge+bnb_train.json
- data/bnb/merge+bnb_test.json
- data/bnb/merge+testset.json
Then, launch training:
python train_bnb.py \
--from_pretrained vilbert.bin \
--save_name image_merging \
--prefix merge+ \
--min_captioned 7 \
--separators \
--masked_vision \
--masked_language
For the captionless-insertion method, make sure you have the following dataset files:
- data/bnb/2capt+bnb_train.json
- data/bnb/2capt+bnb_test.json
- data/bnb/2capt+testset.json
Then, launch training:
python train_bnb.py \
--from_pretrained vilbert.bin \
--save_name captionless_insertion \
--prefix 2capt+ \
--min_captioned 2 \
--separators \
--masked_vision \
--masked_language
For the instruction-rephrasing method, make sure you have the following dataset files:
- data/bnb/np+bnb_train.json
- data/bnb/np+bnb_test.json
- data/bnb/np+testset.json
- data/np_train.json
Then, launch training:
python train_bnb.py \
--from_pretrained vilbert.bin \
--save_name instruction_rephrasing \
--prefix np+ \
--min_captioned 7 \
--separators \
--masked_vision \
--masked_language \
--skeleton data/np_train.json
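Before launching, you can cheaply confirm the skeleton file is valid JSON (a sketch using only the standard library; how the file is produced is described in the BnB dataset repository):

```bash
# Sketch: validate data/np_train.json without printing its contents
python -m json.tool data/np_train.json > /dev/null && echo "np_train.json is valid JSON"
```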
First of all, download the R2R data:
make r2r
Train first with the masked language and masked vision objectives:

python train.py \
--from_pretrained bnb-pretrained.bin \
--save_name r2rM \
--masked_language --masked_vision --no_ranking
Then fine-tune the resulting checkpoint with the ranking objective and shuffled visual features:

python train.py \
--from_pretrained r2rM.bin \
--save_name r2rRS \
--shuffle_visual_features
Download the augmented paths from EnvDrop:
make speaker
Then run the train.py script:
python train.py \
--from_pretrained r2rM.bin \
--save_name r2rRS \
--shuffle_visual_features \
--prefix aug+ \
--beam_prefix aug_
You can download a pretrained model from our model zoo.
pushd ../model-zoos # https://github.com/airbert-vln/model-zoos
make airbert-r2rRSA
popd
# Install dependencies if not already done
poetry install
# Download data if not already done
make r2r
make lmdb
poetry run python test.py \
--from_pretrained ../model-zoos/airbert-r2rRSA.bin \
--save_name testing \
--split val_unseen
Please see the dedicated repository for fine-tuning Airbert in the generative setting.
The datasets are provided in data/task/.
See the BibTeX file.