YouTube-VLN

[ICCV'23] Learning Vision-and-Language Navigation from YouTube Videos

Primary language: Python. License: MIT.

Learning Vision-and-Language Navigation from YouTube Videos

Kunyang Lin¹ ²*, Peihao Chen¹*, Diwei Huang¹, Thomas H. Li⁶, Mingkui Tan¹ ⁵†, Chuang Gan³ ⁴
¹South China University of Technology, ²Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, ³UMass Amherst, ⁴MIT-IBM Watson AI Lab, ⁵Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, ⁶Peking University Shenzhen Graduate School

Getting started

This project was developed with Python 3.6.13 and PyTorch 1.10.1. Install the dependencies as follows:

conda env create -f env.yaml
conda activate lily

Alternatively, install the dependencies with pip:

pip install -r requirements.txt

Some packages may still be missing after installation. If so, refer to requirements.txt and install them manually.
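To find out which packages are missing, a short script can compare requirements.txt against the installed distributions. The following is a minimal sketch and not part of the repository: the helper name `missing_requirements` is hypothetical, it only understands simple lines such as `name==version`, and on Python 3.6 it assumes the `importlib_metadata` backport is installed.

```python
import re
from pathlib import Path

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: older interpreters have the importlib_metadata backport.
    from importlib_metadata import PackageNotFoundError, version


def missing_requirements(lines):
    """Return distribution names from requirement lines that are not installed."""
    missing = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        # Keep only the name, cutting at version specifiers, extras, or markers.
        name = re.split(r"[<>=!~;\[\s@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.is_file():
        for name in missing_requirements(req.read_text().splitlines()):
            print("missing:", name)
```

Run it from the repository root inside the activated environment. The check only looks at whether a distribution is present, not whether its version matches the pinned one.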

Preparing dataset

We describe the detailed construction process of our proposed YouTube-VLN dataset in YouTube_VLN.md. The whole process is time-consuming. If you want to train directly on the generated dataset, please download the following data.

1. Download the image features (11 files in total) and put them into data/YouTube-VLN/youtube_img_features:

image features 0, image features 1, image features 2, image features 3, image features 4, image features 5, image features 6, image features 7, image features 8, image features 9, image features 10

2. Download the trainset and testset and put them into data/YouTube-VLN/ytb.

3. Download the checkpoint of ViLBERT pre-trained on Conceptual Captions and put it into data/YouTube-VLN.

4. Download the matterport-ResNet-101-faster-rcnn features, unzip them, and put them into data/YouTube-VLN.

5. Download the instruction template and put it into data/task.

6. Run download.py to download the remaining task data:

python scripts/download.py

Training

1. Pre-train Lily using YouTube-VLN

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_port 1234 \
    -m pretrain \
    --pre_dataset ytb \
    --from_pretrained data/pretrained_model.bin \
    --save_name ytbvln_2e5_500_MRT \
    --prefix merge+ \
    --separators \
    --masked_vision \
    --masked_language \
    --ranking \
    --traj_judge \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 500 \
    --save_epochs 100

2. Fine-tune with masking loss

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT/data/best_ranking.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M \
    --masked_vision \
    --masked_language \
    --batch_size 12 \
    --num_epochs 30

3. Fine-tune with ranking loss and shuffling loss

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node 8 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M_30RS \
    --shuffle_visual_features \
    --ranking \
    --batch_size 16 \
    --num_epochs 30

4. Fine-tune with ranking loss and shuffling loss using speaker augmented data

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node 8 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
    --prefix aug+ \
    --beam_prefix aug_ \
    --shuffle_visual_features \
    --ranking \
    --batch_size 16 \
    --num_epochs 30
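The four stages are chained through checkpoint paths: each --from_pretrained argument points at a file that an earlier stage saved under result/<save_name>/data/. The sketch below reconstructs those paths to make the chain explicit. It is inferred from the commands above rather than from the training code; the helper names are hypothetical, and the zero-indexed epoch numbering (30 epochs ending at 29.bin) is an assumption based on the paths shown.

```python
from pathlib import PurePosixPath


def checkpoint_path(save_name, checkpoint):
    """Build result/<save_name>/data/<checkpoint>.bin, the pattern seen in the commands."""
    return str(PurePosixPath("result") / save_name / "data" / (checkpoint + ".bin"))


def last_epoch_checkpoint(save_name, num_epochs):
    """Assumption: epoch checkpoints are zero-indexed, so 30 epochs end at 29.bin."""
    return checkpoint_path(save_name, str(num_epochs - 1))


# Stage 1 output, used as --from_pretrained in stage 2.
pretrain_out = checkpoint_path("ytbvln_2e5_500_MRT", "best_ranking")
# Stage 2 output, used as --from_pretrained in stages 3 and 4.
masking_out = last_epoch_checkpoint("ytbvln_2e5_500_MRT_ranking_30M", 30)

print(pretrain_out)
print(masking_out)
```

As written in the commands, stages 3 and 4 both start from the stage 2 checkpoint, so neither depends on the output of the other.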

Testing

CUDA_VISIBLE_DEVICES=0 python test.py \
  --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M_30RSA/data/best_unseen.bin \
  --save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
  --split val_unseen

python scripts/calculate-metrics.py results/ytbvln_2e5_500_MRT_ranking_30M_30RSA/test_val_unseen/_results_val_unseen.json

We also provide our trained model; feel free to test it.

Citation

If you find this work helpful, please kindly consider citing our paper:

@article{lin2023ytbvln,
  title = {Learning Vision-and-Language Navigation from YouTube Videos},
  author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
  journal = {arXiv preprint arXiv:2307.11984}, 
  year = {2023},
}
@misc{lin2023ytbvln_code,
  title = {Learning Vision-and-Language Navigation from YouTube Videos},
  author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
  howpublished = {\url{https://github.com/JeremyLinky/YouTube-VLN}}, 
  year = {2023},
}

Acknowledgements

Our code is partially adapted from Airbert, video-dqn, and Probes-VLN. We thank the authors for their excellent work; please consider citing them as well.

Contact

For any questions, please feel free to file an issue or contact:

Kunyang Lin: imkunyanglin@gmail.com
Diwei Huang: sediweihuang@gmail.com