Kunyang Lin1 2*
Peihao Chen1*
Diwei Huang1
Thomas H. Li6
Mingkui Tan1 5†
Chuang Gan3 4
1South China University of Technology
2Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
3UMass Amherst
4MIT-IBM Watson AI Lab
5Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
6Peking University Shenzhen Graduate School
This project is developed with Python 3.6.13 and PyTorch 1.10.1. Install the dependencies as follows:
conda env create -f env.yaml
conda activate lily
Alternatively, install the dependencies with pip:
pip install -r requirements.txt
Some packages may be missing after installation; if so, refer to requirements.txt and install them manually.
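To confirm that the active environment matches the versions stated above, a small check along these lines may help. This is a minimal sketch and not part of the repository: the function names `version_matches` and `check_environment` are hypothetical, and it only compares against the Python and PyTorch versions named in this README.

```python
import sys

# Versions stated in this README; treated here as the tested configuration.
EXPECTED_PYTHON = (3, 6)
EXPECTED_TORCH = "1.10.1"


def version_matches(found, expected):
    """Compare version strings, ignoring local build suffixes such as '+cu113'."""
    return str(found).split("+")[0] == expected


def check_environment():
    """Return a list of warnings; an empty list means both versions match."""
    warnings = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        warnings.append(
            "Python %d.%d found; this README states 3.6.13" % sys.version_info[:2]
        )
    try:
        import torch
    except ImportError:
        warnings.append("PyTorch is not installed")
    else:
        if not version_matches(torch.__version__, EXPECTED_TORCH):
            warnings.append(
                "PyTorch %s found; this README states %s"
                % (torch.__version__, EXPECTED_TORCH)
            )
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("WARNING:", message)
```

A mismatch is only a warning: other versions may work, but they are not the ones this README names.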
We provide the detailed construction process of our proposed YouTube-VLN dataset in YouTube_VLN.md. The whole process takes some time. If you want to use the generated dataset directly for training, please download the following data :blush:
1. Download the image features (11 files in total) and put them into data/YouTube-VLN/youtube_img_features:
image features 0, image features 1, image features 2, image features 3, image features 4, image features 5, image features 6, image features 7, image features 8, image features 9, image features 10
2. Download the trainset and testset and put them into data/YouTube-VLN/ytb.
3. Download the checkpoint of ViLBERT pre-trained on Conceptual Captions and put it into data/YouTube-VLN.
4. Download the matterport-ResNet-101-faster-rcnn features, unzip them, and put them into data/YouTube-VLN.
5. Download the instruction template and put it into data/task.
6. Run download.py to download the remaining task data:
python scripts/download.py
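Before launching training, it can be useful to verify that the downloads landed where the steps above expect them. The following is a minimal sketch, assuming it is run from the repository root; `check_layout` is a hypothetical helper, not part of the repository. It checks only the directories and the feature-file count named in the steps above, since the individual file names are not listed here.

```python
import os

# Directories named in the download steps, relative to the repository root.
EXPECTED_DIRS = [
    "data/YouTube-VLN/youtube_img_features",
    "data/YouTube-VLN/ytb",
    "data/task",
]
FEATURE_DIR = "data/YouTube-VLN/youtube_img_features"
NUM_FEATURE_FILES = 11  # step 1 states 11 image feature files


def check_layout(root="."):
    """Return a list of problems found; an empty list means the layout looks complete."""
    problems = []
    for rel in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(root, rel)):
            problems.append("missing directory: " + rel)

    feature_dir = os.path.join(root, FEATURE_DIR)
    if os.path.isdir(feature_dir):
        # Count regular files only; subdirectories are ignored.
        files = [
            name
            for name in os.listdir(feature_dir)
            if os.path.isfile(os.path.join(feature_dir, name))
        ]
        if len(files) != NUM_FEATURE_FILES:
            problems.append(
                "expected %d files in %s, found %d"
                % (NUM_FEATURE_FILES, FEATURE_DIR, len(files))
            )
    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print(problem)
```

The check is deliberately shallow: it does not validate file contents or the data fetched by the download script.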
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node 4 \
--master_port 1234 \
-m pretrain \
--pre_dataset ytb \
--from_pretrained data/pretrained_model.bin \
--save_name ytbvln_2e5_500_MRT \
--prefix merge+ \
--separators \
--masked_vision \
--masked_language \
--ranking \
--traj_judge \
--batch_size 8 \
--learning_rate 2e-5 \
--num_epochs 500 \
--save_epochs 100
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node 4 \
--master_port 5555 \
-m train \
--from_pretrained result/ytbvln_2e5_500_MRT/data/best_ranking.bin \
--save_name ytbvln_2e5_500_MRT_ranking_30M \
--masked_vision \
--masked_language \
--batch_size 12 \
--num_epochs 30
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
--nproc_per_node 8 \
--master_port 5555 \
-m train \
--from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
--save_name ytbvln_2e5_500_MRT_ranking_30M_30RS \
--shuffle_visual_features \
--ranking \
--batch_size 16 \
--num_epochs 30
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
--nproc_per_node 8 \
--master_port 5555 \
-m train \
--from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
--save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
--prefix aug+ \
--beam_prefix aug_ \
--shuffle_visual_features \
--ranking \
--batch_size 16 \
--num_epochs 30
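The training commands above chain together: each stage's `--from_pretrained` argument points at a checkpoint written under the previous stage's `--save_name`. The sketch below makes that pattern explicit. It is inferred only from the paths shown in these commands; `checkpoint_path` is a hypothetical helper, and the zero-indexed epoch naming is an assumption based on `--num_epochs 30` being followed by `29.bin`.

```python
def checkpoint_path(save_name, checkpoint, root="result"):
    """Build <root>/<save_name>/data/<checkpoint>.bin, the pattern seen in the commands above."""
    return "/".join([root, save_name, "data", checkpoint + ".bin"])


# Save names taken from the commands above.
PRETRAIN = "ytbvln_2e5_500_MRT"
MASKED_STAGE = PRETRAIN + "_ranking_30M"

# Checkpoint that the first fine-tuning stage loads from pre-training.
pretrain_ckpt = checkpoint_path(PRETRAIN, "best_ranking")

# Assumption: epoch checkpoints are numbered from 0, so a 30-epoch run ends at 29.
num_epochs = 30
masked_ckpt = checkpoint_path(MASKED_STAGE, str(num_epochs - 1))

if __name__ == "__main__":
    print(pretrain_ckpt)
    print(masked_ckpt)
```

If you change `--save_name` or `--num_epochs` in an earlier stage, the `--from_pretrained` path of the following stage needs to change accordingly.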
CUDA_VISIBLE_DEVICES=0 python test.py \
--from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M_30RSA/data/best_unseen.bin \
--save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
--split val_unseen
python scripts/calculate-metrics.py results/ytbvln_2e5_500_MRT_ranking_30M_30RSA/test_val_unseen/_results_val_unseen.json
Here we provide our trained model; feel free to test it.
If you find this work helpful, please kindly consider citing our paper:
@article{lin2023ytbvln,
title = {Learning Vision-and-Language Navigation from YouTube Videos},
author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
journal = {arXiv preprint arXiv:2307.11984},
year = {2023},
}
@misc{lin2023ytbvlncode,
title = {Learning Vision-and-Language Navigation from YouTube Videos},
author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
howpublished = {\url{https://github.com/JeremyLinky/YouTube-VLN}},
year = {2023},
}
Our code is partially modified from Airbert, video-dqn and Probes-VLN. We thank the authors for their excellent work; please consider citing them as well.
For any questions, please feel free to file an issue or contact us :revolving_hearts::
Kunyang Lin: imkunyanglin@gmail.com
Diwei Huang: sediweihuang@gmail.com