linjieli222/HERO

How long does it take for pre-training on TV with MLM+MNCE from scratch?

HenryHZY opened this issue · 4 comments

@linjieli222
Hi, thanks for your great project!
As mentioned in your paper, the best pre-trained HERO model needs about 3 weeks of training on 16 V100 GPUs.
Due to GPU and memory limitations, I would like to first pre-train on TV with MLM+MNCE (that is, L2 in Table 1 of your paper).

I would like to ask three questions:

  1. How long does it take for pre-training on TV with MLM+MNCE from scratch? (L2 in Table 1 in your paper)

  2. Could you please show me the commands to conduct pre-training on TV with MLM+MNCE and fine-tuning on TVR from scratch? I am a novice in pre-training projects. :)

    I think I need to conduct this experiment in 7 steps:

    1/ download TV dataset
    2/ Text & Video feature extraction from TV dataset
      or directly use the Text & Video features provided by you
    3/ pre-training on TV with MLM+MNCE
    
    4/ download TVR dataset
    5/ Text & Video feature extraction from TVR dataset
      or directly use the Text & Video features provided by you
    6/ fine-tuning & inference on TVR
    7/ submit results to TVR codalab
    
  3. I find that downloading with bash scripts/download_tvr.sh $PATH_TO_STORAGE is too slow, less than 1 MB/s. Do you have another download server?
    [Done. No need to reply to this question.]

@linjieli222
For question 2, are the following commands correct? (Just copied from your README.md)

1/ download TV dataset
2/ Text & Video feature extraction from TV dataset
Here, I directly use the Text & Video features provided by you:

# outside of the container
bash scripts/download_tv_pretrain.sh $PATH_TO_STORAGE

3/ pre-training on TV with MLM+MNCE

# inside of the container
horovodrun -np 16 python pretrain.py --config config/pretrain-tv-16gpu.json --output_dir $PRETRAIN_EXP

with the task list in config/pretrain-tv-16gpu.json changed from
"tasks": ["mlm", "mfm-nce", "fom", "vsm"]
to
"tasks": ["mlm", "mfm-nce"]

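The task-list change above could also be scripted rather than edited by hand. The following is a minimal sketch, not part of the HERO repository: restrict_tasks is a hypothetical helper, and the dictionary stands in for the parsed JSON config, of which only the "tasks" entry is taken from this thread. Any other per-task settings in the real file are not handled here.

```python
import json


def restrict_tasks(config, keep):
    """Return a copy of config whose "tasks" list contains only names in keep."""
    unknown = [t for t in keep if t not in config["tasks"]]
    if unknown:
        raise ValueError(f"tasks not in original config: {unknown}")
    updated = dict(config)
    # Preserve the original ordering of the tasks that are kept.
    updated["tasks"] = [t for t in config["tasks"] if t in keep]
    return updated


# Stand-in for the parsed contents of config/pretrain-tv-16gpu.json;
# only the "tasks" entry comes from this thread.
original = {"tasks": ["mlm", "mfm-nce", "fom", "vsm"]}
ablation = restrict_tasks(original, ["mlm", "mfm-nce"])
print(json.dumps(ablation))
```

For the real file, one would load it with json.load, pass the result through the helper, and write it to a new path with json.dump so that the original 16-GPU config stays untouched.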
4/ download TVR dataset
5/ Text & Video feature extraction from TVR dataset
Here, I directly use the Text & Video features provided by you

bash scripts/download_tvr.sh $PATH_TO_STORAGE

6/ fine-tuning & inference on TVR

# fine-tuning, inside the container
horovodrun -np 8 python train_vcmr.py --config config/train-tvr-8gpu.json

# inference, inside the container
horovodrun -np 8 python eval_vcmr.py --query_txt_db /txt/tvr_val.db/ --split val \
    --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
    --output_dir /storage/tvr_default/ --checkpoint 4800 --fp16 --pin_mem

7/ submit results to TVR codalab

It was more than a year ago when we conducted the pre-training ablation experiments. From what I recall, it may take about 2-3 days on 8 GPUs.

Note that you will need to reduce the pre-training steps by half for MLM+MFM-NCE if you want to strictly follow our settings in the pre-training ablation table.

And remember to change the pretrained checkpoints in the config/train-tvr-8gpu.json for finetuning.
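The two adjustments above (halving the pre-training steps and pointing the fine-tuning config at the new checkpoint) can be sketched in Python. This is only an illustration: the key names num_train_steps and checkpoint, the step count, and both paths are assumptions that are not confirmed in this thread, so check them against the actual JSON files before relying on them.

```python
def halve_steps(config, key="num_train_steps"):
    """Return a copy of config with the step count under `key` halved."""
    updated = dict(config)
    updated[key] = config[key] // 2
    return updated


def set_checkpoint(config, checkpoint_path, key="checkpoint"):
    """Return a copy of config that loads weights from checkpoint_path."""
    updated = dict(config)
    updated[key] = checkpoint_path
    return updated


# Placeholder configs: key names, step count, and paths are hypothetical
# stand-ins for the real pre-training and fine-tuning JSON files.
pretrain_cfg = {"num_train_steps": 100000}
finetune_cfg = {"checkpoint": "/path/to/default_pretrained.pt"}

pretrain_cfg = halve_steps(pretrain_cfg)
finetune_cfg = set_checkpoint(finetune_cfg, "/path/to/pretrain_exp/ckpt/model.pt")

print(pretrain_cfg["num_train_steps"])
print(finetune_cfg["checkpoint"])
```

Both helpers return copies rather than mutating their input, which makes it easier to keep the default configs alongside the ablation variants.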

One more useful tip: if you find the download slow, please use azcopy. You can refer to VALUE-Leaderboard/StarterCode/scripts/download_tvr.sh.

Thanks for your quick reply!
VALUE is really a great project, which contains VALUE-StarterCode and VALUE-DataRelease.
Maybe I could use VALUE-StarterCode as a better starting point for my adventure in video pre-training.

I will close this issue for now and reopen it if I have further questions. Thanks again.