facebookresearch/fairseq

BART Pretraining Script

gyuwankim opened this issue · 21 comments

❓ Questions and Help

First of all, thanks for sharing the BART model checkpoints and the code to run them.

What is your question?

Could you provide the pretraining script used for the BART models?

I would like to train a BART model for my own language.
(Of course, I am aware of the mBART models that support other languages, but my target task is not MT, so I believe training BART only on data in my target language might work better.)
Although I could figure out the configuration based on the paper, it is easy to miss important training details.
A training script like the one for RoBERTa (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md) would be highly beneficial.

Thanks a lot in advance!

Hello,

I am also very interested in training a customized BART. Have you got any updates?

I'm also very interested in the pretraining script. Any update? @ngoyal2707 @yinhanliu

Hi, are there any updates on the BART pretraining script?

I'm also highly interested in this. Is there any update?

stale commented

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

any update?

any update?

Any update please?

Sorry for the (very) slow reply, this is actually the first time someone pointed me at this issue!

This command should set the hyperparameters from the original training run, though I haven't tested it with a recent fairseq, so you may need to fiddle with it a bit:
python -O train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --short-seq-prob 0.0 --arch denoising_large \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --min-lr 1e-09 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos \
    --decoder-learned-pos --rotate 0.0 --mask-random 0.1 --permute-sentences 1.0 \
    --insert 0.0 --poisson-lambda 3.5 --dataset-impl mmap --bpe gpt2 --num-workers 4

Hope that helps!

Thank you, Mike, for sharing the pre-training command. I think we should remove --min-lr 1e-09, because it causes training to finish before it even starts, and also omit --short-seq-prob 0.0. Also, the effective batch size with these settings is 6,400 sequences, not 8,192 as in RoBERTa: batch size = (max-tokens × update-freq × world-size) / tokens-per-sample = (3200 × 4 × 256) / 512 = 3,276,800 / 512 = 6,400 (a quick check is shown below). I also think the mask value should be 0.15 for the base model.
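
A minimal sketch of that arithmetic, using just the flag values from the command above:

# Effective batch size implied by the command-line flags above.
max_tokens = 3200          # --max-tokens (tokens per GPU per step)
update_freq = 4            # --update-freq (gradient accumulation steps)
world_size = 256           # --distributed-world-size (number of GPUs)
tokens_per_sample = 512    # --tokens-per-sample (sequence length)

tokens_per_update = max_tokens * update_freq * world_size        # 3,276,800 tokens
sequences_per_update = tokens_per_update // tokens_per_sample    # 6,400 sequences
print(tokens_per_update, sequences_per_update)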

This command leads to a few issues, as you may have suspected:

  • --arch denoising_large does not exist; bart_large might be a good alternative.
  • --permute-sentences apparently can't be 1.0 (--permute-sentences: invalid int value: '1.0'); it has to be 1 instead.
  • unrecognized arguments: --short-seq-prob 0.0 --min-lr 1e-09

So the final command is:

python train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --arch bart_base \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --dropout 0.1 \
    --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos \
    --decoder-learned-pos --rotate 0.0 --mask-random 0.1 --permute-sentences 1 \
    --insert 0.0 --poisson-lambda 3.5 --dataset-impl mmap --bpe gpt2 --num-workers 4
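
Once training finishes, a quick way to sanity-check the resulting checkpoint is through the hub interface described in examples/bart/README.md. A minimal sketch, assuming your checkpoints live in checkpoints/ and your binarized data in data-bin/ (both paths are placeholders, and the exact interface may differ slightly across fairseq versions):

from fairseq.models.bart import BARTModel

# Placeholder paths: point these at your own checkpoint directory and binarized data.
bart = BARTModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/',
)
bart.eval()  # disable dropout for evaluation

tokens = bart.encode('Hello world!')      # BPE-encode and binarize a sentence
features = bart.extract_features(tokens)  # features from the last decoder layer
print(features.shape)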

@mikelewis0, what is the average loss at the end of pre-training?

Could someone who has pre-trained BART share the loss and target language?
I am pre-training for Portuguese and would like to know if an average loss of around 2 is ok.

IIRC it was a little under 2 on English. It's not very meaningful to compare these numbers across languages, as they are strongly influenced by how the data is tokenized.
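
For interpreting these numbers: if you're reading the "loss" field from fairseq's JSON training logs, it is (to my understanding) the base-2 cross-entropy per token, so converting it to a per-token perplexity makes the tokenization dependence explicit. A rough sketch, assuming base-2 loss:

import math

reported_loss = 2.0              # e.g. "a little under 2" at the end of pre-training
perplexity = 2 ** reported_loss  # assumes fairseq's loss is logged in bits per token
print(perplexity)                # ~4: the model hesitates among ~4 tokens per position

# If your loss is a natural-log cross-entropy instead, use math.exp(reported_loss).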

Yes, sure! I was curious about how low the loss could go, because I'm only pre-training for 5 epochs due to hardware limitations (5,927 steps per epoch × 5 epochs = 29,635 training steps). This is far from the 500k steps you used, following the RoBERTa paper.

Thanks for the answer!

I have pre-trained both T5 and BART, and the final loss depends heavily on the corpus and the masking ratio you use. A larger corpus means the model needs more time to capture contextual representations, so the loss tends to stay higher. Using a high masking ratio (>15%) will also increase the loss. It is also worth noting that the loss value says little about performance on downstream tasks. For example, you can pre-train BART on only 50 MB of text and reach a very low loss, but downstream performance will be very poor, because you need at least ~13 GB (similar to BERT) to capture contextual representations well enough for effective transfer learning. With a masking ratio of 15%, the loss tends to end up below 0.5.

Thank you for the scripts above, I also ran them successfully!

@mikelewis0 Hi, Mike. I am a little confused about the sentence permutation in denoising_dataset.py. The full_stop_index corresponds to the eos token rather than the actual full-stop token.
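
For anyone else puzzling over this: sentence permutation in the denoising task boils down to splitting the token sequence at whatever index full_stop_index resolves to and shuffling the resulting segments. A simplified illustration of that idea (not the actual fairseq code; the boundary token id below is made up):

import random

def permute_sentences(tokens, boundary_idx, seed=0):
    # Split a flat token list into "sentences" at the boundary token
    # (whatever full_stop_index resolves to), shuffle them, and re-concatenate.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == boundary_idx:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no boundary form the last "sentence"
        sentences.append(current)

    random.Random(seed).shuffle(sentences)
    return [tok for sent in sentences for tok in sent]

# Toy example: 9 stands in for the boundary token id.
print(permute_sentences([1, 2, 9, 3, 4, 5, 9, 6, 7], boundary_idx=9))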

I'd like to know whether this pre-training is continued pre-training (i.e., starting from an already pretrained BART checkpoint).

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.