facebookresearch/fairseq

BART Pretraining Script

gyuwankim opened this issue · 21 comments

❓ Questions and Help

First of all, thanks for sharing the BART model checkpoints and the code to run them.

What is your question?

Could you provide the pretraining script used for the BART models?

I would like to train a BART model for my own language.
(Of course, I am aware of the mBART models that support other languages, but my target task is not MT, so I believe training BART only on data in my target language might work better.)
Although I could figure out the configuration based on the paper, it is easy to miss important training details.
A training script like the one for RoBERTa (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md) would be highly beneficial.

Thanks a lot in advance!

Hello,

I am also very interested in training a customized BART. Have you got any updates?

I'm also very interested in the pretraining script. Any update? @ngoyal2707 @yinhanliu

Hi, are there any updates on the BART pretraining script?

I'm also highly interested in this. Is there any update?

stale commented

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

any update?

any update?

Any update please?

Sorry for the (very) slow reply, this is actually the first time someone pointed me at this issue!

This command should set the hyperparameters from the original training run, though I haven't tested it with a recent fairseq, so you may need to fiddle with it a bit:
python -O train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --short-seq-prob 0.0 --arch denoising_large \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --min-lr 1e-09 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos \
    --decoder-learned-pos --rotate 0.0 --mask-random 0.1 --permute-sentences 1.0 \
    --insert 0.0 --poisson-lambda 3.5 --dataset-impl mmap --bpe gpt2 --num-workers 4

Hope that helps!

Thank you, Mike, for sharing the pre-training command. I think we should remove --min-lr 1e-09, because it causes training to finish before it even starts, and also omit --short-seq-prob 0.0. Also, the effective batch size with these settings is 6,400 sequences, not 8,192 as in RoBERTa: batch size = (max-tokens × update-freq × world-size) / tokens-per-sample = (3200 × 4 × 256) / 512 = 3,276,800 / 512 = 6,400 (a quick check is shown below). I also think the mask value should be 0.15 for the base model.
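
A minimal sketch of that arithmetic, using just the flag values from the command above:

# Effective batch size implied by the command-line flags above.
max_tokens = 3200          # --max-tokens (tokens per GPU per step)
update_freq = 4            # --update-freq (gradient accumulation steps)
world_size = 256           # --distributed-world-size (number of GPUs)
tokens_per_sample = 512    # --tokens-per-sample (sequence length)

tokens_per_update = max_tokens * update_freq * world_size        # 3,276,800 tokens
sequences_per_update = tokens_per_update // tokens_per_sample    # 6,400 sequences
print(tokens_per_update, sequences_per_update)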

This command leads to a few issues, as you may have suspected:

  • --arch denoising_large does not exist; bart_large might be a good alternative.
  • --permute-sentences apparently can't be 1.0 (--permute-sentences: invalid int value: '1.0'); it has to be 1 instead.
  • unrecognized arguments: --short-seq-prob 0.0 --min-lr 1e-09

So the final command is:

python train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --arch bart_base \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --dropout 0.1 \
    --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos \
    --decoder-learned-pos --rotate 0.0 --mask-random 0.1 --permute-sentences 1 \
    --insert 0.0 --poisson-lambda 3.5 --dataset-impl mmap --bpe gpt2 --num-workers 4
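
Once training finishes, a quick way to sanity-check the resulting checkpoint is through the hub interface described in examples/bart/README.md. A minimal sketch, assuming your checkpoints live in checkpoints/ and your binarized data in data-bin/ (both paths are placeholders, and the exact interface may differ slightly across fairseq versions):

from fairseq.models.bart import BARTModel

# Placeholder paths: point these at your own checkpoint directory and binarized data.
bart = BARTModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/',
)
bart.eval()  # disable dropout for evaluation

tokens = bart.encode('Hello world!')      # BPE-encode and binarize a sentence
features = bart.extract_features(tokens)  # features from the last decoder layer
print(features.shape)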

@mikelewis0, what is the average loss at the end of pre-training?

Could someone who has pre-trained BART share the loss and target language?
I am pre-training for Portuguese and would like to know if an average loss of around 2 is ok.

IIRC it was a little under 2 on English. It's not very meaningful to compare these numbers across languages, as they are strongly influenced by how the data is tokenized.
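
For interpreting these numbers: if you're reading the "loss" field from fairseq's JSON training logs, it is (to my understanding) the base-2 cross-entropy per token, so converting it to a per-token perplexity makes the tokenization dependence explicit. A rough sketch, assuming base-2 loss:

import math

reported_loss = 2.0              # e.g. "a little under 2" at the end of pre-training
perplexity = 2 ** reported_loss  # assumes fairseq's loss is logged in bits per token
print(perplexity)                # ~4: the model hesitates among ~4 tokens per position

# If your loss is a natural-log cross-entropy instead, use math.exp(reported_loss).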

Yes, sure! I was curious about how low the loss could go, because I'm only pre-training for 5 epochs due to hardware limitations (5,927 steps per epoch × 5 epochs = 29,635 training steps). This is far from the 500k steps you used, following the RoBERTa paper.

Thanks for the answer!

I have pre-trained both T5 and BART, and the final loss depends heavily on the corpus and the masking ratio you use. A larger corpus means the model needs more time to capture contextual representations, so the loss tends to stay higher. Using a high masking ratio (>15%) will also increase the loss. It is also worth noting that the loss value says little about performance on downstream tasks. For example, you can pre-train BART on only 50 MB of text and reach a very low loss, but downstream performance will be very poor, because you need at least ~13 GB (similar to BERT) to capture contextual representations well enough for effective transfer learning. With a masking ratio of 15%, the loss tends to end up below 0.5.

Thank you for the scripts above, I also ran them successfully!

@mikelewis0 Hi, Mike. I am a little confused about the sentence permutation in denoising_dataset.py. The full_stop_index corresponds to the eos token rather than the actual full-stop token.
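
For anyone else puzzling over this: sentence permutation in the denoising task boils down to splitting the token sequence at whatever index full_stop_index resolves to and shuffling the resulting segments. A simplified illustration of that idea (not the actual fairseq code; the boundary token id below is made up):

import random

def permute_sentences(tokens, boundary_idx, seed=0):
    # Split a flat token list into "sentences" at the boundary token
    # (whatever full_stop_index resolves to), shuffle them, and re-concatenate.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == boundary_idx:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no boundary form the last "sentence"
        sentences.append(current)

    random.Random(seed).shuffle(sentences)
    return [tok for sent in sentences for tok in sent]

# Toy example: 9 stands in for the boundary token id.
print(permute_sentences([1, 2, 9, 3, 4, 5, 9, 6, 7], boundary_idx=9))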

I'd like to know whether this pre-training is continued pre-training (i.e., starting from an already pretrained BART checkpoint).

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.