google-research/bigbird

Question about pre-trained weights

patrickvonplaten opened this issue · 3 comments

Thanks so much for releasing BigBird!

Quick question about the pre-trained weights. Do bigbr_large and bigbr_base correspond to BERT-like, encoder-only checkpoints, and bigbp_large to the encoder-decoder version?

Yes, you are correct. I should have provided more detailed documentation.

  • bigbr_large and bigbr_base correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. layer norm is applied after the attention layer (see the sketch after this list). However, following Rothe et al., we can partially reuse them in an encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
  • bigbp_large is a Pegasus-like encoder-decoder model. Again, following the original Pegasus implementation, it is a transformer with pre-normalization, i.e. layer norm is applied before the attention layer. It has a full set of separate encoder and decoder weights.
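For anyone unsure about the post- vs. pre-normalization distinction, here is a minimal sketch in plain Python (not code from this repo); `attention`, `ffn`, and `layer_norm` are placeholder callables standing in for the actual sub-layers, and real blocks would use separate layer-norm parameters per sub-layer:

```python
def post_norm_block(x, attention, ffn, layer_norm):
    # BERT/RoBERTa-style (bigbr_*): residual add first, then normalize.
    x = layer_norm(x + attention(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x, attention, ffn, layer_norm):
    # Pegasus-style (bigbp_*): normalize before each sub-layer,
    # residual add happens on the un-normalized stream.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x
```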

I have updated the README and am closing this issue for now. Feel free to re-open if there are any further questions.

Awesome thanks so much for the quick reply :-)