google-research/bigbird

Question about pre-trained weights

patrickvonplaten opened this issue · 3 comments

Thanks so much for releasing BigBird!

Quick question about the pre-trained weights. Do bigbr_large and bigbr_base correspond to BERT-like, encoder-only checkpoints, and bigbp_large to the encoder-decoder version?

Yes, you are correct. I should have provided more detailed documentation.

  • bigbr_large and bigbr_base correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. layer norm is applied after the attention layer (see the sketch after this list). However, following Rothe et al., we can partially reuse them in an encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
  • bigbp_large is a Pegasus-like encoder-decoder model. Again, following the original Pegasus implementation, it is a transformer with pre-normalization, i.e. layer norm is applied before the attention layer. It has a full set of separate encoder and decoder weights.
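For anyone unsure about the post- vs. pre-normalization distinction, here is a minimal sketch in plain Python (not code from this repo); `attention`, `ffn`, and `layer_norm` are placeholder callables standing in for the actual sub-layers, and real blocks would use separate layer-norm parameters per sub-layer:

```python
def post_norm_block(x, attention, ffn, layer_norm):
    # BERT/RoBERTa-style (bigbr_*): residual add first, then normalize.
    x = layer_norm(x + attention(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x, attention, ffn, layer_norm):
    # Pegasus-style (bigbp_*): normalize before each sub-layer,
    # residual add happens on the un-normalized stream.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x
```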

I have updated the README and am closing this issue for now. Feel free to re-open if there are any further questions.

Awesome thanks so much for the quick reply :-)