Question about pre-trained weights
patrickvonplaten opened this issue · 3 comments
patrickvonplaten commented
Thanks so much for releasing BigBird!
Quick question about the pre-trained weights: do bigbr_large and bigbr_base correspond to BERT-like encoder-only checkpoints, and bigbp_large to the encoder-decoder version?
manzilz commented
Yes, you are correct. I should have provided more detailed documentation.
bigbr_large and bigbr_base correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. the layer norm happens after the attention layer. However, following Rothe et al., they can be used partially in an encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
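To make the coupling concrete, here is a minimal sketch of the idea in plain PyTorch. It is purely illustrative and not the actual BigBird code or the roberta_base.sh setup; all sizes and names are made up. Each decoder layer simply shares its self-attention and feed-forward modules with the matching encoder layer, so loading an encoder-only checkpoint initializes most of the decoder as well.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 64, 4, 2

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)

# "Couple" encoder and decoder: each decoder layer shares its self-attention
# and feed-forward modules with the corresponding encoder layer, so an
# encoder-only checkpoint initializes both sides. The cross-attention
# (multihead_attn) has no encoder counterpart and stays freshly initialized.
for enc_layer, dec_layer in zip(encoder.layers, decoder.layers):
    dec_layer.self_attn = enc_layer.self_attn
    dec_layer.linear1 = enc_layer.linear1
    dec_layer.linear2 = enc_layer.linear2

src = torch.randn(1, 10, d_model)   # (batch, src_len, d_model)
tgt = torch.randn(1, 7, d_model)    # (batch, tgt_len, d_model)
out = decoder(tgt, encoder(src))    # (1, 7, d_model)
```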
bigbp_large is a Pegasus-like encoder-decoder model. Again following the original implementation of Pegasus, it is a transformer with pre-normalization. It has a full set of separate encoder and decoder weights.
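For the post-norm vs. pre-norm distinction, here is another minimal illustrative sketch (again plain PyTorch, not the BigBird code): the bigbr_* checkpoints follow the BERT/RoBERTa-style post-norm block, while bigbp_large follows the Pegasus-style pre-norm block.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """BERT/RoBERTa style: LayerNorm is applied AFTER the residual add."""
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)   # post-normalization

class PreNormBlock(nn.Module):
    """Pegasus style: LayerNorm is applied BEFORE the attention layer."""
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)                 # pre-normalization
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out

x = torch.randn(1, 10, 64)
print(PostNormBlock()(x).shape, PreNormBlock()(x).shape)  # both (1, 10, 64)
```

PyTorch's built-in nn.TransformerEncoderLayer exposes the same switch via its norm_first flag.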
manzilz commented
I have updated the readme and I am closing this issue for now. Feel free to re-open if there are any further questions.
patrickvonplaten commented
Awesome, thanks so much for the quick reply :-)