allenai/scibert

Pretraining SciBERT

Closed this issue · 8 comments

Hi,
The repo does not seem to contain the code to pretrain the model on Semantic Scholar. Do you plan to release that code and the pretraining data? Thanks!

Yichong

We used the BERT code from Google: https://github.com/google-research/bert
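For anyone replicating this, the usual flow with that codebase has two steps: build masked-LM/NSP TFRecords with `create_pretraining_data.py`, then train with `run_pretraining.py`. Below is a minimal sketch of driving both scripts from Python; every path and hyperparameter value is a placeholder, not the actual SciBERT configuration.

```python
# Rough sketch of the two-step google-research/bert pretraining flow.
# All file paths and hyperparameter values are placeholders.
import subprocess

# Step 1: convert sentence-per-line text (blank line between documents)
# into masked-LM / next-sentence-prediction TFRecords.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus.txt",
    "--output_file=pretrain.tfrecord",
    "--vocab_file=vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: run the pretraining loop on those TFRecords.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=pretrain.tfrecord",
    "--output_dir=pretraining_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--num_train_steps=100000",
    "--num_warmup_steps=10000",
    "--learning_rate=1e-4",
], check=True)
```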

Thanks! And where did you get the data?

As mentioned in the paper, we used the Semantic Scholar corpus, which is not publicly available. The publicly available part is https://api.semanticscholar.org/corpus/, which has titles and abstracts but not full text.
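For anyone who wants to build a title-plus-abstract pretraining corpus from that release, here is a minimal sketch. It assumes the corpus is distributed as gzipped JSON-lines files with `title` and `paperAbstract` fields, as in older releases; the file pattern and field names may differ in newer versions.

```python
# Sketch: extract title + abstract text from a Semantic Scholar corpus
# release. File pattern and field names are assumptions based on older
# releases (s2-corpus-*.gz, "title", "paperAbstract").
import glob
import gzip
import json

def iter_documents(pattern="s2-corpus-*.gz"):
    """Yield one title+abstract string per paper record."""
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                title = record.get("title", "")
                abstract = record.get("paperAbstract", "")
                if title or abstract:
                    yield (title + "\n" + abstract).strip()

if __name__ == "__main__":
    with open("corpus.txt", "w", encoding="utf-8") as out:
        for doc in iter_documents():
            # create_pretraining_data.py expects blank lines between
            # documents; proper sentence splitting is omitted here.
            out.write(doc + "\n\n")
```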

@ibeltagy Thanks for the reply. What mlm_loss should I expect the model to converge to if I use the same dataset?

With 512 tokens, the losses are around the following numbers:

```
loss = 1.311045
masked_lm_accuracy = 0.7187241
masked_lm_loss = 1.2882848
next_sentence_accuracy = 0.9939219
next_sentence_loss = 0.0196654
```

@ibeltagy Thanks for the reply. Does that mean your pretrained SciBERT model reaches a masked_lm_accuracy of around 0.718? However, the original BERT model reaches around 0.98 masked_lm_accuracy and about 1.0 next_sentence_accuracy. Do you think a masked_lm_accuracy of around 0.718 is enough?
I am also training my own model on a customized dataset, which adds around 1000 new tokens that are not in the BERT vocabulary. My model currently reaches a bit over 0.7 masked_lm_accuracy and has improved very slowly since then. I would therefore like to know what masked_lm_accuracy or next_sentence_accuracy I should expect my pretrained model to achieve. Are there any tricks for fine-tuning the pretrained model on a customized corpus?
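One common way to add new tokens without resizing the checkpoint's embedding matrix is to overwrite the reserved [unusedN] slots in vocab.txt. The sketch below illustrates that trick; the file names are placeholders, and it assumes the standard BERT vocabulary, which reserves roughly a thousand [unused] entries.

```python
# Sketch: replace BERT's [unusedN] vocabulary slots with new domain tokens
# so the pretrained embedding matrix keeps its original shape.
# "vocab.txt", "new_tokens.txt", and "vocab_patched.txt" are placeholders.

def patch_vocab(vocab_path, new_tokens_path, out_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]
    with open(new_tokens_path, encoding="utf-8") as f:
        new_tokens = [t.strip() for t in f if t.strip()]

    # Only add tokens the vocabulary does not already cover.
    existing = set(vocab)
    new_tokens = [t for t in new_tokens if t not in existing]

    unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
    if len(new_tokens) > len(unused_slots):
        raise ValueError("more new tokens than [unused] slots; the embedding "
                         "matrix would have to be resized instead")

    for slot, token in zip(unused_slots, new_tokens):
        vocab[slot] = token

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")

patch_vocab("vocab.txt", "new_tokens.txt", "vocab_patched.txt")
```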

@sibyl1956 This is a good point, and I suspect it is due to the noisy PDF parse in the scientific corpus. Namely, we didn't do anything to remove the tables, equations, weird tokens, etc. that PDFBox emits when converting the raw PDFs to a text stream, and it is essentially impossible for the model to predict these masked tokens. We're currently curating an updated, larger, and cleaner version of the pretraining corpus, and will investigate whether the noisy tokens are the cause of this. As it stands, the currently released SciBERT weights are still very good for downstream tasks (but we can definitely do better).
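To make the noise point concrete, a heuristic pre-filter along the following lines could drop much of the table/equation residue before pretraining. This is only an illustrative sketch with arbitrary thresholds, not something that was applied to the released corpus.

```python
# Sketch: drop lines that look like table rows, equation fragments, or
# garbled PDF-parse output rather than running prose. Thresholds are
# arbitrary placeholders.

def looks_like_prose(line, min_alpha_ratio=0.7, min_words=5):
    """Return True if a line is mostly alphabetic running text."""
    stripped = line.strip()
    if not stripped:
        return False
    words = stripped.split()
    if len(words) < min_words:
        return False
    # Lines dominated by digits and symbols are likely tables/equations.
    alpha = sum(c.isalpha() or c.isspace() for c in stripped)
    if alpha / len(stripped) < min_alpha_ratio:
        return False
    # Many single-character "words" usually mean broken tokenization.
    if sum(len(w) == 1 for w in words) / len(words) > 0.4:
        return False
    return True

def filter_corpus(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if looks_like_prose(line):
                fout.write(line)

filter_corpus("corpus.txt", "corpus_filtered.txt")
```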

I'm closing this issue for now since it looks like the original question chain was answered. Feel free to reopen or start a new issue.