allenai/scibert

Pretrained SciRoBERTa weights release in the works?

Closed this issue · 16 comments

Given the success of RoBERTa (https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) on GLUE benchmarks and the like, is training a RoBERTa model over the Semantic Scholar corpus planned for release in this repo in the foreseeable future?

Otherwise, can someone provide hints on how to train a RoBERTa model on the Semantic Scholar corpus, and an estimate of the compute time needed? Thanks!

The Semantic Scholar corpus only has abstracts; I believe they built their own dataset from publicly available biomed/CS papers.

The paper may list where these sources are; then you could probably take the SciBERT weights and use the HuggingFace Transformers library to do RoBERTa-style training.
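For anyone who wants to try that route, here is a minimal, hedged sketch of continued masked-LM pretraining starting from the released SciBERT checkpoint with the Transformers `Trainer`. The corpus file name and all hyperparameters are placeholders, not the settings used by the SciBERT authors.

```python
# Hedged sketch: continue masked-LM pretraining from the released SciBERT
# checkpoint. "corpus.txt" (one paragraph per line) and the hyperparameters
# below are illustrative assumptions, not the authors' settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the raw text corpus and tokenize it.
raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking, as in BERT/RoBERTa-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="scibert-continued",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,  # emulate a large effective batch size
    learning_rate=1e-4,
    max_steps=100_000,
    save_steps=10_000,
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```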

Hey @davidefiocco, we're working on a scientific RoBERTa and will release it when it's finished. The prime concern is how long it'll take without TPUs or a ridiculous number of GPUs. We're currently working with modifications of the Huggingface code + PyTorch XLA to get TPU training working, but that code is still under active development.

@Santosh-Gupta, actually the released SciBERT weights were trained on full-text parses of the PDFs. Unfortunately, we didn't consider the viability of releasing that corpus when working on the project previously, and due to copyright issues we aren't able to release the pretraining corpus. We're very close to releasing a large pretraining corpus of full text, so stay tuned.

@kyleclo let me know if you need a hand with figuring out how to incorporate the TPUs. I started trying to figure out the optimal way of incorporating PyTorch XLA with Huggingface a few weeks ago, but fell off it: pytorch/xla#1217
huggingface/transformers#1540 (comment)

Or, if it's already been figured out, I would love to check it out.

actually the released SciBERT weights were trained on full-text parses of the PDFs

Ah, I didn't know that. Was the raw text just read from the PDF, or was there some sort of special parsing program used, such as the Allen AI Science Parse?

we're working on a scientific RoBERTa

Very cool! Since Semantic Scholar has expanded the fields available, would they be included in SciRoBERTa? Or would it be biomed+CS?

We're very close to releasing a large pretraining corpus of full text

Very much looking forward to this one!

@Santosh-Gupta, we are using this library https://github.com/allenai/tpu_pretrain for PyTorch/XLA TPU training
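For context on what PyTorch/XLA training involves at the lowest level (this is not the allenai/tpu_pretrain code, just a toy illustration of the XLA device and optimizer-step pattern that such training builds on):

```python
# Hedged sketch of PyTorch/XLA basics; the model and data are toy placeholders
# and this is NOT a reproduction of allenai/tpu_pretrain.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core visible to this process
model = nn.Linear(768, 768).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(8, 768, device=device)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    # xm.optimizer_step both applies the optimizer update and marks the XLA
    # graph boundary so the lazily accumulated ops actually run on the TPU.
    xm.optimizer_step(optimizer, barrier=True)
```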

Thanks!

Sounds really interesting :) I have a question regarding the vocab generation. I guess you're also using your own vocab; how did you manage to generate it? (I think RoBERTa uses the GPT-2 vocab, and OpenAI did not provide any tools for generating it...)

With RoBERTa, we are not going to change the vocab, because we want to continue pretraining rather than start from scratch.
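For reference on the vocab question: if someone did want a domain-specific byte-level BPE vocab in the GPT-2/RoBERTa style (which is not the plan here, per the comment above), the HuggingFace `tokenizers` library can train one. The corpus file and sizes below are placeholders.

```python
# Hedged sketch: train a byte-level BPE vocab (GPT-2/RoBERTa style) on your
# own text. "corpus.txt" and the vocab size are illustrative assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50265,  # roberta-base's vocab size, for comparison
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("scivocab-bpe", exist_ok=True)
tokenizer.save_model("scivocab-bpe")  # writes vocab.json and merges.txt
```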

Is there a chance SciRoBERTa could be released before the end of the year?

I'm also interested: is there an update on a planned SciRoBERTa release?

It's already been released; it's on the Hugging Face pretrained models hub.

@Santosh-Gupta I saw there are new models at https://huggingface.co/models?search=allenai but are these documented somehow?

@dmolony3, are you referring to a sci- version of RoBERTa ? we didn't release that but we have a biomedical one: https://huggingface.co/allenai/biomed_roberta_base
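A minimal usage sketch for the checkpoint linked above (the example sentence is arbitrary):

```python
# Load the released biomed_roberta_base checkpoint from the Hugging Face hub
# and run a single sentence through it.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")
model = AutoModel.from_pretrained("allenai/biomed_roberta_base")

inputs = tokenizer("EGFR mutations are common in lung adenocarcinoma.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```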

@ibeltagy I was referring to a sci version of RoBERTa but the biomed will work just as well for my purpose. Thanks!

I saw a cs_roberta_base model on Hugging Face but could not find any documentation regarding it. Is it a RoBERTa model pretrained on computer science papers?
If yes, was it pretrained from scratch using a new vocab, or was continued pretraining done?

I believe it's described in the paper titled something like "Don't Stop Pretraining".

@Soumyajain29 If you are interested in domain-adapted RoBERTa models, two papers are recommended:

  1. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (ACL 2020)
  2. Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation (ACL 2021)