allenai/scibert

Pretrained SciRoBERTa weights release in the works?

Closed this issue · 16 comments

Given the success of RoBERTa (https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) on GLUE benchmarks and the like, is training a RoBERTa model over the Semantic Scholar corpus planned for release in this repo in the foreseeable future?

Otherwise, can someone provide hints on how to train a RoBERTa model on the Semantic Scholar corpus, and an estimate of the compute time needed? Thanks!

The Semantic Scholar corpus only has abstracts; I believe they built their own dataset from publicly available biomed/CS papers.

The paper may list where these sources are; then you could probably take the SciBERT weights and use the HuggingFace Transformers library to do RoBERTa-style training.
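For anyone who wants to try that route, here is a minimal, hedged sketch of continued masked-LM pretraining starting from the released SciBERT checkpoint with the Transformers `Trainer`. The corpus file name and all hyperparameters are placeholders, not the settings used by the SciBERT authors.

```python
# Hedged sketch: continue masked-LM pretraining from the released SciBERT
# checkpoint. "corpus.txt" (one paragraph per line) and the hyperparameters
# below are illustrative assumptions, not the authors' settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the raw text corpus and tokenize it.
raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking, as in BERT/RoBERTa-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="scibert-continued",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,  # emulate a large effective batch size
    learning_rate=1e-4,
    max_steps=100_000,
    save_steps=10_000,
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```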

Hey @davidefiocco, we're working on a scientific RoBERTa and will release it when it's finished. The prime concern is how long it'll take without TPUs or a ridiculous number of GPUs. We're currently working with modifications of the Huggingface code + PyTorch XLA to get TPU training working, but that code is still under active development.

@Santosh-Gupta, actually the released SciBERT weights were trained on full-text parses of the PDFs. Unfortunately, we didn't consider the viability of releasing that corpus when working on the project previously, and due to copyright issues we aren't able to release the pretraining corpus. We're very close to releasing a large pretraining corpus of full text, so stay tuned.

@kyleclo let me know if you need a hand with figuring out how to incorporate the TPUs. I started trying to figure out the optimal way of incorporating PyTorch XLA with Huggingface a few weeks ago, but fell off it: pytorch/xla#1217
huggingface/transformers#1540 (comment)

Or, if it's already been figured out, I would love to check it out.

actually the released SciBERT weights were trained on full-text parses of the PDFs

Ah, I didn't know that. Was the raw text just read from the PDF, or was there some sort of special parsing program used, such as the Allen AI Science Parse?

we're working on a scientific RoBERTa

Very cool! Since Semantic Scholar has expanded the fields available, would they be included in SciRoBERTa? Or would it be biomed+CS?

We're very close to releasing a large pretraining corpus of full text

Very much looking forward to this one!

@Santosh-Gupta, we are using this library https://github.com/allenai/tpu_pretrain for PyTorch/XLA TPU training
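For context on what PyTorch/XLA training involves at the lowest level (this is not the allenai/tpu_pretrain code, just a toy illustration of the XLA device and optimizer-step pattern that such training builds on):

```python
# Hedged sketch of PyTorch/XLA basics; the model and data are toy placeholders
# and this is NOT a reproduction of allenai/tpu_pretrain.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core visible to this process
model = nn.Linear(768, 768).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(8, 768, device=device)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    # xm.optimizer_step both applies the optimizer update and marks the XLA
    # graph boundary so the lazily accumulated ops actually run on the TPU.
    xm.optimizer_step(optimizer, barrier=True)
```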

Thanks!

Sounds really interesting :) I have a question regarding the vocab generation. I guess you're also using your own vocab; how did you manage to generate it? (I think RoBERTa uses the GPT-2 vocab, and OpenAI did not provide any tools for generating it...)

With RoBERTa, we are not going to change the vocab, because we want to continue pretraining rather than start from scratch.
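For reference on the vocab question: if someone did want a domain-specific byte-level BPE vocab in the GPT-2/RoBERTa style (which is not the plan here, per the comment above), the HuggingFace `tokenizers` library can train one. The corpus file and sizes below are placeholders.

```python
# Hedged sketch: train a byte-level BPE vocab (GPT-2/RoBERTa style) on your
# own text. "corpus.txt" and the vocab size are illustrative assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50265,  # roberta-base's vocab size, for comparison
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("scivocab-bpe", exist_ok=True)
tokenizer.save_model("scivocab-bpe")  # writes vocab.json and merges.txt
```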

Is there a chance SciRoBERTa could be released before the end of the year?

I'm also interested: is there an update on a planned SciRoBERTa release?

It's already been released; it's on the Hugging Face pretrained models hub.

@Santosh-Gupta I saw there are new models at https://huggingface.co/models?search=allenai but are these documented somehow?

@dmolony3, are you referring to a sci- version of RoBERTa ? we didn't release that but we have a biomedical one: https://huggingface.co/allenai/biomed_roberta_base
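A minimal usage sketch for the checkpoint linked above (the example sentence is arbitrary):

```python
# Load the released biomed_roberta_base checkpoint from the Hugging Face hub
# and run a single sentence through it.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")
model = AutoModel.from_pretrained("allenai/biomed_roberta_base")

inputs = tokenizer("EGFR mutations are common in lung adenocarcinoma.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```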

@ibeltagy I was referring to a sci version of RoBERTa but the biomed will work just as well for my purpose. Thanks!

I saw a cs_roberta_base model on Hugging Face but could not find any documentation regarding it. Is it a RoBERTa model pretrained on computer science papers?
If yes, was it pretrained from scratch using a new vocab, or was continued pretraining done?

I believe it's described in the paper titled something like "Don't Stop Pretraining".

@Soumyajain29 If you are interested in domain-adapted RoBERTa models, two papers are recommended:

  1. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (ACL 2020)
  2. Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation (ACL 2021)