allenai/scibert

Original papers from Semantic Scholar

Closed this issue · 1 comment

Hey, just wondering if you still have the original documents from Semantic Scholar available, or if you could let me know how to go about compiling such a dataset? Thanks.

Hi jadeshi,
Sorry, unfortunately we can't release many of the original documents from Semantic Scholar because of licensing restrictions. To build a similar corpus, you can download free full-text biomedical papers from PubMed Central (PMC) https://www.ncbi.nlm.nih.gov/pmc/, plus papers from the ACL Anthology https://acl-arc.comp.nus.edu.sg/ and arXiv https://arxiv.org/help/bulk_data. Use any PDF-to-text tool to extract the full text of each paper, then process that text with https://github.com/allenai/SciSpaCy, as described in our paper. Hope this helps!
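
For anyone landing here later, below is a rough sketch of the text-processing half of that recipe. It is not the script used for SciBERT. It assumes the PDFs have already been converted to `.txt` files by a tool of your choice, and the function names (`clean`, `naive_split`, `build_corpus`) are made up for illustration. The regex sentence splitter is only a stand-in so the example runs with the standard library; for real use you would probably pass in a scispaCy-based splitter instead. The one-sentence-per-line layout with a blank line between documents is also an assumption about what your pretraining script expects.

```python
import re
from pathlib import Path
from typing import Callable, List


def clean(text: str) -> str:
    """Undo common PDF-extraction damage (illustrative, not exhaustive)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse newlines and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def naive_split(text: str) -> List[str]:
    """Placeholder sentence splitter.

    Assumption: in practice you would replace this with scispaCy, e.g.
    nlp = spacy.load("en_core_sci_sm"); [s.text for s in nlp(text).sents]
    (requires the scispaCy model to be installed separately).
    """
    # Split on whitespace that follows ., ! or ? and precedes a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def build_corpus(
    txt_dir: str,
    out_path: str,
    split: Callable[[str], List[str]] = naive_split,
) -> int:
    """Write one sentence per line, with a blank line after each document.

    Returns the number of sentences written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(txt_dir).glob("*.txt")):
            text = clean(path.read_text(encoding="utf-8", errors="ignore"))
            for sentence in split(text):
                out.write(sentence + "\n")
                count += 1
            out.write("\n")  # document boundary
    return count
```

You would call it as `build_corpus("papers_txt", "corpus.out")`, where both paths are placeholders. Keep the output file outside the input directory, or give it a non-`.txt` extension, so it is not picked up as an input document.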