allenai/specter

Matching articles from SPECTER's dataset with S2ORC IDs

zoranmedic opened this issue · 1 comment

Hi,

I want to match articles used in SPECTER's training and validation sets with the articles from S2ORC.
The problem is that S2ORC uses different paper IDs than SPECTER, so the article IDs in SPECTER's training and validation sets do not appear in the S2ORC dataset.

For example, this article appears in SPECTER's validation set with the ID 793efec2096f6511c45430ff5f2f08a362dcf3eb.
The Corpus ID of this paper is 11967120, and S2ORC uses this Corpus ID as its paper_id. (I found the Corpus ID on the Semantic Scholar webpage linked above.)

Is there any easy way to obtain these Corpus IDs for articles from SPECTER's dataset?
I'm aware I could use Semantic Scholar's API for this, but I think that would be very time-consuming (SPECTER's dataset contains over 165k unique article IDs if I calculated correctly).

Thanks!

This JSON should contain the S2ORC IDs for most papers from SPECTER's training set: https://github.com/malteos/scincl/releases/download/0.1/specter__s2id_to_s2orc_paper_id.json.gz
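
For anyone landing here later, below is a minimal sketch of how the file could be used once downloaded. It assumes (based only on the filename, not verified against the file itself) that the JSON is a flat object mapping SPECTER's S2 IDs to S2ORC paper IDs. The function names `load_id_mapping` and `match_ids` are made up for illustration. Since the file is said to cover "most" papers, the sketch also collects the IDs that have no match.

```python
import gzip
import json


def load_id_mapping(path):
    """Load a gzipped JSON file.

    Assumption: the file holds one flat JSON object of the form
    {specter_s2_id: s2orc_paper_id, ...}.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def match_ids(specter_ids, mapping):
    """Split SPECTER IDs into those found in the mapping and those missing."""
    matched = {}
    missing = []
    for sid in specter_ids:
        if sid in mapping:
            matched[sid] = mapping[sid]
        else:
            missing.append(sid)
    return matched, missing
```

After downloading the file, something like `mapping = load_id_mapping("specter__s2id_to_s2orc_paper_id.json.gz")` followed by `matched, missing = match_ids(my_specter_ids, mapping)` should give the matched pairs. Any IDs left in `missing` would still have to be resolved another way, for example through the Semantic Scholar API.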