allenai/s2orc

number citation

zjhuang22 opened this issue · 2 comments

Hi, thanks for the great work. Is the citation count of a paper provided in the dataset? That is, for a paper A, the number of times A is cited by other papers.

We provide a notion of inbound/outbound citations to/from papers within the dataset itself (see metadata). These citations are designed to help people identify papers where citation contexts may exist, and are not a very good representation of the "total" number of citations a paper has.

@zjhuang22 This is just several lines of code to compute for every paper:

import json

# Map each paper_id to its number of inbound citations within S2ORC.
paper_id_to_num_citations = {}
with open('full/metadata/metadata_0.jsonl') as f_in:
    for line in f_in:
        metadata_dict = json.loads(line)
        paper_id = metadata_dict['paper_id']
        paper_id_to_num_citations[paper_id] = len(metadata_dict['inbound_citations'])

but like Lucy said, the citation count will probably differ from what you see on sites like Google Scholar or Semantic Scholar. What we give you is the citation count with respect to the subset of papers in our S2ORC collection for which we have citation information, not the true citation count of a paper, which... I'm not sure anyone really knows.

That being said, I just counted a total of 467M+ citation edges in the dataset, so hopefully that's enough coverage for you.
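The snippet above reads a single metadata file. If it helps, here is a rough sketch of extending it to a whole metadata directory and summing the edges. It assumes the files are plain JSONL and follow the `metadata_*.jsonl` naming suggested by `metadata_0.jsonl`; the function name `count_inbound_citations` is made up for this example, and treating a missing or null `inbound_citations` field as zero is also an assumption.

```python
import json
from collections import Counter
from pathlib import Path


def count_inbound_citations(metadata_dir):
    """Return (per-paper inbound counts, total citation edges) over all shards."""
    counts = Counter()
    # Assumed shard naming: metadata_0.jsonl, metadata_1.jsonl, ...
    for shard in sorted(Path(metadata_dir).glob("metadata_*.jsonl")):
        with open(shard, encoding="utf-8") as f_in:
            for line in f_in:
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                # Assumption: a missing or null list means zero inbound citations.
                inbound = record.get("inbound_citations") or []
                counts[record["paper_id"]] = len(inbound)
    # Each inbound citation is one edge, so the edge total is the sum of counts.
    return counts, sum(counts.values())
```

Something like `counts, total_edges = count_inbound_citations('full/metadata')` followed by `counts.most_common(10)` would then list the most-cited papers within the collection. These are still in-dataset counts, not true citation counts.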