About citation matching
Zivenzhu opened this issue · 5 comments
Dear developers,
I am using your unarXive dataset for my project, but I have found it hard to match a paper with the papers that cite it. Specifically, many papers' bib_entries don't contain much information about the cited papers, and most of them only have the 'bib_entry_raw' field. I first constructed a list containing all the papers' titles. Then I looped through all the papers' bib_entries. In each iteration, I scanned the list to see whether a paper's title appears in the bib_entry_raw string of the cited paper. However, some bib_entry_raw strings don't contain the cited paper's title and only have other information, such as the venue or the publication year, which makes it difficult to match papers.
Could you please shed some light on how to match a paper with its citing papers? Your reply is highly appreciated!
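For reference, the title-scanning approach described above might look roughly like the sketch below. It is only an illustration: the helper names `normalize` and `match_by_title` are made up here, and the normalization (lower-casing, dropping punctuation) is an assumption rather than anything the dataset prescribes. It also shows the limitation mentioned above, since a raw entry without a title cannot be matched this way.

```python
import re

def normalize(text):
    """Lower-case and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_by_title(titles, bib_entry_raw):
    """Return the titles that occur as a whole-word phrase in the raw entry."""
    raw = f" {normalize(bib_entry_raw)} "
    matches = []
    for title in titles:
        norm = normalize(title)
        # Skip titles that are empty after normalization.
        if norm and f" {norm} " in raw:
            matches.append(title)
    return matches

# Toy data, not taken from the dataset.
titles = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
]
with_title = "A. Vaswani et al., Attention is all you need. NeurIPS, 2017."
without_title = "K. He et al., CVPR, 2016."

found = match_by_title(titles, with_title)
missed = match_by_title(titles, without_title)
```

Here `found` contains the first title, while `missed` is empty because the raw string only carries authors, venue and year.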
Hi @37david,
our dataset creation process already performs matching of bib_entries.
From the README:
63 M references (28 M linked to OpenAlex)
134 M in-text citation markers (65 M linked)
for details please see our initial and most recent paper.
Thank you for your reply. Actually, I have noticed that some matched papers in the bib_entries only have information like open_alex_id and sem_open_alex_id (please see the image below). However, the URLs in open_alex_id and sem_open_alex_id seem hard to use to match the paper in the dataset, as this information is not contained in each paper's dictionary.
to match the paper in the dataset
Please note that authors may cite papers that are not on arXiv.org.
A matched bib_entry always has an open_alex_id, but the rest of the information is dependent on what information the OpenAlex metadata provides about the paper (e.g., if an arXiv ID of the paper is known/exists or not). For the paper in your screenshot no arXiv ID is given by OpenAlex as you can see here (whole paper record).
For all papers where OpenAlex provides an arXiv ID, it is provided in unarXive.
So we only need to use the arxiv_id to find the paper in the unarXive dataset. That would be great. Thanks again for your kind help.
Correct, simply use the arxiv_id attribute to identify citations to papers “within” the data set (i.e. those citation relations where you have full text on both sides).
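A minimal sketch of that lookup, under a few assumptions that are not confirmed in this thread: each paper is a dictionary with a `paper_id` key holding its arXiv ID, `bib_entries` is a dictionary of entries, and the identifiers of an entry may sit under a nested `ids` key (the helper also checks the entry itself). The function names and the toy records are hypothetical.

```python
from collections import defaultdict

def get_arxiv_id(bib_entry):
    """Return the entry's arxiv_id, or None if it is missing or empty."""
    # Assumption: IDs may be nested under "ids"; otherwise look on the entry.
    ids = bib_entry.get("ids", bib_entry)
    return ids.get("arxiv_id") or None

def build_cited_by(papers):
    """Map each in-dataset cited paper ID to the set of paper IDs citing it."""
    known = {p["paper_id"] for p in papers}
    cited_by = defaultdict(set)
    for paper in papers:
        for entry in paper.get("bib_entries", {}).values():
            arxiv_id = get_arxiv_id(entry)
            # Keep only relations with full text on both sides.
            if arxiv_id in known:
                cited_by[arxiv_id].add(paper["paper_id"])
    return cited_by

# Toy records with made-up IDs, not taken from the dataset.
papers = [
    {
        "paper_id": "2101.00001",
        "bib_entries": {
            "b0": {"bib_entry_raw": "...", "ids": {
                "open_alex_id": "https://openalex.org/W1",
                "arxiv_id": "2101.00002"}},
            "b1": {"bib_entry_raw": "...", "ids": {
                "open_alex_id": "https://openalex.org/W2",
                "arxiv_id": ""}},
        },
    },
    {"paper_id": "2101.00002", "bib_entries": {}},
]

cited_by = build_cited_by(papers)
```

Entries such as `b1`, which are matched to OpenAlex but have no arXiv ID, are skipped, as are entries whose arXiv ID does not belong to a paper in the loaded set.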
No worries.