Replicate for recent data from arXiv and OpenAlex
shubhamagarwal92 opened this issue · 2 comments
shubhamagarwal92 commented
Hi!
Thanks for open-sourcing the code!
I would like to replicate the data for a small dataset of recently published work.
I have a list of recent arXiv IDs. How do I align them with the OpenAlex IDs? I would also like to extract the citation graph and align it with the arXiv IDs to get the parsed full-text content.
Is there an easy way to do this without downloading a huge amount of data (6 TB for arXiv and 300 GB for OpenAlex), for papers published in, say, August 2023?
IllDepence commented
Hi,
- I would assume the arXiv bulk access methods also include the possibility to selectively download only parts of the data, as this is needed for incremental updates of a local source dump.
- For OpenAlex I don’t think there’s a way around getting all of it, because the subset you’d want is all the papers referenced in your arXiv subset, and figuring out that subset is exactly what the reference matching that creates the citation graph does.
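To make the first point concrete, here is a minimal sketch of how one might narrow the arXiv source download to the months covered by a list of IDs. It assumes the bulk source manifest is an XML file with one `<file>` entry per archive, each carrying a `<filename>` and a `<yymm>` element; please check that against the current arXiv bulk data documentation. The helper names `months_for_ids` and `select_chunks`, the sample manifest, and the example IDs are made up for illustration and are not part of unarXive.

```python
import re
import xml.etree.ElementTree as ET

# New-style arXiv identifiers look like YYMM.NNNN or YYMM.NNNNN,
# optionally followed by a version suffix such as "v2".
NEW_STYLE_ID = re.compile(r"^(\d{4})\.\d{4,5}(v\d+)?$")


def months_for_ids(arxiv_ids):
    """Return the set of YYMM prefixes covered by new-style arXiv IDs."""
    months = set()
    for arxiv_id in arxiv_ids:
        match = NEW_STYLE_ID.match(arxiv_id.strip())
        if match:  # old-style IDs such as "hep-th/9901001" are skipped
            months.add(match.group(1))
    return months


def select_chunks(manifest_xml, months):
    """List archive filenames whose <yymm> value is in `months`.

    Assumes each archive is described by a <file> element that has
    <filename> and <yymm> children (an assumption about the manifest).
    """
    root = ET.fromstring(manifest_xml.strip())
    selected = []
    for entry in root.iter("file"):
        if entry.findtext("yymm") in months:
            selected.append(entry.findtext("filename"))
    return selected


# Tiny hand-written manifest for illustration only (not real arXiv data).
SAMPLE_MANIFEST = """
<arXivSRC>
  <file><filename>src/arXiv_src_2307_001.tar</filename><yymm>2307</yymm></file>
  <file><filename>src/arXiv_src_2308_001.tar</filename><yymm>2308</yymm></file>
  <file><filename>src/arXiv_src_2308_002.tar</filename><yymm>2308</yymm></file>
</arXivSRC>
"""

ids = ["2308.01234", "2308.09999v2"]  # placeholder IDs
months = months_for_ids(ids)
chunks = select_chunks(SAMPLE_MANIFEST, months)
print(chunks)
```

Only the archives returned by `select_chunks` would then need to be fetched. This narrows the arXiv side only; it does not shrink the OpenAlex side, for the reason given above.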
If you have that data prepared, the unarXive pipeline should be able to process it just fine. For creating the whole 1991–2022 data set we also processed multiple chunks of the data in parallel.
Hope this helps. :)
shubhamagarwal92 commented
Thanks for your answer! Closing this now.