Possible to publish internal-citations.json.gz?

Question

Closed this issue a year ago · 6 comments

turian commented a year ago

This is a very useful document on its own, and should be relatively small.

Rather than everyone running the entire pipeline to compute this file, could it be shared? Perhaps on Hugging Face?

Answer 1 · 2023-08-08T19:12:31.000Z

This is available in the releases; see internal-references-v0.2.0-2019-03-01.json.gz.
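For anyone picking up the release file, here is a minimal sketch of reading it in Python. It assumes the file is a gzip-compressed JSON object mapping each arXiv ID to a list of the arXiv IDs it cites; that layout is an assumption and is not confirmed in this thread, so check it against the actual file. The function names `load_citations` and `count_incoming` are made up for illustration.

```python
import gzip
import json
from collections import Counter


def load_citations(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def count_incoming(citations):
    """Count how many distinct papers cite each ID.

    Assumes `citations` maps a citing ID to a list of cited IDs.
    """
    counts = Counter()
    for cited in citations.values():
        counts.update(set(cited))  # set(): ignore repeated entries in one paper
    return counts
```

Usage would look like `citations = load_citations("internal-references-v0.2.0-2019-03-01.json.gz")` followed by `count_incoming(citations).most_common(10)`.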

Answer 2 · 2023-08-09T19:48:00.000Z

Thank you. Are you aware of any more recent crawls?

Answer 3 · 2023-08-12T18:58:22.000Z

Unfortunately not. You can re-run it yourself with the PDF dump on Kaggle for free, though the process is quite slow. I think we originally ran it on a 96-core machine and it took half a day or so. Extracting citations from the LaTeX source would be much faster, but you'll have to pony up ~$100 to AWS for egress.

Answer 4 · 2023-10-18T01:35:17.000Z

Is there a script for doing that with LaTeX? I was thinking of just grabbing the cocitations for 23/22. Using EC2 in the Virginia region on AWS should cut down on egress costs by 1/9th. Happy to publish the result on Kaggle.

I'm a bit surprised arXiv doesn't invest more in this, as it would strongly encourage publishing and cociting on arXiv.
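On the co-citation part: once a citation mapping exists, counting co-citations is a small step. The sketch below assumes a dict mapping each citing paper's ID to a list of cited IDs (an assumed layout, not something confirmed in this thread) and treats two papers as co-cited whenever they appear in the same reference list. `cocitation_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of papers is cited together.

    `citations` is assumed to map a citing ID to a list of cited IDs.
    Pairs are stored as sorted tuples, so (a, b) and (b, a) are one key.
    """
    counts = Counter()
    for cited in citations.values():
        # Deduplicate and sort so each unordered pair is generated once.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


# Toy data with placeholder IDs.
toy = {"A": ["X", "Y", "Z"], "B": ["X", "Y"], "C": ["Y"]}
pairs = cocitation_counts(toy)
```

Note that the number of pairs grows quadratically with the length of each reference list, so on a full dump it may be worth restricting to the years of interest first.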

Answer 5 · 2023-11-02T13:23:07.000Z

(... ended up here by mere chance while looking up some references wrt arXiv’s history)

@qrdlgit not sure if it is exactly what you’re looking for regarding the focus on co-citations, but I have a project here that converts the LaTeX sources of arXiv to structured document representations plus a citation network. For a constrained time frame (e.g. just a year or a month) the document conversion should be straightforward, but generating the citation network relies on a local dump of part of OpenAlex, which requires some space and a few steps to set up.

Answer 6 · 2023-11-28T07:03:51.000Z

> Is there a script for doing that with LaTeX? I was thinking of just grabbing the cocitations for 23/22. Using EC2 in the Virginia region on AWS should cut down on egress costs by 1/9th. Happy to publish the result on Kaggle.
>
> I'm a bit surprised arXiv doesn't invest more in this, as it would strongly encourage publishing and cociting on arXiv.

All of the scripts we used to generate the data presented are available in this repo. The download script can fetch the LaTeX sources if you switch one parameter, but we did not create tooling for parsing the bib files, though this should be as easy as looking for arXiv IDs. If you would like to contribute an updated version, that would be awesome.
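As a rough illustration of "looking for arXiv IDs", here is a minimal sketch that scans bibliography text for them. It is not the repo's tooling. It only accepts IDs that follow an `arXiv:` prefix, an `arxiv.org/abs/` or `arxiv.org/pdf/` URL, or an `eprint = {` field, which is an assumption made to limit false positives; bare IDs without such context are skipped. `extract_arxiv_ids` is a hypothetical name.

```python
import re

# New-style IDs (e.g. 1706.03762) or old-style IDs (e.g. hep-th/9901001),
# with an optional version suffix that is dropped from the captured group.
_ID = r"(\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?(?!\d)"

# Contexts in which an ID is accepted (assumption: prefix, URL, or eprint field).
_CONTEXT = r'(?:arxiv\s*:\s*|arxiv\.org/(?:abs|pdf)/|eprint\s*=\s*[{"]\s*)'

ARXIV_ID = re.compile(_CONTEXT + _ID, re.IGNORECASE)


def extract_arxiv_ids(text):
    """Return the arXiv IDs found in `text`, deduplicated, in order of appearance."""
    return list(dict.fromkeys(m.group(1) for m in ARXIV_ID.finditer(text)))
```

Running this over the `.bib` or `.bbl` content of each paper's source would give a cited-ID list per paper. References that cite an arXiv paper only by its journal DOI or title would be missed, so the result is a lower bound on internal citations.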