Downloading PDFs in JSON/XML format

Question


arghyadeep99 opened this issue 3 years ago · 5 comments

Is there any way to download the PDFs in JSON/XML format, so that the document structure is preserved for machine learning tasks where separating the sections may be necessary? Looking forward to some solutions. Thank you!

Answer 1 · 2021-06-29T17:42:59.000Z

There is pdftohtml, which you can install on Ubuntu and which can convert PDFs into XML or HTML. You can probably just substitute this tool in our pipeline wherever pdf2text is used. There is also pdfalto. No single tool will work for all documents; there will always be tradeoffs. In our pipeline we use a few PDF-to-text tools because, for example, one of them sometimes fails or hangs.
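The "try several tools, because one may fail or hang" approach can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names (`pdftohtml_xml_command`, `pdftotext_command`, `first_successful`) and the 60-second default timeout are assumptions made for the example. It assumes the poppler `pdftohtml` and `pdftotext` binaries are on the PATH, and that `pdftohtml -xml` writes its result next to the given output base name.

```python
import subprocess


def pdftohtml_xml_command(pdf_path, out_base):
    """Command line for poppler's pdftohtml in XML mode (hypothetical helper)."""
    return ["pdftohtml", "-xml", pdf_path, out_base]


def pdftotext_command(pdf_path, txt_path):
    """Command line for poppler's pdftotext, used here as a plain-text fallback."""
    return ["pdftotext", pdf_path, txt_path]


def first_successful(commands, timeout=60):
    """Run each command in order and return the index of the first that succeeds.

    A command counts as failed if the executable cannot be started, if it
    exits with a non-zero status, or if it runs longer than `timeout`
    seconds (subprocess.run kills the child in that case).
    Returns None when every command fails.
    """
    for index, command in enumerate(commands):
        try:
            result = subprocess.run(command, capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Tool missing, not executable, or hung: move on to the next one.
            continue
        if result.returncode == 0:
            return index
    return None
```

For a single paper, one might call `first_successful([pdftohtml_xml_command("paper.pdf", "paper"), pdftotext_command("paper.pdf", "paper.txt")])` and check which index comes back to know whether XML or only plain text was produced.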

Answer 2 · 2021-08-05T23:39:59.000Z

Hi @arghyadeep99, please advise whether your issue is resolved by my suggestion. If so, feel free to close this issue.

Answer 3 · 2021-08-06T06:10:43.000Z

Yes, thanks. I will close the issue.

Answer 4 · 2021-11-13T19:22:37.000Z

Hello! For arXiv articles, GROBID should be used rather than pdfalto. pdfalto is only a layout-level preprocessing step without information loss; it does not structure the document (it is comparable to PDF.js or pdftohtml).
GROBID will structure the document and generate an XML version for text mining, for building a citation network, for matching affiliations, etc.
As mentioned, GROBID involves tradeoffs too, but it works quite well with arXiv papers when the goal is text mining. You can test the online demo.
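To illustrate what "structured XML" buys you for section separation, here is a small sketch that pulls section titles and paragraphs out of a TEI document of the kind GROBID produces. Assumptions: the XML uses the TEI namespace and places sections as `<div>` elements with a `<head>` and `<p>` children under `<text>/<body>`; the function name `tei_sections` and the sample document are made up for the example and are not real GROBID output.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_sections(tei_xml):
    """Return a list of (section title, [paragraph texts]) from TEI XML.

    Only top-level <div> elements under <text>/<body> are considered;
    inline markup such as <ref> is flattened to plain text.
    """
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        paragraphs = [
            " ".join("".join(p.itertext()).split())  # normalize whitespace
            for p in div.iterfind("tei:p", TEI_NS)
        ]
        sections.append((title, paragraphs))
    return sections


# Hand-written sample for illustration only (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text>
    <body>
      <div><head n="1">Introduction</head>
        <p>First paragraph.</p>
        <p>Second <ref type="bibr">[1]</ref> paragraph.</p>
      </div>
      <div><head n="2">Method</head><p>We do things.</p></div>
    </body>
  </text>
</TEI>"""


def sections_as_json(tei_xml):
    """Serialize the extracted sections as JSON, one object per section."""
    return json.dumps(
        [{"title": t, "paragraphs": ps} for t, ps in tei_sections(tei_xml)]
    )
```

Calling `sections_as_json(SAMPLE_TEI)` gives one JSON object per section, which is close to the JSON/XML form asked for in the original question.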

Answer 5 · 2022-02-09T21:56:48.000Z

Thanks for the suggestion, @kermitt2. Please feel free to submit a PR enabling the use of GROBID in the pipeline.