Downloading PDFs in JSON/XML format
arghyadeep99 opened this issue · 5 comments
Is there any way to download the PDFs in JSON/XML format, so that the document structure is preserved for machine learning tasks where separating the sections may be necessary? Looking forward to some solutions. Thank you!
There is pdftohtml, which you can install on Ubuntu and which can convert PDFs into XML or HTML. You can probably just substitute this tool in our pipeline wherever pdf2text is used. There is also pdfalto. No single tool will work for all documents; there will always be tradeoffs. In our pipeline we use a few PDF-to-text tools because, for example, one sometimes fails or hangs.
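As a rough illustration of the pdftohtml suggestion, here is a minimal Python sketch. It assumes the poppler-utils build of pdftohtml is on the PATH and that its `-xml`, `-i`, `-q` and `-stdout` options behave as in recent poppler releases. The function names (`pdftohtml_command`, `pdf_to_xml`, `text_lines_by_page`) are hypothetical and are not part of the pipeline discussed here. The timeout reflects the point above that a converter can fail or hang.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdftohtml_command(pdf_path):
    # Assumed poppler-utils flags: XML output, ignore images, quiet, write to stdout.
    return ["pdftohtml", "-xml", "-i", "-q", "-stdout", pdf_path]


def pdf_to_xml(pdf_path, timeout=60):
    """Return pdftohtml's XML output as bytes, or None if the tool fails or hangs."""
    try:
        result = subprocess.run(
            pdftohtml_command(pdf_path),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        # The caller can fall back to another converter here.
        return None
    return result.stdout


def text_lines_by_page(xml_data):
    """Group the text of every <text> element under its <page> number."""
    root = ET.fromstring(xml_data)
    pages = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        # itertext() also collects text nested in inline tags such as <b> or <i>.
        pages[number] = ["".join(t.itertext()).strip() for t in page.iter("text")]
    return pages
```

Note that this output is layout-level only: each `<text>` element carries position and font attributes, so recovering sections from it would still need heuristics on top.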
Hi @arghyadeep99, please advise whether your issue is resolved by my suggestion. If so, feel free to close this issue.
Yes, thanks. I will close the issue.
Hello! For arXiv articles, GROBID should be used rather than pdfalto. pdfalto is only a preprocessing step at the layout level, without information loss; it does not structure the document (it's like PDF.js or pdftohtml).
GROBID will structure the document and generate an XML version for text mining, for building a citation network, for matching affiliation, etc.
As mentioned, GROBID is a tradeoff, but it works quite well with arXiv papers if the goal is text mining. You can test the online demo.
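To make the section-separation use case concrete, here is a minimal Python sketch that turns GROBID's TEI XML into a JSON list of sections. It assumes the TEI was already obtained from a GROBID server (for example via its `/api/processFulltextDocument` service) and that body sections appear as `<div>` elements with an optional `<head>` and `<p>` children in the TEI namespace. The function names `sections_from_tei` and `sections_to_json` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def sections_from_tei(tei_xml):
    """Return a list of {"title", "paragraphs"} dicts, one per body <div>."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        # itertext() keeps the text of inline elements such as citation <ref> tags.
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append({"title": title, "paragraphs": paragraphs})
    return sections


def sections_to_json(tei_xml):
    """Serialize the extracted sections as a JSON string."""
    return json.dumps(sections_from_tei(tei_xml), ensure_ascii=False, indent=2)
```

This keeps only section titles and paragraph text; the TEI also holds the header, references and affiliations, which would need their own extraction.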
Thanks for the suggestion, @kermitt2. Please feel free to submit a PR enabling the use of GROBID in the pipeline.