jacksongoode/NIME-proceedings-analyzer

Integrate download links from 2021 onwards (PubPub)

Closed this issue · 3 comments

Now that NIME has moved to PubPub, we need to parse the source of the PubPub URLs to find the documents for the papers. This may also be an opportunity: PubPub already provides XML files, so it might be possible to skip Grobid for these new papers.
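As a rough sketch of the link-finding step, the snippet below collects every `<a href>` on a page whose path ends in `.pdf` or `.xml`. It uses only the Python standard library. The names `DownloadLinkParser` and `find_download_links` are made up for illustration, and the assumption that PubPub exposes its downloads as plain anchors with those extensions has not been checked against real PubPub markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class DownloadLinkParser(HTMLParser):
    """Collect <a href> targets whose URL path ends in a wanted extension."""

    def __init__(self, base_url, extensions=(".pdf", ".xml")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions  # tuple, so it can be passed to str.endswith
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings such as "?dl=1" are ignored.
        if urlparse(absolute).path.lower().endswith(self.extensions):
            self.links.append(absolute)


def find_download_links(html, base_url):
    """Return absolute URLs of candidate PDF/XML downloads found in html."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

If the real pages load their download links through JavaScript or use URLs without a file extension, the matching rule in `handle_starttag` would need to change.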

XML files provided by PubPub are likely to provide more accurate data with fewer errors. However, the proceedings-analyzer code may have to be updated or fixed every time PubPub changes something in its XML files, which may happen frequently since PubPub is still a fairly new platform. Sticking to PDF files may keep the proceedings-analyzer compatible for longer.

Good point. One oddity is that the PDFs generated by PubPub are not always well formed. In fact, our paper 20 NIMES is malformed, and the PDF parser from pdfminer (a good one) isn't able to parse it. I'm thinking that it might be possible to repair the PDF with a library like pikepdf (which uses qpdf).
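A minimal sketch of that idea, assuming pikepdf and pdfminer.six are installed: try to extract text first, and only if that fails rewrite the file through pikepdf and retry on the rewritten copy. The function names here are hypothetical and not part of the analyzer. Whether a pikepdf rewrite actually fixes the malformed PubPub PDFs is untested.

```python
import os
import tempfile


def repair_pdf(src_path, dst_path):
    """Rewrite a PDF through pikepdf (qpdf), which may recover a damaged file."""
    import pikepdf  # third-party; imported lazily so this module loads without it

    with pikepdf.open(src_path) as pdf:
        pdf.save(dst_path)


def extract_text_pdfminer(path):
    """Extract text with pdfminer.six's high-level API."""
    from pdfminer.high_level import extract_text  # third-party, lazy import

    return extract_text(path)


def extract_with_repair(path, extract=extract_text_pdfminer, repair=repair_pdf):
    """Try extract(path); on any failure, repair into a temp file and retry once."""
    try:
        return extract(path)
    except Exception:
        # Broad catch on purpose for this sketch: the exact exception pdfminer
        # raises for a given malformed file is not assumed here.
        fd, fixed_path = tempfile.mkstemp(suffix=".pdf")
        os.close(fd)
        try:
            repair(path, fixed_path)
            return extract(fixed_path)
        finally:
            os.remove(fixed_path)  # always clean up the temporary copy
```

Passing `extract` and `repair` as parameters keeps the fallback logic independent of either library, so a different parser or repair tool could be swapped in later.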

pikepdf can be a good (temporary) workaround.