edgi-govdata-archiving/web-monitoring-processing

Use a more robust PDF library

Mr0grog opened this issue · 4 comments

We currently read PDF titles with PyPDF2, which hasn’t been updated in two years and hasn’t had a release in four!

While it’s the most broadly used pure Python library for reading PDFs, it has some shortcomings, including the inability to decrypt a lot of PDFs: it only supports a subset of PDF’s standard encryption schemes. Many of the PDFs we monitor are affected (e.g. http://web.archive.org/web/20201102132143id_/https://nca2014.globalchange.gov/system/files_force/downloads/low/NCA3_Full_Report_11_Urban_Systems_and_Infrastructure_LowRes.pdf?download=1).
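To see which encryption scheme a given PDF uses, a rough standard-library sketch like the one below may help. It is a heuristic, not a real parser: it assumes the trailer points at the encryption dictionary through an indirect reference (`/Encrypt N G R`) and that the dictionary is readable in the raw bytes. The function name `find_encryption_params` is made up for illustration. In the PDF specification, `/V` is the encryption algorithm version and `/R` is the standard security handler revision.

```python
import re
from typing import Dict, Optional


def find_encryption_params(pdf_bytes: bytes) -> Optional[Dict[str, int]]:
    """Return the /V and /R values of the encryption dictionary, if found.

    Hypothetical helper: a regex heuristic over the raw bytes, so it can
    miss unusual files (e.g. an inline /Encrypt dictionary).
    """
    # Look for an indirect reference such as "/Encrypt 7 0 R".
    ref = re.search(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R", pdf_bytes)
    if ref is None:
        return None  # Not encrypted, or not referenced the way we assume.

    # Find the body of the referenced object: "7 0 obj ... endobj".
    obj = re.search(
        rb"(?<!\d)" + ref.group(1) + rb"\s+" + ref.group(2) + rb"\s+obj(.*?)endobj",
        pdf_bytes,
        re.DOTALL,
    )
    if obj is None:
        return None

    # Pull out the integer values of the /V and /R keys, when present.
    params: Dict[str, int] = {}
    for key in (b"V", b"R"):
        match = re.search(rb"/" + key + rb"\s+(\d+)", obj.group(1))
        if match:
            params[key.decode()] = int(match.group(1))
    return params
```

Running this over the PDFs that fail to decrypt would show whether they share a particular `/V` and `/R` combination, which is a useful thing to check a candidate library's documentation against.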

We should look into some other tool for reading PDF data. What are the good options out there, and what are their pros/cons?


One I found in a quick search was PyMuPDF, which is a Python wrapper for MuPDF. Both appear to be actively maintained. It seems to work with the above file:

>>> import fitz  # The import is named fitz because reasons.
>>> # `pdf_bytes` holds the raw bytes of the PDF linked above.
>>> pdf = fitz.Document(stream=pdf_bytes, filetype='application/pdf')
>>> pdf.metadata['title']
'Climate Change Impacts in the United States'

There is also the much lower-level pdfminer (have not tested), and there are probably also wrappers for command-line tools like xpdf (C++) and pdftk (Java-based), although it would be nice not to require Java.

Some more (untested):

  • pdfrw (README says it’s incomplete with decryption & decompression, but worth checking out.)
  • pdfreader
  • pyxpdf (a wrapper around xpdf)
  • pdf4py (doesn’t have much history, so I don’t have high expectations.)
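One way to compare candidates like these is a small harness that runs each library's title reader over the same PDF bytes and records whether it succeeded. The sketch below is only an illustration: `compare_readers`, `read_title_pymupdf`, and `read_title_pypdf2` are hypothetical names, the PyPDF2 reader assumes PyPDF2 2.0 or later (`PdfReader`, `is_encrypted`, `decrypt`, `metadata`), and both libraries are imported lazily so a missing one just shows up as an error in the results.

```python
import io
from typing import Callable, Dict, Optional, Tuple

# A reader takes raw PDF bytes and returns the title (or None if unset).
TitleReader = Callable[[bytes], Optional[str]]


def read_title_pymupdf(pdf_bytes: bytes) -> Optional[str]:
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf:
        return (pdf.metadata or {}).get("title") or None


def read_title_pypdf2(pdf_bytes: bytes) -> Optional[str]:
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(io.BytesIO(pdf_bytes))
    if reader.is_encrypted:
        reader.decrypt("")  # Try the empty user password.
    info = reader.metadata
    return info.title if info else None


def compare_readers(
    readers: Dict[str, TitleReader], pdf_bytes: bytes
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Run every reader on the same bytes; map name -> (status, detail)."""
    results: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, reader in readers.items():
        try:
            results[name] = ("ok", reader(pdf_bytes))
        except Exception as error:  # Any failure counts against the library.
            results[name] = ("error", f"{type(error).__name__}: {error}")
    return results
```

Calling `compare_readers({"pymupdf": read_title_pymupdf, "pypdf2": read_title_pypdf2}, pdf_bytes)` on each problematic PDF would give a quick pass/fail table per library; readers for the other candidates could be added to the dictionary the same way.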

I was reading this Internet Archive blog post today (https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/), which describes MuPDF and PyMuPDF quite positively:

“The PDFs themselves are created using the high-performance mupdf and pymupdf python library: both projects were supportive and promptly fixed various bugs, which propelled our efforts forwards.”

Since this issue was written, PyPDF2 has gained new maintainers, has seen huge improvements, and seems to successfully decrypt all the problematic PDFs that I was aware of. It’s probably not worth switching anymore.