Use a more robust PDF library

Question

Mr0grog opened this issue 4 years ago · 4 comments

We currently read PDF titles with PyPDF2, which hasn’t been updated in two years and hasn’t had a release in four!

While it’s the most broadly used pure Python library for reading PDFs, it has some shortcomings, including the inability to decrypt a lot of PDFs — it only supports a subset of PDF’s standard encryption schemes. That includes a lot of the PDFs we monitor (e.g. http://web.archive.org/web/20201102132143id_/https://nca2014.globalchange.gov/system/files_force/downloads/low/NCA3_Full_Report_11_Urban_Systems_and_Infrastructure_LowRes.pdf?download=1).

We should look into some other tool for reading PDF data. What are the good options out there, and what are their pros/cons?

One I found in a quick search was PyMuPDF, which is a Python wrapper for MuPDF. Both appear to be actively maintained. It seems to work with the above file:

>>> import fitz  # The import is named fitz because reasons.
>>> pdf = fitz.Document(stream=pdf_bytes, filetype='application/pdf')
>>> pdf.metadata['title']
'Climate Change Impacts in the United States'
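
For anyone wanting to try this outside a REPL, here is a minimal sketch of a title reader built on the same PyMuPDF calls. The helper names `clean_title` and `read_pdf_title` are made up for illustration and are not part of this project or of PyMuPDF. It assumes a missing or blank title should come back as `None`, and that a PDF which still needs a password should be treated as having no readable title.

```python
from typing import Mapping, Optional


def clean_title(metadata: Optional[Mapping[str, object]]) -> Optional[str]:
    """Return a stripped title from a metadata mapping, or None if absent.

    Hypothetical helper: treats blank or non-string titles as missing.
    """
    if not metadata:
        return None
    title = metadata.get("title")
    if not isinstance(title, str):
        return None
    title = title.strip()
    return title or None


def read_pdf_title(pdf_bytes: bytes) -> Optional[str]:
    """Read a PDF's title with PyMuPDF (hypothetical wrapper)."""
    # Imported lazily so clean_title() can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        # Assumption: if a real password is required, give up on the title.
        if doc.needs_pass:
            return None
        return clean_title(doc.metadata)
```

Keeping the normalization separate from the library call means the same cleanup could be reused if we end up on a different library.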

Answer 1 · 2020-11-05T01:20:28.000Z

There is also the much lower-level pdfminer (I have not tested it), and there are probably also wrappers for Java tools like xpdf and pdftk, although it would be nice not to require Java.

Answer 2 · 2020-11-05T01:41:11.000Z

Some more (untested):

pdfrw (README says its decryption and decompression support is incomplete, but it’s worth checking out.)
pdfreader
pyxpdf (a wrapper around xpdf)
pdf4py (doesn’t have much history, so I don’t have high expectations.)
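
Since none of these are tested yet, one way to compare them would be a small harness that runs each candidate against the PDFs we know are problematic. The sketch below is only an illustration: `compare_extractors` is a hypothetical name, and each "extractor" is assumed to be a function we would write per library that takes PDF bytes and returns a title or `None`.

```python
from typing import Callable, Dict, Mapping, Optional

# Assumed shape of a per-library adapter: PDF bytes in, title (or None) out.
Extractor = Callable[[bytes], Optional[str]]


def compare_extractors(
    extractors: Mapping[str, Extractor],
    samples: Mapping[str, bytes],
) -> Dict[str, Dict[str, str]]:
    """Run every extractor on every sample and record what happened.

    Returns {library_name: {sample_name: outcome}}, where outcome is
    "ok: <title>", "no title", or "error: <ExceptionName>".
    """
    results: Dict[str, Dict[str, str]] = {}
    for library_name, extract in extractors.items():
        results[library_name] = {}
        for sample_name, pdf_bytes in samples.items():
            try:
                title = extract(pdf_bytes)
                outcome = f"ok: {title!r}" if title else "no title"
            except Exception as error:
                # Libraries fail in different ways (unsupported encryption,
                # parse errors), so only the exception type is recorded.
                outcome = f"error: {type(error).__name__}"
            results[library_name][sample_name] = outcome
    return results
```

Feeding it the encrypted PDFs that currently fail would give a quick pass/fail table per library before we look at other pros and cons.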

Answer 3 · 2020-11-30T06:09:31.000Z

Was reading this Internet Archive blog post today: https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/ which describes MuPDF and PyMuPDF quite positively:

“The PDFs themselves are created using the high-performance mupdf and pymupdf python library: both projects were supportive and promptly fixed various bugs, which propelled our efforts forwards.”

Answer 4 · 2023-01-02T20:34:53.000Z

Since this issue was written, PyPDF2 has gained new maintainers, has seen huge improvements, and seems to successfully decrypt all of the problematic PDFs I was aware of. It’s probably not worth switching anymore.
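
For reference, a rough sketch of what reading a title from an encrypted PDF looks like with a recent PyPDF2. This is not the project's actual code. It assumes PyPDF2 2.x or later (where the class is `PdfReader`) and that the problematic PDFs use an empty user password, which I have not confirmed for every file. The function names are hypothetical. As far as I know, AES-encrypted files also need PyCryptodome installed.

```python
import io
from typing import Optional


def title_from_reader(reader, password: str = "") -> Optional[str]:
    """Get the title from a PdfReader-like object (hypothetical helper).

    Only relies on `is_encrypted`, `decrypt()`, and `metadata.title`.
    """
    # decrypt() returns a falsy value when the password does not work.
    if reader.is_encrypted and not reader.decrypt(password):
        return None
    info = reader.metadata  # May be None if the PDF has no info dictionary.
    return info.title if info is not None else None


def read_title_pypdf2(pdf_bytes: bytes) -> Optional[str]:
    """Read a PDF's title with PyPDF2 (hypothetical wrapper)."""
    # Imported lazily so title_from_reader() works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return title_from_reader(PdfReader(io.BytesIO(pdf_bytes)))
```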