pymupdf/PyMuPDF

pymupdf.open() processes .zip file without raising

Closed this issue · 7 comments

Description of the bug

Attempts to open a .zip file using pymupdf.open() succeed, leading to unexpected results.

How to reproduce the bug

To reproduce, use this code:

import pymupdf
import tempfile

zipfile_content = b'PK\x03\x04\n\x00\x00\x00\x00\x00\x19U0[\xf40\x8b&\x1b\x00\x00\x00\x1b\x00\x00\x00\x08\x00\x1c\x00textfileUT\t\x00\x03\x92"\xc9h\x94"\xc9hux\x0b\x00\x01\x04\xf5\x01\x00\x00\x04\x14\x00\x00\x00This is a plain text file.\nPK\x01\x02\x1e\x03\n\x00\x00\x00\x00\x00\x19U0[\xf40\x8b&\x1b\x00\x00\x00\x1b\x00\x00\x00\x08\x00\x18\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa4\x81\x00\x00\x00\x00textfileUT\x05\x00\x03\x92"\xc9hux\x0b\x00\x01\x04\xf5\x01\x00\x00\x04\x14\x00\x00\x00PK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00N\x00\x00\x00]\x00\x00\x00\x00\x00'

tmpfile = tempfile.NamedTemporaryFile(suffix='.zip', delete=True)
with open(tmpfile.name, 'wb') as f:
    f.write(zipfile_content)

with pymupdf.open(tmpfile.name) as doc:
    print(f"doc.page_count={doc.page_count}")

This (very short!) .zip file contains one plain text file.

The code executes cleanly and prints "doc.page_count=0".

Expectation: PyMuPDF would recognize that the file content is not a PDF and raise.

PyMuPDF does fail in this example when called as pymupdf.open(tmpfile.name, filetype='pdf'). But:

(1) We'd expect it to fail even without specifying filetype, I'd hope...?
(2) With longer .zip files it succeeds even with specifying filetype='pdf', indicating (in the instance I tried) that there were 120 PDF pages in the .zip file. (And to be clear, there was no pdf content in that zip file. I can share if needed, but expect the behavior documented here to be problematic enough to merit a fix).

PyMuPDF version

1.26.4

Operating system

macOS

Python version

3.13

> (1) We'd expect it to fail even without specifying filetype, I'd hope...?

No: PyMuPDF can deal with a dozen different file formats and recognizes many of them by inspecting the content. Many supported formats are ZIP-based (XPS, EPUB, etc.) and are thus recognized and opened.

> (2) With longer .zip files it succeeds even with specifying filetype='pdf', ...

This sounds weird. Would need the example.

If you need to assert that something is indeed a PDF, you can check doc.is_pdf after opening it.
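A minimal sketch of that check, for anyone who wants open-or-fail behavior. The helper names require_pdf and open_pdf_strict are hypothetical and not part of PyMuPDF; only pymupdf.open(), doc.is_pdf and doc.close() are library calls.

```python
def require_pdf(doc):
    """Return doc unchanged if it is a PDF; otherwise close it and raise.

    Hypothetical helper: it only relies on the is_pdf attribute and the
    close() method of the opened document.
    """
    if not doc.is_pdf:
        doc.close()
        raise ValueError("file was opened, but not as a PDF")
    return doc


def open_pdf_strict(path):
    """Open path with PyMuPDF and insist that the result is a PDF."""
    import pymupdf  # imported here so require_pdf() stays usable on its own

    return require_pdf(pymupdf.open(path))
```

Under these assumptions, open_pdf_strict(fp_to_file) would raise ValueError for any file that MuPDF opens as something other than a PDF, instead of silently returning a Document.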

Thanks for the quick reaction. Understood re the other ZIP-based file formats.

This time I've attached a large .zip file. It happens to be the output of Adobe's PDF Extract API: a json file with structured content extract, plus two directories with .png and .csv files for images and tables. (Just sitting around from experiments with a too-expensive API... ;-) ).

NI-000092336-adobe_pdf_extract.zip

My code to reproduce:

from pathlib import Path
import pymupdf

fp_to_file = Path(...)

with pymupdf.open(fp_to_file, filetype='pdf') as doc:
    print(f"doc.page_count={doc.page_count}")

The output I get:

doc.page_count=120

My guess here is MuPDF is thinking this is one of the handled zip-based file formats, and looks for things it thinks it should be able to handle? But it finds nothing.

for page in doc:
    assert page.get_text('blocks') == []

Indeed, the file is recognized as not a PDF (doc.is_pdf returns False), but I did pass filetype='pdf' as an argument to pymupdf.open(). With an explicit filetype, wouldn't you want the open to fail? That is what happened with the tiny example file earlier.

> But it finds nothing.

Oh, it finds a lot, just no text. It does find images, which show up if you use page.get_text("blocks", flags=pymupdf.TEXT_PRESERVE_IMAGES).

But let me check why we don't reject filetype="pdf".

It definitely contains no PDF files whatsoever. It is not an official or public file format, but possibly something produced by the Adobe extractor, which we support in an undocumented way. The internal structure of the ZIP is:

  • folder "figures"
  • folder "tables"
  • file "structuredData.json"

Here is an update:

  • Many supported Document types are internally ZIP archives. This includes Office files, XPS, FB2, CBZ and some more.
  • CBZ is a very simple format, basically a collection of images with one image per page. Other files may be present, but are ignored. CBZ is not formally standardized: any ZIP archive containing images has a good chance of being accepted as CBZ by many or most viewers.
  • During open, MuPDF inspects the file content to see if it matches a known type. This happens by passing the content to an internal list of "document handlers": the first one shouting "I can handle it" (together with a high enough score in case of competing handlers) is accepted.
  • A ZIP archive is not natively a supported Document type in MuPDF. However, if it contains at least one image file, it will be treated like a CBZ, see above. Otherwise, if no document handler can recognize it, the file is technically opened without an assigned handler. Typical Document attributes then stay unused, for example page_count is 0.
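
To predict which of the two ZIP cases above applies to a given archive, one can list its image members before opening it. This is a rough sketch using only the standard library: the helper name zip_image_members and the suffix list IMAGE_SUFFIXES are assumptions for illustration, and MuPDF's own set of recognized image types may differ.

```python
import zipfile
from pathlib import PurePosixPath

# Hypothetical suffix list for illustration only; MuPDF decides for itself
# which archive members count as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff"}


def zip_image_members(path):
    """Return the names of archive members whose suffix looks like an image."""
    with zipfile.ZipFile(path) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
        ]
```

A non-empty result suggests the archive may be opened like a CBZ with pages built from images; an empty result corresponds to the page_count=0 case from the first example.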

Primarily, we have a documentation issue here:
If the file content has been accepted by one of the document handlers, any filetype specification is ignored. It will never be used to assert that the document format is as expected. Its only purpose in life is helping to find the right document handler for unclear file content cases.
The filetype argument was important before MuPDF introduced file content inspection. Its residual use today is mostly restricted to imposing a desired format on text-type files ("txt", "html", "xml", etc.), which are notoriously hard to detect via content inspection.

Fixed in PyMuPDF-1.26.5.