pymupdf/PyMuPDF

page.cluster_drawings extracts a lot of small clusters after upgrading to 1.26


Description of the bug

I have a function that extracts the clustered drawings from a PDF.
This function takes much longer after upgrading to 1.26.0 (and 1.26.3).

Here is a simplified version of the function that isolates the problem:

    import fitz  # PyMuPDF

    def _get_clustered_drawings(page: fitz.Page) -> None:
        # Print the bounding Rect of every drawing cluster found on the page.
        for clip in page.cluster_drawings():
            print(clip)

How to reproduce the bug

This is a simple PDF exported from Perplexity (with images)
what is the tallest mountain on earth.pdf

When getting the clustered drawings with pymupdf<1.26.0, I get 3 clustered drawings and the speed feels 'normal'.

Rect(75.75, 222.75, 79.5, 226.5)
Rect(75.75, 243.75, 79.5, 247.5)
Rect(75.75, 296.25, 79.5, 300.0)

With version >= 1.26.0, I get this long list of clusters, and the call takes significantly longer.
The problem gets worse for longer PDFs with more images.

Rect(68.42168426513672, 104.17657470703125, 114.84117889404297, 118.68190002441406)
Rect(140.42724609375, 104.17657470703125, 170.42828369140625, 118.72410583496094)
Rect(239.57530212402344, 103.84222412109375, 326.4324951171875, 118.72410583496094)
Rect(332.87890625, 107.958251953125, 354.9324951171875, 118.72410583496094)
Rect(361.37890625, 104.17657470703125, 409.683837890625, 118.72410583496094)
Rect(120.88815307617188, 103.84222412109375, 135.1599884033203, 118.734619140625)
Rect(175.67724609375, 104.17657470703125, 233.34117126464844, 118.734619140625)
...
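
To compare the two versions more objectively than by feel, the cluster count and the elapsed time can be recorded per page. The following is a minimal sketch, not part of the original function: `profile_clustering` and `profile_pdf` are hypothetical helper names, and PyMuPDF is imported lazily so the timing helper works on any callable.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_clustering(cluster_fn: Callable[[], Iterable]) -> Tuple[int, float]:
    """Call cluster_fn once; return (number of clusters, elapsed seconds)."""
    start = time.perf_counter()
    clusters = list(cluster_fn())
    elapsed = time.perf_counter() - start
    return len(clusters), elapsed


def profile_pdf(path: str) -> None:
    """Print cluster count and timing for every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    print("PyMuPDF version:", fitz.VersionBind)
    with fitz.open(path) as doc:
        for page in doc:
            count, seconds = profile_clustering(page.cluster_drawings)
            print(f"page {page.number}: {count} clusters in {seconds:.3f}s")
```

Running `profile_pdf("what is the tallest mountain on earth.pdf")` once under each installed version should make both the cluster count and the slowdown directly comparable.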

PyMuPDF version

1.26.3

Operating system

MacOS

Python version

3.12

Confirming your observation.
This only happens when there is text written with a Type 3 font. In this case, the vector graphics representing the Type 3 characters are being included in the .get_drawings() extraction. For example, in the following picture the red rectangle is the vector graphic and the blue rectangle is the character bbox:

Image
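
Until this is resolved, a possible stopgap is to check whether a page uses a Type 3 font and, if so, discard drawings that lie within a character bbox before clustering. This is only a sketch under assumptions: the helper names are hypothetical, the font type string is assumed to be reported as "Type3" in the third field of each `page.get_fonts()` entry, and the glyph vectors are assumed to fall (within a tolerance) inside the character bboxes. The containment test is a heuristic and may also drop genuine small drawings that sit inside text.

```python
from typing import Iterable, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def has_type3_font(font_entries: Iterable[Sequence]) -> bool:
    """True if any page.get_fonts() entry reports font type 'Type3' (assumed spelling)."""
    return any(entry[2] == "Type3" for entry in font_entries)


def inside(inner: BBox, outer: BBox, tol: float = 1.0) -> bool:
    """True if `inner` lies within `outer`, allowing `tol` points of slack."""
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def drop_glyph_like(drawing_boxes: Iterable[BBox],
                    char_boxes: Sequence[BBox],
                    tol: float = 1.0) -> List[BBox]:
    """Keep only the drawing boxes that are not contained in any character bbox."""
    return [d for d in drawing_boxes
            if not any(inside(d, c, tol) for c in char_boxes)]


def cluster_without_glyphs(page, tol: float = 1.0):
    """Cluster the drawings of a PyMuPDF page, skipping glyph-like vectors."""
    drawings = page.get_drawings()
    if not has_type3_font(page.get_fonts()):
        return page.cluster_drawings(drawings=drawings)
    # Collect per-character bboxes from the raw text dictionary.
    char_boxes = [
        tuple(char["bbox"])
        for block in page.get_text("rawdict")["blocks"]
        for line in block.get("lines", [])  # image blocks have no "lines"
        for span in line["spans"]
        for char in span["chars"]
    ]
    kept = [d for d in drawings
            if not any(inside(tuple(d["rect"]), c, tol) for c in char_boxes)]
    return page.cluster_drawings(drawings=kept)
```

The pure-Python helpers take plain tuples so they can be checked without a PDF; only `cluster_without_glyphs` touches the page object.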

We are currently investigating with the MuPDF team ...