page.cluster_drawings extracts many small clusters after upgrading to 1.26
Description of the bug
I have a function that extracts the clustered drawings from a PDF.
This function takes much longer after upgrading to 1.26.0 (and 1.26.3).
Here is a simplified version of the function that isolates the problem:
def _get_clustered_drawings(
    self, page: fitz.Page
) -> List[ImageType]:
    for clip in page.cluster_drawings():
        print(clip)
How to reproduce the bug
This is a simple PDF exported from Perplexity (with images)
what is the tallest mountain on earth.pdf
With pymupdf<1.26.0, I get 3 clustered drawings and the speed feels 'normal':
Rect(75.75, 222.75, 79.5, 226.5)
Rect(75.75, 243.75, 79.5, 247.5)
Rect(75.75, 296.25, 79.5, 300.0)
With version >= 1.26.0, I get this long list of clusters, and the call takes significantly longer.
The problem is magnified for longer PDFs with more images.
Rect(68.42168426513672, 104.17657470703125, 114.84117889404297, 118.68190002441406)
Rect(140.42724609375, 104.17657470703125, 170.42828369140625, 118.72410583496094)
Rect(239.57530212402344, 103.84222412109375, 326.4324951171875, 118.72410583496094)
Rect(332.87890625, 107.958251953125, 354.9324951171875, 118.72410583496094)
Rect(361.37890625, 104.17657470703125, 409.683837890625, 118.72410583496094)
Rect(120.88815307617188, 103.84222412109375, 135.1599884033203, 118.734619140625)
Rect(175.67724609375, 104.17657470703125, 233.34117126464844, 118.734619140625)
...
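To put numbers on the comparison between versions, a small helper like the one below could be used. This is only a sketch: `count_and_time` is a hypothetical name, not part of PyMuPDF, and the commented usage assumes the attached PDF sits in the working directory.

```python
import time
from typing import Callable, Iterable, Tuple


def count_and_time(make_items: Callable[[], Iterable]) -> Tuple[int, float]:
    """Call make_items() and return (number of items, elapsed seconds)."""
    start = time.perf_counter()
    items = list(make_items())  # force full evaluation before stopping the clock
    elapsed = time.perf_counter() - start
    return len(items), elapsed


# Hypothetical usage with PyMuPDF (assumes the PDF from this report is available):
# import fitz
# page = fitz.open("what is the tallest mountain on earth.pdf")[0]
# n, secs = count_and_time(page.cluster_drawings)
# print(n, f"{secs:.3f}s")
```

Running this once under pymupdf<1.26.0 and once under 1.26.x should make both the cluster count and the timing difference easy to compare.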
PyMuPDF version
1.26.3
Operating system
MacOS
Python version
3.12
Confirming your observation.
This only happens when there is text written with a Type 3 font. In that case, the vector graphics that draw each Type 3 character are included in the .get_drawings() extraction. For example, in the following picture the red rectangle is the vector graphic and the blue rectangle is the character bbox:
We are currently investigating with the MuPDF team ...
MuPDF issue link: https://bugs.ghostscript.com/show_bug.cgi?id=708875
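To check whether a given page is likely to be affected, one option is to look for Type 3 fonts among the fonts the page references. The sketch below assumes the tuple layout returned by `page.get_fonts()`, that is `(xref, ext, type, basefont, name, encoding)`, and assumes Type 3 fonts are reported with the type string "Type3". `has_type3_font` is a hypothetical helper, not a PyMuPDF function.

```python
from typing import Iterable, Sequence


def has_type3_font(fonts: Iterable[Sequence]) -> bool:
    """Return True if any font entry reports the type 'Type3'.

    Each entry is assumed to follow the layout of page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    return any(len(f) > 2 and f[2] == "Type3" for f in fonts)


# Hypothetical usage with PyMuPDF:
# import fitz
# doc = fitz.open("what is the tallest mountain on earth.pdf")
# for page in doc:
#     print(page.number, has_type3_font(page.get_fonts()))
```

A page for which this returns True would be a candidate for the extra small clusters described above. It is only a heuristic for spotting affected pages and does not change the extraction itself.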