pymupdf/PyMuPDF

get_images() returns 0 images for a page that has an image

Closed this issue · 3 comments

Description of the bug

I have a test PDF file that appears to contain one image. The entire page is a screenshot that I saved and then converted to a PDF, but I don't remember how, since it was several years ago. The only ways I can think of are using Word or Print to PDF from a browser.

In any case, I tested it two ways. First, I tried getting the text blocks from the page, and there are none.

I also wrote a function to detect whether a page contains images, and when I ran it, it did not detect the image in this one PDF file.

So according to my code it contains no text, but also contains no images. Attached is the file and below is the code I used to detect images on the page.


import pymupdf

print(pymupdf.__doc__)  # show the PyMuPDF / MuPDF version banner
file_path = r"D:\SOFTWARE_DEVELOPMENT\_APPS\temp\debug_files_for_ocr\testpdf_image1.pdf"


def get_images(file_name: str) -> None:
    """Print the image count and the area of each image for every page."""
    total_page_area = 0.0

    doc = pymupdf.open(file_name)

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page
        total_page_area += abs(page.bound())  # abs() of a Rect is its area
        print("total page area: ", total_page_area)
        image_list = page.get_images(full=True)

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
            # Iterate through the images on the page
            for img in image_list:
                #print(page.get_image_bbox(img))
                bbox = page.get_image_bbox(img)  # Get the bounding box of the image
                area = bbox.width * bbox.height  # Calculate the area of the image
                print("image area:", area)
        else:
            print("No images found on page", page_index)

    doc.close()

get_images(file_path)


How to reproduce the bug

Run the code above on the attached file. The output is:

total page area:  484704.0
No images found on page 0

testpdf_image1.pdf

PyMuPDF version

1.26.3

Operating system

Windows

Python version

3.13

Wrong statement 😎 in the title. Should be "get_images() returns 0 images for a page that has no images".
The page has indeed no images.
It also contains no text.
What it contains instead is a large number of vector graphics that mimic characters by drawing them.

@JorjMcKie Apologies, I didn't consider vector graphics. I suppose I can use the get_drawings method to detect them instead.

No worries!
Yes, get_drawings() extracts the vector graphics at a detailed level, and get_svg_image() generates an SVG image of the page (as a string).
SVG is an XML format, whereas get_drawings() returns a list of Python dictionaries.