get_images() returns 0 images for a page that has an image
Closed this issue · 3 comments
Description of the bug
I have a test PDF file that looks like it contains one image. The entire page is a screenshot that I saved and then converted to a PDF, but I don't remember how I did it since it was several years ago. The only ways I can think of were using Word, or Print to PDF from a browser.
In any case, I tested it two ways. First, I tried getting the text blocks from the page, and there are none.
I also wrote a method to detect whether a page contains images, and when I ran it, it did not detect the image in this one PDF file.
So according to my code it contains no text, but also contains no images. Attached is the file and below is the code I used to detect images on the page.
import pymupdf

print(pymupdf.__doc__)  # show PyMuPDF version information

file_path = r"D:\SOFTWARE_DEVELOPMENT\_APPS\temp\debug_files_for_ocr\testpdf_image1.pdf"


def get_images(file_name: str) -> None:
    """Print the page area and the area of every image found on each page."""
    total_page_area = 0.0
    doc = pymupdf.open(file_name)
    for page_index in range(len(doc)):  # iterate over PDF pages
        page = doc[page_index]  # get the page
        total_page_area = total_page_area + abs(page.bound())  # abs(rect) is its area
        print("total page area:", total_page_area)
        image_list = page.get_images(full=True)
        if image_list:
            # report the number of images found on the page
            print(f"Found {len(image_list)} images on page {page_index}")
            for img in image_list:
                bbox = page.get_image_bbox(img)  # bounding box of the image
                area = bbox.width * bbox.height  # area of the image
                print("image area:", area)
        else:
            print("No images found on page", page_index)
    doc.close()


get_images(file_path)
How to reproduce the bug
Run the code above on the attached file. The output is:
total page area: 484704.0
No images found on page 0
PyMuPDF version
1.26.3
Operating system
Windows
Python version
3.13
Wrong statement 😎 in the title. Should be "get_images() returns 0 images for a page that has no images".
The page has indeed no images.
It also contains no text.
What it contains instead are lots of vector graphics that mimic characters by drawing them.
@JorjMcKie Apologies, I didn't consider vector graphics. I suppose I can use the get_drawings() method to detect them instead.
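For anyone landing here with the same question, a check along those lines might look like the sketch below. It is only an illustration: `classify_page` and `classify_pdf` are hypothetical helper names, not part of PyMuPDF, and the sketch assumes that a page with no images and no text but a non-zero number of drawings is the "vector graphics only" case described above.

```python
from typing import Any


def classify_page(page: Any) -> dict:
    """Report which kinds of content a page holds.

    `page` is expected to behave like a pymupdf.Page, i.e. to offer
    get_images(), get_text() and get_drawings().
    """
    images = page.get_images(full=True)  # embedded raster images
    text = page.get_text().strip()       # extractable text, if any
    drawings = page.get_drawings()       # vector paths (list of dicts)
    return {
        "images": len(images),
        "has_text": bool(text),
        "drawings": len(drawings),
    }


def classify_pdf(file_name: str) -> list[dict]:
    """Hypothetical helper: run classify_page() on every page of a file."""
    import pymupdf  # imported lazily so classify_page() has no hard dependency

    with pymupdf.open(file_name) as doc:
        return [classify_page(page) for page in doc]
```

Calling `classify_pdf(file_path)` on a file like the attached one would be expected to report zero images, no text, and a large drawing count for page 0.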
No worries!
Yes, get_drawings() extracts the vector graphics at a detailed level, and get_svg_image() generates an SVG image of the page (as a string).
SVG is an XML format, whereas get_drawings() delivers a list of Python dictionaries.
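To make the difference between the two formats concrete, here is a minimal sketch that summarizes each kind of output. The helper names `count_draw_items` and `count_svg_tags` are made up for this example. It assumes that each dictionary from get_drawings() has an "items" list whose entries are tuples starting with a short type code, and that the SVG string parses as ordinary XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_draw_items(drawings: list[dict]) -> Counter:
    """Count draw commands by their type code across all paths.

    Each dictionary is expected to carry an "items" list whose entries are
    tuples starting with a short code such as "l" (line), "c" (curve),
    "re" (rectangle) or "qu" (quad).
    """
    counts = Counter()
    for path in drawings:
        for item in path.get("items", []):
            counts[item[0]] += 1
    return counts


def count_svg_tags(svg_text: str) -> Counter:
    """Count XML elements in an SVG string by tag name, namespace stripped."""
    root = ET.fromstring(svg_text)
    return Counter(elem.tag.rsplit("}", 1)[-1] for elem in root.iter())
```

With a real page this would be used as `count_draw_items(page.get_drawings())` and `count_svg_tags(page.get_svg_image())`. The first works on Python objects directly, while the second has to parse the XML text first.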