Belval/pdf2image

PIL.UnidentifiedImageError

camipozas opened this issue · 4 comments

Describe the bug
The same code behaves differently on my computer and on an AWS EC2 m5.xlarge instance.

Expected behavior
Both environments should behave the same. The code works on my computer, but when I run it on the instance it fails to load the generated images.

AWS Log

Process Process-1:
Traceback (most recent call last):
  File "/opt/build/app/read_contracts.py", line 67, in read_contracts
    text_contract = read_pdf(filepath)
  File "/opt/build/app/read_contracts.py", line 27, in read_pdf
    images_from_path = convert_from_path(pdf_path=pdf,
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 218, in convert_from_path
    images += _load_from_output_folder(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 517, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3123, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpqo3mn0om/2d473b9f-5b6c-46f0-9220-a4bf51124f6e-03.ppm'

Desktop

  • OS: Ubuntu 22.04, on an m5.xlarge instance.

Additional context

Function that raises the error

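import tempfile

from pdf2image import convert_from_path
from tqdm import tqdm

# image_to_text is defined elsewhere in the reporter's code and is not shown here.


def read_pdf(pdf):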
    """
    It takes a pdf file, converts it to images, and then converts those images to text
    :param pdf: The path to the PDF file you want to convert
    :return: A string with the text of the pdf
    """
    full_text = ''
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(pdf_path=pdf,
                                             dpi=350,
                                             output_folder=path)

        for page in tqdm(images_from_path):
            full_text += image_to_text(page, lang='spa')
    return full_text

I printed the filenames to check whether it was a path issue, but they display correctly. I am also using multiprocessing; again, it works locally but not on the instance.
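Printing the filenames only confirms that the paths exist; it does not show whether the files themselves are complete. A minimal diagnostic sketch follows, assuming (this is not confirmed anywhere in the thread) that an empty or truncated .ppm file, for example from a full /tmp, could be one cause of the "cannot identify image file" error. The helper names inspect_output_folder and free_megabytes are hypothetical, and the check relies on the assumption that pdftoppm writes binary PPM files beginning with the bytes P6.

```python
import shutil
from pathlib import Path

# Assumption: pdftoppm writes binary PPM files, which start with b"P6".
PPM_MAGIC = b"P6"


def inspect_output_folder(folder):
    """Return (name, size_in_bytes, has_ppm_header) for each .ppm file in folder."""
    report = []
    for path in sorted(Path(folder).glob("*.ppm")):
        with path.open("rb") as handle:
            magic = handle.read(2)  # first two bytes hold the format marker
        report.append((path.name, path.stat().st_size, magic == PPM_MAGIC))
    return report


def free_megabytes(folder):
    """Return the free space, in MiB, on the filesystem that holds folder."""
    return shutil.disk_usage(folder).free // (1024 * 1024)
```

Because the exception is raised inside convert_from_path, these helpers would have to be called from an except block placed inside the with statement, while the temporary directory still exists. A file reported with size 0 or without the expected header, or very little free space, would point to the conversion output rather than to the paths.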

Is this only happening with a single PDF? If you run pdftoppm -r 200 -jpeg your_file.pdf out, does it show any warnings?
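When the conversion runs inside worker processes on a remote instance, warnings printed to the terminal are easy to miss. The sketch below runs the same pdftoppm command from Python and returns whatever the tool writes to standard error, so it can be logged. The function names build_pdftoppm_command and run_pdftoppm are hypothetical, and the sketch assumes pdftoppm is on the PATH and reports its warnings on stderr.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=200):
    """Build the argument list for: pdftoppm -r <dpi> -jpeg <pdf> <prefix>."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", str(pdf_path), str(output_prefix)]


def run_pdftoppm(pdf_path, output_prefix, dpi=200):
    """Run pdftoppm and return (exit_code, stderr_text).

    Raises FileNotFoundError if pdftoppm is not installed or not on the PATH.
    """
    result = subprocess.run(
        build_pdftoppm_command(pdf_path, output_prefix, dpi),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured output as text
        check=False,          # report a non-zero exit code instead of raising
    )
    return result.returncode, result.stderr
```

Calling run_pdftoppm("your_file.pdf", "out") on both the local machine and the instance, then comparing the exit code and the stderr text, would show whether the conversion itself behaves differently between the two environments.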

Same error as @camipozas.

@asanaa8 I fixed with this