PIL.UnidentifiedImageError
camipozas opened this issue · 4 comments
camipozas commented
Describe the bug
The same code behaves differently on my computer than on an AWS EC2 m5.xlarge instance.
Expected behavior
Both environments should behave the same. The code works on my computer, but when I run it on the instance it cannot identify the images.
AWS Log
Process Process-1:
Traceback (most recent call last):
  File "/opt/build/app/read_contracts.py", line 67, in read_contracts
    text_contract = read_pdf(filepath)
  File "/opt/build/app/read_contracts.py", line 27, in read_pdf
    images_from_path = convert_from_path(pdf_path=pdf,
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 218, in convert_from_path
    images += _load_from_output_folder(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 517, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3123, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpqo3mn0om/2d473b9f-5b6c-46f0-9220-a4bf51124f6e-03.ppm'
Desktop (please complete the following information):
- OS: Ubuntu (AWS EC2 m5.xlarge instance)
- Version: 22.04
Additional context
Function that raises the error:
import tempfile

from pdf2image import convert_from_path
from tqdm import tqdm

# image_to_text is the OCR helper used in the original post; its definition is not shown.

def read_pdf(pdf):
    """
    Convert a PDF file to images, then extract the text from those images.
    :param pdf: The path to the PDF file you want to convert
    :return: A string with the text of the PDF
    """
    full_text = ''
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(pdf_path=pdf,
                                             dpi=350,
                                             output_folder=path)
        # Run OCR while the temporary folder still exists.
        for page in tqdm(images_from_path):
            full_text += image_to_text(page, lang='spa')
    return full_text
I printed the filenames to check whether it was a path issue, but they are displayed correctly. I am also using multiprocessing; again, it works locally but not on the instance.
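Since the paths look correct, one thing that might be worth checking is whether the files themselves are complete. This is only a guess and is not confirmed anywhere in this thread: at 350 dpi an uncompressed RGB Letter-size page is roughly 34 MB as PPM, so several worker processes writing to /tmp at once could fill the disk and leave empty or truncated files, which Pillow cannot identify. A minimal sketch of such a check follows; `report_output_folder` and `find_empty_files` are hypothetical helper names, not part of pdf2image.

```python
import os
import shutil


def report_output_folder(folder):
    """Return (free_bytes, {filename: size_in_bytes}) for an output folder."""
    # Free space on the filesystem that holds the folder (e.g. /tmp).
    free_bytes = shutil.disk_usage(folder).free
    sizes = {
        name: os.path.getsize(os.path.join(folder, name))
        for name in sorted(os.listdir(folder))
    }
    return free_bytes, sizes


def find_empty_files(sizes):
    """Return the names of zero-byte files from a {filename: size} mapping."""
    return [name for name, size in sizes.items() if size == 0]
```

Calling `report_output_folder(path)` from an `except` clause placed inside the `with tempfile.TemporaryDirectory() as path:` block would show the file sizes and free space before the folder is deleted. If disk space does turn out to be the cause, passing `fmt="jpeg"` to `convert_from_path` or lowering `dpi` would produce smaller files.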
Belval commented
Is this only happening with a single PDF? If you run pdftoppm -r 200 -jpeg your_file.pdf out, does it show any warnings?
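If it is easier to run this check from Python on the instance, the command above could be wrapped roughly as follows. This is a sketch rather than anything pdf2image provides: `build_pdftoppm_command` and `run_pdftoppm` are hypothetical helper names, and it assumes `pdftoppm` is on the PATH.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, out_prefix, dpi=200):
    """Return the argument list for: pdftoppm -r <dpi> -jpeg <pdf> <prefix>."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", str(pdf_path), str(out_prefix)]


def run_pdftoppm(pdf_path, out_prefix, dpi=200):
    """Run pdftoppm and return (return_code, stderr_text)."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm was not found on PATH")
    result = subprocess.run(
        build_pdftoppm_command(pdf_path, out_prefix, dpi),
        capture_output=True,  # collect warnings instead of printing them
        text=True,
    )
    return result.returncode, result.stderr


# Example (not executed here):
# code, warnings = run_pdftoppm("your_file.pdf", "out")
# print(code, warnings)
```

Any warnings pdftoppm prints go to stderr, so a non-empty second value or a non-zero return code would be worth posting here.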
asanaa8 commented
Same error as @camipozas.