PDFPageCountError: Unable to get page count on Linux
heysander opened this issue · 2 comments
Hi everyone,
I've set up a project which uses pdf2image. I installed Poppler with Homebrew and it works like a charm locally (on macOS).
Production, on the other hand, drives me crazy. I set up a Dockerfile and added the following command:
RUN apt update && apt-get install -y poppler-utils
CLI outputs:
$ find / -name poppler-utils
/usr/share/lintian/overrides/poppler-utils
/usr/share/doc/poppler-utils
$ find / -name poppler
/usr/local/lib/python3.10/dist-packages/poppler
/usr/share/poppler
$ pdfinfo
pdfinfo version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfinfo [options] <PDF-file>
-f <int> : first page to convert
-l <int> : last page to convert
-box : print the page bounding boxes
-meta : print the document metadata (XML)
-custom : print both custom and standard metadata
-js : print all JavaScript in the PDF
-struct : print the logical document structure (for tagged files)
-struct-text : print text contents along with document structure (for tagged files)
-isodates : print the dates in ISO-8601 format
-rawdates : print the undecoded date strings directly from the PDF file
-dests : print all named destinations in the PDF
-url : print all URLs inside PDF objects (does not scan text content)
-enc <string> : output text encoding name
-listenc : list available encodings
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
Everything seems to be installed correctly. But the moment I call convert_from_path, I get the following error:
PDFPageCountError: Unable to get page count.
Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.
Python code:
try:
    file_path = f'https://project.blob.core.windows.net/media/invoices/pdf/{file_name}'
    images = convert_from_path(file_path, 500)
    n = 0
    cleaned_name = str(file_name)[:-4]
    for img in images:
        blob = BytesIO()
        img.save(blob, 'JPEG')
        img_entry = ImageEntry.objects.create(invoice=invoice)
        img_entry.img_file.save(f'{cleaned_name}-{n}.jpg', File(blob), save=True)
        n += 1
except PDFInfoNotInstalledError as err:
    print(f"PDFInfoNotInstalledError: {err}")
except PDFPageCountError as err:
    print(f"PDFPageCountError: {err}")
except PDFSyntaxError as err:
    print(f"PDFSyntaxError: {err}")
except Exception as err:
    print(f"Exception: {err}")
Docker-compose:
version: '3.4'
services:
  project:
    image: project.azurecr.io/project:latest
    platform: linux/x86_64
    build:
      context: .
      dockerfile: ./Dockerfile
    ports:
      - 8000:8000
The answers I found by searching for this error are all related to poppler_path and Windows, which does not help. I hope someone can help me with this issue.
Thanks in advance.
Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.
It seems you are trying to parse a PDF hosted on the web, which is not supported. You need to download the file locally (to disk or into memory) before trying to parse it.
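For illustration, here is a minimal sketch of the download-then-convert approach. It assumes the blob URL is readable without authentication. The helper names `is_remote`, `fetch_pdf_bytes` and `convert_pdf` are made up for this example and are not part of pdf2image. Only `convert_from_bytes` and `convert_from_path` come from the library.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def is_remote(source: str) -> bool:
    """Return True when the source is an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the PDF into memory (hypothetical helper, assumes a public URL)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def convert_pdf(source: str, dpi: int = 500):
    """Convert a local path or a remote URL to a list of PIL images."""
    # Imported lazily so the helpers above work even where pdf2image is absent.
    from pdf2image import convert_from_bytes, convert_from_path

    if is_remote(source):
        # Poppler cannot open URLs, so hand it the downloaded bytes instead.
        return convert_from_bytes(fetch_pdf_bytes(source), dpi=dpi)
    return convert_from_path(source, dpi=dpi)
```

In the code from the original post, `images = convert_from_path(file_path, 500)` would then become `images = convert_pdf(file_path, 500)`. If the storage container requires credentials, the download step would need to go through the storage SDK or a signed URL instead.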
Works, thanks so much!