PDFPageCountError: Unable to get page count on Linux

Question

heysander opened this issue 2 years ago · 2 comments

Hi everyone,

I've set up a project that uses pdf2image. I installed Poppler with Homebrew, and it works like a charm locally (on macOS).

Production, on the other hand, is driving me crazy. I set up a Dockerfile and added the following command:
RUN apt update && apt-get install -y poppler-utils

CLI outputs:

$ find / -name poppler-utils
/usr/share/lintian/overrides/poppler-utils
/usr/share/doc/poppler-utils

$ find / -name poppler
/usr/local/lib/python3.10/dist-packages/poppler
/usr/share/poppler

$ pdfinfo
pdfinfo version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfinfo [options] <PDF-file>
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -box                 : print the page bounding boxes
  -meta                : print the document metadata (XML)
  -custom              : print both custom and standard metadata
  -js                  : print all JavaScript in the PDF
  -struct              : print the logical document structure (for tagged files)
  -struct-text         : print text contents along with document structure (for tagged files)
  -isodates            : print the dates in ISO-8601 format
  -rawdates            : print the undecoded date strings directly from the PDF file
  -dests               : print all named destinations in the PDF
  -url                 : print all URLs inside PDF objects (does not scan text content)
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information

Everything seems to be installed correctly. But the moment I call convert_from_path, I get the following error:

PDFPageCountError: Unable to get page count.
Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.

Python code:

    try:
        file_path = f'https://project.blob.core.windows.net/media/invoices/pdf/{file_name}'
        images = convert_from_path(file_path, 500)
        cleaned_name = str(file_name)[:-4]
        for n, img in enumerate(images):
            blob = BytesIO()
            img.save(blob, 'JPEG')
            img_entry = ImageEntry.objects.create(invoice=invoice)
            img_entry.img_file.save(f'{cleaned_name}-{n}.jpg', File(blob), save=True)

    except PDFInfoNotInstalledError as err:
        print(f"PDFInfoNotInstalledError: {err}")

    except PDFPageCountError as err:
        print(f"PDFPageCountError: {err}")

    except PDFSyntaxError as err:
        print(f"PDFSyntaxError: {err}")

    except Exception as err:
        print(f"Exception: {err}")

Docker-compose:
version: '3.4'

services:
  project:
    image: project.azurecr.io/project:latest
    platform: linux/x86_64
    build:
      context: .
      dockerfile: ./Dockerfile
    ports:
      - 8000:8000

The answers I find when searching for this error are all about poppler_path and Windows, which does not help. I hope someone can help me with this issue.

Thanks in advance.

Answer 1 · 2022-10-17T00:11:18.000Z

Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.

It looks like you are trying to parse a PDF hosted at a remote URL, which is not supported. You need to download the file first (to disk or into memory) before trying to parse it.
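
A minimal sketch of the in-memory route, in case it helps. It assumes the blob URL is readable without extra authentication (a private container would need a SAS token or an SDK client instead). The helper names fetch_pdf_bytes and pdf_url_to_jpegs are made up for illustration, and the "%PDF-" check is an extra sanity test, not something pdf2image requires. convert_from_bytes is pdf2image's counterpart to convert_from_path for data that is already in memory.

```python
from io import BytesIO
from urllib.request import urlopen


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the document at `url` and return its raw bytes.

    Hypothetical helper: assumes the URL can be read without credentials.
    """
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    # Sanity check (an assumption of this sketch): a PDF carries a "%PDF-"
    # marker near the start, so an HTML error page is rejected early.
    if b"%PDF-" not in data[:1024]:
        raise ValueError(f"{url} did not return a PDF document")
    return data


def pdf_url_to_jpegs(url: str, dpi: int = 500) -> list:
    """Render every page of a remote PDF to an in-memory JPEG buffer."""
    # Imported lazily so the download helper works without Poppler installed.
    from pdf2image import convert_from_bytes

    pages = convert_from_bytes(fetch_pdf_bytes(url), dpi=dpi)
    buffers = []
    for page in pages:
        buffer = BytesIO()
        page.save(buffer, "JPEG")
        buffer.seek(0)  # rewind so the caller can read from the start
        buffers.append(buffer)
    return buffers
```

In the code from the question, the convert_from_path(file_path, 500) call would be swapped for convert_from_bytes(fetch_pdf_bytes(file_path), dpi=500); the rest of the loop can stay as it is.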

Answer 2 · 2022-10-17T06:48:12.000Z

Works, thanks so much!