A python (3.6+) module that wraps pdftoppm and pdftocairo to convert PDF to a PIL Image object
pip install pdf2image
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612 version which is the most up-to-date. You will then have to add the bin/
folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument
in convert_from_path
.
Mac users will have to install poppler for Mac.
Most distros ship with pdftoppm
and pdftocairo
. If they are not installed, refer to your package manager to install poppler-utils
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
PDFInfoNotInstalledError,
PDFPageCountError,
PDFSyntaxError
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
# Do something here
images
will be a list of PIL Image representing each page of the PDF document.
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)
- Allow users to hide attributes when using pdftoppm with
hide_attributes
(Thank you @StaticRocket) - Fix console opening on Windows (Thank you @OhMyAgnes!)
- Add
timeout
parameter which raisesPDFPopplerTimeoutError
after the given number of seconds. - Add
use_pdftocairo
parameter which forcespdf2image
to usepdftocairo
. Should improve performance. - Fixed a bug where using
pdf2image
with multiple threads (but not multiple processes) would cause and exception jpegopt
parameter allows for tuning of the output JPEG when usingfmt="jpeg"
(-jpegopt
in pdftoppm CLI) (Thank you @abieler)pdfinfo_from_path
andpdfinfo_from_bytes
which expose the output of the pdfinfo CLIpaths_only
parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDFsize
parameter allows you to define the shape of the resulting images (-scale-to
in pdftoppm CLI)size=400
will fit the image to a 400x400 box, preserving aspect ratiosize=(400, None)
will make the image 400 pixels wide, preserving aspect ratiosize=(500, 500)
will resize the image to 500x500 pixels, not preserving aspect ratio
grayscale
parameter allows you to convert images to grayscale (-gray
in pdftoppm CLI)single_file
parameter allows you to convert the first PDF page only, without adding digits at the end of theoutput_file
- Allow the user to specify poppler's installation path with
poppler_path
- Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
- Using multiple threads can give you some gains but avoid more than 4 as this will cause i/o bottleneck (even on my NVMe SSD!).
- If i/o is your bottleneck, using the JPEG format can lead to significant gains.
- PNG format is pretty slow, this is because of the compression.
- If you want to know the best settings (most settings will be fine anyway) you can clone the project and run
python tests.py
to get timings.
- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)