Belval/pdf2image

`Page rot` metadata and `size` param interact incorrectly in convert_from_path()

Crowfunder opened this issue · 5 comments

Describe the bug
Converting a PDF with the size param, when the PDF's Page rot metadata rotates it away from its original orientation (90, 270, etc.), forces the scanned pages onto e.g. a horizontal template even though the document is vertical. Any PDF viewer displays the PDF correctly, as a vertical one. As a result, half of the page is cut off and the remainder is squished.

To Reproduce
Steps to reproduce the behavior:

import numpy as np
import cv2
from pdf2image import convert_from_path, pdfinfo_from_path

pdf_path = 'our pdf path'

# Return PDF rotation from its metadata
rotation = pdfinfo_from_path(pdf_path)['Page rot']
print(f'PDF rotation: {rotation}')

# Get the pdf pages' images
images = convert_from_path(pdf_path, 600, size=(1653, 2338))

# Write all page images to files
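for i, image in enumerate(images, start=1):
    # PIL images are RGB, while cv2.imwrite expects BGR channel order
    cv2.imwrite(f'page{i}.jpg', cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))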

Expected behavior
Rotation metadata and size param get applied correctly.

Screenshots
An example page from a pdf with rotation

Desktop (please complete the following information):

  • OS: Debian WSL on Win10
  • Version 22

Notes:
I'm well aware that it's probably an issue with Poppler, not with pdf2image, but there may be some workaround, or some info may be gathered here for a Poppler issue.

Theoretically the issue will be resolved if the rotation gets applied into the file permanently, instead of being embedded in metadata.
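
One possible workaround along these lines, as a minimal sketch: read the rotation with pdfinfo_from_path and swap the requested width and height when the page is rotated by 90 or 270 degrees. The helper name size_for_rotation is hypothetical, and the assumption that the size tuple ends up applied in the unrotated page frame has not been confirmed in this thread, so this is something to try rather than a known fix.

```python
def size_for_rotation(size, rotation):
    """Swap width and height when the page is rotated by 90 or 270 degrees.

    `size` is the (width, height) tuple intended for the upright page.
    `rotation` is the 'Page rot' value (int or numeric string).
    """
    width, height = size
    if int(rotation) % 180 == 90:
        # Assumption: a quarter-turn page needs the swapped tuple
        return (height, width)
    return (width, height)


# Possible usage (not executed here; requires pdf2image and poppler):
#   from pdf2image import convert_from_path, pdfinfo_from_path
#   rotation = pdfinfo_from_path(pdf_path)['Page rot']
#   size = size_for_rotation((1653, 2338), rotation)
#   images = convert_from_path(pdf_path, 600, size=size)
```

If the output comes out correctly oriented with the swapped tuple, that would point at the size handling rather than at the rotation metadata itself.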

Belval commented

Could you try to manually run poppler on the asset? Something like:

pdftoppm -r 200 your_asset.pdf out

As you pointed out this might be an issue with poppler but I'd like to confirm first. You can also try to use pdftocairo and see if the orientation is correct in that case.

pdftoppm -r 200 your_asset.pdf out
This one worked perfectly.
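
The manual command above passes only -r, so it does not exercise the scaling that the size param triggers. As a sketch for narrowing this down, the helper below builds a pdftoppm argument list that also adds -scale-to-x and -scale-to-y, which are existing pdftoppm options. That pdf2image translates size=(w, h) into exactly these flags is assumed here, and the names build_pdftoppm_args and run_pdftoppm are hypothetical.

```python
import subprocess


def build_pdftoppm_args(pdf_path, out_prefix, dpi=200, size=None):
    """Build a pdftoppm command line, optionally with explicit pixel scaling."""
    args = ["pdftoppm", "-r", str(dpi)]
    if size is not None:
        width, height = size
        # Assumption: these are the flags the size tuple is mapped to
        args += ["-scale-to-x", str(width), "-scale-to-y", str(height)]
    return args + [pdf_path, out_prefix]


def run_pdftoppm(pdf_path, out_prefix, dpi=200, size=None):
    """Run pdftoppm with the built arguments (requires poppler on PATH)."""
    subprocess.run(build_pdftoppm_args(pdf_path, out_prefix, dpi, size), check=True)
```

Calling run_pdftoppm("your_asset.pdf", "out", 600, (1653, 2338)) and comparing the result with the plain -r run would show whether the scaling flags alone reproduce the cut-off page.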

Belval commented

Ok so the issue is with pdf2image somehow. Can you share the asset?

Forgot to mention that pdftocairo works fine.
the pdf in question https://wormhole.app/kRZQl#5lMmzZ6BtD7RFIGOOaTOsw

I just ran into a similar issue, also with dpi not being set correctly. Not sure if this helps the debug process, but in my code I decided to do the following:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=300)

and saw this error in PIL/TiffImagePlugin.py:
ifd[RESOLUTION_UNIT] = 2
ifd[X_RESOLUTION] = dpi[0]
ifd[Y_RESOLUTION] = dpi[1]

which led me to believe dpi should be a 2 element list. So I then tried:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

and when I checked the .tif in Preview, the resolution was correct at 300dpi instead of 72.

Just to sum up, I converted an 11 x 8.5 PDF to TIFF using the following lines. I removed dpi=300 from convert_from_path and moved it to save as a 2 element list:
page = convert_from_path(f"{working_path}{pdf}", size=(3300, 2550))
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

Hope this helps.
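
The size=(3300, 2550) used above is the page size in inches multiplied by the target dpi. A small sketch of that arithmetic follows; the helper name pixels_for_page is hypothetical. The dpi is returned as an (x, y) pair, which is the form Pillow's save documents for its dpi parameter.

```python
def pixels_for_page(width_in, height_in, dpi):
    """Return the pixel size for a page in inches, plus a (x, y) dpi pair."""
    size = (round(width_in * dpi), round(height_in * dpi))
    return size, (dpi, dpi)


size, dpi_pair = pixels_for_page(11, 8.5, 300)
print(size, dpi_pair)  # (3300, 2550) (300, 300)

# Possible usage (not executed here; requires pdf2image and poppler):
#   page = convert_from_path(pdf_file, size=size)
#   page[0].save("out.tif", "TIFF", dpi=dpi_pair)
```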