`Page rot` metadata and `size` param interact incorrectly in convert_from_path()
Crowfunder opened this issue · 5 comments
Describe the bug
Converting a PDF with the `size` param when the PDF's `Page rot` rotation metadata changes its original orientation (90, 270, etc.) forces the scanned pages onto, e.g., a horizontal template even though the document is vertical. Any PDF viewer displays the PDF correctly, as a vertical one. As a result, half of the page is cut off and the remainder is squished.
To Reproduce
Steps to reproduce the behavior:
import numpy as np
import cv2
from pdf2image import convert_from_path, pdfinfo_from_path
pdf_path = 'our pdf path'
# Return PDF rotation from its metadata
rotation = pdfinfo_from_path(pdf_path)['Page rot']
print(f'PDF rotation: {rotation}')
# Get the pdf pages' images
images = convert_from_path(pdf_path, 600, size=(1653, 2338))
# Write all page images to files
for i, image in enumerate(images, start=1):
    cv2.imwrite(f'page{i}.jpg', np.array(image))
Expected behavior
Rotation metadata and size param get applied correctly.
Desktop (please complete the following information):
- OS: Debian WSL on Win10
- Version 22
Notes:
I'm well aware that it's probably an issue with Poppler, not with pdf2image, but there may be some workaround, or some info may be gathered here for a Poppler issue.
Theoretically the issue will be resolved if the rotation gets applied into the file permanently, instead of being embedded in metadata.
Could you try to manually run poppler on the asset? Something like:
pdftoppm -r 200 your_asset.pdf out
As you pointed out, this might be an issue with poppler, but I'd like to confirm first. You can also try `pdftocairo` and see if the orientation is correct in that case.
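If it is easier to script, the same check can be driven from Python. This is only a sketch: `render_command` and `render_with` are hypothetical helpers, not part of pdf2image, and `-png` is just one of the output formats pdftocairo accepts (it needs one to be given explicitly).

```python
import subprocess


def render_command(tool, pdf_path, out_prefix, dpi=200):
    """Build the command line for a Poppler renderer (hypothetical helper)."""
    if tool == "pdftoppm":
        return ["pdftoppm", "-r", str(dpi), pdf_path, out_prefix]
    if tool == "pdftocairo":
        # pdftocairo requires an explicit output format; PNG is chosen here.
        return ["pdftocairo", "-png", "-r", str(dpi), pdf_path, out_prefix]
    raise ValueError(f"unknown tool: {tool}")


def render_with(tool, pdf_path, out_prefix, dpi=200):
    """Run the renderer; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(render_command(tool, pdf_path, out_prefix, dpi), check=True)


# Example (requires Poppler on PATH):
# for tool in ("pdftoppm", "pdftocairo"):
#     render_with(tool, "your_asset.pdf", f"out_{tool}")
```

Comparing the two sets of output images against what pdf2image produces should show whether the orientation problem appears with Poppler alone.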
pdftoppm -r 200 your_asset.pdf out
This one worked perfectly.
Ok so the issue is with pdf2image somehow. Can you share the asset?
Forgot to mention that pdftocairo works fine.
the pdf in question https://wormhole.app/kRZQl#5lMmzZ6BtD7RFIGOOaTOsw
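In the meantime, since rendering at a plain resolution gives the right orientation, one possible workaround is to skip `size` and resize in Pillow afterwards. This is a rough sketch that has not been verified against the asset above, and it assumes `convert_from_path` without `size` behaves like the plain `pdftoppm -r` run. `fit_within` and `convert_then_resize` are hypothetical helper names.

```python
def fit_within(original, box):
    """Largest (width, height) with the original aspect ratio that fits in box."""
    width, height = original
    box_w, box_h = box
    scale = min(box_w / width, box_h / height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


def convert_then_resize(pdf_path, box, dpi=200):
    """Render pages without `size`, then scale each page to fit inside box."""
    # Imported lazily: needs pdf2image and Poppler installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi)
    return [page.resize(fit_within(page.size, box)) for page in pages]


# Example:
# images = convert_then_resize('our pdf path', (1653, 2338), dpi=600)
```

Unlike passing both dimensions to `size`, this keeps the aspect ratio of the rendered page, so a rotated page ends up smaller than the box rather than stretched to fill it.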
I just ran into a similar issue, also with dpi not being set correctly. Not sure if this helps the debug process, but in my code I decided to do the following:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=300)
and saw this error in PIL/TiffImagePlugin.py:
ifd[RESOLUTION_UNIT] = 2
ifd[X_RESOLUTION] = dpi[0]
ifd[Y_RESOLUTION] = dpi[1]
which led me to believe dpi should be a 2 element list. So I then tried:
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])
and when I checked the .tif in Preview, the resolution was correct at 300dpi instead of 72.
Just to sum up, I converted an 11 x 8.5 PDF to TIFF using the following lines. I removed dpi=300 from convert_from_path and moved it to save as a 2 element list:
page = convert_from_path(f"{working_path}{pdf}", size=(3300, 2550))
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])
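For reference, the numbers above fit together as follows: an 11 x 8.5 inch page at 300 dpi is 3300 x 2550 pixels. The sketch below just spells out that arithmetic and the scalar-to-pair step; `page_pixels` and `dpi_pair` are hypothetical helper names, not part of pdf2image or Pillow.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page measured in inches at the given dpi."""
    return (round(width_in * dpi), round(height_in * dpi))


def dpi_pair(dpi):
    """Return dpi as an (x, y) tuple, accepting a single number or a pair."""
    if isinstance(dpi, (int, float)):
        return (dpi, dpi)
    x, y = dpi
    return (x, y)


# Example, mirroring the two lines above:
# size = page_pixels(11, 8.5, 300)   # (3300, 2550)
# page = convert_from_path(f"{working_path}{pdf}", size=size)
# page[0].save("out.tif", "TIFF", dpi=dpi_pair(300))
```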
Hope this helps.