Reading and writing PDF file damages it
6801318d8d opened this issue · 3 comments
6801318d8d commented
#!/usr/bin/env python3
import typing
from borb.pdf import PDF, Document

doc: typing.Optional[Document] = None
with open("test.pdf", "rb") as pdf_file_handle:
    doc = PDF.loads(pdf_file_handle)
assert doc is not None
with open("test2.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, doc)
DrPlanecraft commented
Probably something to do with PIL; it looks like the image is loaded as BGR instead of RGB.
DrPlanecraft commented
Hi! I see that your image has been flipped from the usual RGB color palette to the CMYK color palette.
To fix this, I know of one way, which requires installing OpenCV, a library for manipulating image data.
This is how I convert a BGR image to an RGB image:
from pprint import pprint

import cv2
import numpy as np
from borb.pdf import PDF

with open("output.pdf", "rb") as file:
    document = PDF.loads(file)
assert document is not None

# Show the resources (including image XObjects) attached to the first page.
pprint(document.get_page(0)["Resources"])

for key, value in document.get_page(0)["Resources"]["XObject"].items():
    image = np.array(value)
    # Swap the channel order from BGR to RGB.
    open_cv_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    OpenCV_imageSize = open_cv_image.shape
    print(OpenCV_imageSize)
    open_cv_image = cv2.resize(src=open_cv_image, dsize=(300, 200))
    cv2.imshow("key", open_cv_image)
    cv2.waitKey(0)
cv2.destroyAllWindows()
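For anyone who would rather not install OpenCV just for this, here is a minimal sketch of what the `cv2.COLOR_BGR2RGB` conversion does to the pixel data: it reverses the order of the three color channels. The helper name `bgr_to_rgb` is made up for illustration, and the sketch assumes the image is already an H x W x 3 NumPy array.

```python
import numpy as np


def bgr_to_rgb(pixels: np.ndarray) -> np.ndarray:
    """Return a copy of an H x W x 3 array with its channel order reversed."""
    if pixels.ndim != 3 or pixels.shape[2] != 3:
        raise ValueError("expected an H x W x 3 array")
    # Reversing the last axis turns (B, G, R) into (R, G, B).
    return pixels[:, :, ::-1].copy()


# A single pure-blue pixel stored in BGR order (hypothetical test data).
bgr_pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb_pixel = bgr_to_rgb(bgr_pixel)
print(rgb_pixel[0, 0])
```

Applying the same function twice returns the original array, so it also converts RGB back to BGR.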
I will try to reproduce this issue and get back to you with a working solution in a few days' time.
DrPlanecraft commented
I did some investigation and found that the file was "damaged" because borb read the image as an RGB image instead of a CMYK image. This can only be fixed by regenerating the PDF with all images set to RGB.
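If you still have the source images, one possible way to make sure they are RGB before rebuilding the PDF is to convert them with Pillow. This is only a sketch: the helper name `ensure_rgb` and the in-memory test image are made up for illustration, and Pillow's built-in `convert` does a simple conversion without ICC color profiles, so the colors may differ slightly from the original.

```python
from PIL import Image


def ensure_rgb(image: Image.Image) -> Image.Image:
    """Return the image unchanged if it is already RGB, else an RGB copy."""
    if image.mode == "RGB":
        return image
    # Image.convert handles CMYK (and other modes) without a color profile.
    return image.convert("RGB")


# Hypothetical stand-in for a CMYK image that would normally come from disk.
cmyk_image = Image.new("CMYK", (4, 4), (0, 255, 255, 0))
rgb_image = ensure_rgb(cmyk_image)
print(cmyk_image.mode, "->", rgb_image.mode)
```

For a real file you would open it with `Image.open(...)`, pass it through the helper, and save the result before adding it to the new PDF.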