jorisschellekens/borb

Reading and writing PDF file damages it

6801318d8d opened this issue · 3 comments

#!/usr/bin/env python3

import typing
from borb.pdf import PDF, Document

doc: typing.Optional[Document] = None
with open("test.pdf", "rb") as pdf_file_handle:
    doc = PDF.loads(pdf_file_handle)
assert doc is not None
with open("test2.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, doc)

test.pdf: image

test2.pdf: image

test.pdf

prolly something to do with PIL, looks like the image is loaded BGR instead of RGB
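
To illustrate what a BGR/RGB mix-up would look like, here is a minimal sketch in plain Python (no PIL or borb involved). The pixel values and the helper name `swap_red_blue` are made up for illustration:

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def swap_red_blue(pixels: List[Pixel]) -> List[Pixel]:
    """Reverse the channel order of each pixel (RGB <-> BGR)."""
    return [(b, g, r) for (r, g, b) in pixels]


# Hypothetical pixels: pure red, pure green, and an orange tone
rgb_pixels = [(255, 0, 0), (0, 255, 0), (255, 128, 0)]
bgr_pixels = swap_red_blue(rgb_pixels)
print(bgr_pixels)  # [(0, 0, 255), (0, 255, 0), (0, 128, 255)]
```

Red and blue trade places while green stays put, so if the colours in test2.pdf do not follow that pattern, a channel-order swap is probably not the whole story.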

Hi! I see that the colours in your image have been flipped from the usual RGB palette to a CMYK palette.

To solve this, I know of one way, which requires installing OpenCV, a library for manipulating image data.

This is how I turn a BGR image into an RGB image:

from pprint import pprint

import cv2
import numpy as np
from borb.pdf import PDF

with open("output.pdf", "rb") as file:
    document = PDF.loads(file)

assert document is not None

# Print the first page's resources to see which XObjects it contains
pprint(document.get_page(0)["Resources"])

for key, value in document.get_page(0)["Resources"]["XObject"].items():
    # Assumes every XObject here is an image that NumPy can convert to an array
    image = np.array(value)
    # Swap the first and third channels (BGR <-> RGB)
    open_cv_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    print(open_cv_image.shape)
    # Shrink to a fixed preview size (width=300, height=200)
    open_cv_image = cv2.resize(src=open_cv_image, dsize=(300, 200))
    cv2.imshow(str(key), open_cv_image)

    # Wait for a key press, then close the preview window
    cv2.waitKey(0)
    cv2.destroyAllWindows()

I will try to reproduce this issue and come back to you with a working solution in a few days' time.

I did some investigating and found that the file was "damaged" because borb read the image as RGB instead of CMYK. This can only be fixed by reproducing the PDF with all images set to RGB.
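
For reference, the difference between the two colour models can be sketched with the common naive conversion formula (no ICC colour management, so real converters will give somewhat different values). This is plain Python, not borb's or PIL's actual code, and `cmyk_to_rgb` is a hypothetical helper name:

```python
from typing import Tuple


def cmyk_to_rgb(c: int, m: int, y: int, k: int) -> Tuple[int, int, int]:
    """Naive CMYK -> RGB conversion; every channel is an int in 0..255."""
    # Each RGB channel is what remains after its ink and the black ink absorb light
    r = round(255 * (1 - c / 255) * (1 - k / 255))
    g = round(255 * (1 - m / 255) * (1 - k / 255))
    b = round(255 * (1 - y / 255) * (1 - k / 255))
    return (r, g, b)


print(cmyk_to_rgb(0, 255, 255, 0))  # full magenta + yellow ink -> (255, 0, 0)
print(cmyk_to_rgb(0, 0, 0, 0))      # no ink at all -> (255, 255, 255)
```

Note that all-zero channels mean white in CMYK but black in RGB, which gives a feel for how far off the colours can end up when image data is interpreted in the wrong colour space.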