ajrcarey/pdfium-render

`get_processed_image` gives reversed byte order

Victor-N-Suadicani opened this issue · 12 comments

So I tried the following:

    let document = pdfium.load_pdf_from_file(&args.file, None)?;

    for (page_n, page) in document.pages().iter().enumerate() {
        for (object_n, object) in page.objects().iter().enumerate() {
            match &object {
                PdfPageObject::Text(_) => println!("Got Text object"),
                PdfPageObject::Path(_) => println!("Got Path object"),
                PdfPageObject::Image(image_obj) => {
                    println!("Got Image object");
                    let image = image_obj.get_processed_image(&document)?;
                    image.save(format!("{page_n:0>3}_{object_n:0>3}.jpeg"))?;
                }
                PdfPageObject::Shading(_) => println!("Got Shading object"),
                PdfPageObject::FormFragment(_) => println!("Got FormFragment object"),
                PdfPageObject::Unsupported(_) => println!("Got Unsupported object"),
            }
        }
    }

This worked on the first image of a PDF I tried it on, after which I got an error Error: Pdfium(PdfiumLibraryInternalError(Unknown)). But the important point is that the image that did manage to get extracted had the colours all wrong - a mostly red image turned blue. I'm guessing this is due to #9 and that this doesn't use the set_reverse_byte_order flag?

Hi @Victor-N-Suadicani, thanks for raising this. Yes, this sounds a lot like #9, doesn't it? Pdfium internally uses BGR8 byte ordering throughout; the upstream image crate got rid of their support for BGR8 (or, at the very least, made working with it a lot harder) in their move from 0.23 to 0.24. For rendering complete pages, it is possible to tell Pdfium to flip the byte order, as you noted, by setting a flag during rendering. For getting the processed image from an image object, Pdfium provides the function FPDFImageObj_GetRenderedBitmap(), but there is no provision for passing any rendering flags.

It may be necessary to do some byte flipping manually within pdfium-render. I'm happy to implement this, preferably by using functionality provided by the upstream image crate if possible, or by writing it myself if necessary.

Can you attach the sample PDF you are working with?

I'm curious as to what might have caused the error you mentioned, but perhaps it will become clear upon examining your PDF.

I can't share the PDF unfortunately but I get the same error on the image-test.pdf you have yourself in the test directory.

Also interestingly the images there seem to turn very tiny after extraction...

Ok, that's fine. I'm going to work on this tomorrow.

Ok. There's at least three problems here :/

  • Pdfium uses BGR/BGRA internally; image::DynamicImage dropped BGR support in 0.24.x, leaving only RGB/RGBA support intact. pdfium-render needs to seamlessly convert between BGR* and RGB* as required, and doesn't. In the first instance, as you've identified, pdfium-render needs to convert from BGR* to RGB* as required when extracting image data from PdfPageImageObject objects, otherwise the colors are inverted.
  • Additionally, when assigning image data to an image object, pdfium-render needs to convert from RGB* to BGR* as required, otherwise the colors are inverted.
  • Thirdly, Pdfium does not always return a valid FPDF_BITMAP handle when asked to render an image object to a bitmap during a call to FPDFImageObj_GetRenderedBitmap(). This is why you see a Pdfium(PdfiumLibraryInternalError(Unknown)). That function is convenient to use, since it returns a bitmap directly rather than raw pixel data, but it is marked as experimental in the Pdfium API. I will try to use the older, non-experimental FPDFImagObj_GetImageDataDecoded() function to see if it is more stable. Those cases for which FPDFImageObj_GetRenderedBitmap() does not return a valid bitmap, it does not set a error value, so there is no indication from Pdfium as to exactly what the problem was.

Created utils::pixels module for pixel color channel conversion. Added utils::pixels::bgr_to_rgba(), utils::pixels::bgra_to_rgba(), utils::pixels::rgb_to_rgba(), and utils::pixels::rgba_to_bgra() functions. Added correct color conversions when assigning a new image to, or extracting an image from, a PdfPageImageObject.

I'll continue tomorrow with the third problem noted above, with a view to pushing all changes by the end of tomorrow.

Added new PdfPageImageObject::set_bitmap() function, to allow setting an object's image data from a BGR* bitmap without performing a double color conversion to and from image::DynamicImage first. Exposed utils::pixels::* conversion functions in PdfiumLibraryBindings. Updated README.md.

Does FPDFImagObj_GetImageDataDecoded take into account the mask (as I believe FPDFImageObj_GetRenderedBitmap is designed to do, despite being buggy)? I was hoping to be able to extract images with transparency.

No idea! My initial interest is in seeing whether FPDFImageObj_GetImageDataDecoded() has the same runtime behaviour as FPDFImageObj_GetRenderedBitmap(), i.e. does _GetImageDataDecoded() fail in the same places as _GetRenderedBitmap()? And the answer is no, it does not. _GetImageDataDecoded() always returns a valid byte stream. This suggests that bitmap rendering inside _GetRenderedBitmap() is the problem.

You're not the only person to encounter this problem. Based on the tiny hint at https://groups.google.com/g/pdfium/c/V-H9LpuHpPY, I looked at the transformation matrices of every image object in image-test.pdf. And lo and behold, where either of the matrix values a and d are negative values, _GetRenderedBitmap() fails. Where those matrix values are both positive, _GetRenderedBitmap() succeeds.

This suggests that this is indeed a bug in Pdfium, and you may wish to open a bug upstream. I am open to implementing a work-around in pdfium-render, if possible. For example, simply flipping the matrix a and d values from negative to positive results in a successful call to _GetRenderedBitmap() every time. Presumably, however, this affects the processed output.

I think at this point I'm going to push the changes I've made so far, and post my test code here, so you can play around with it and have a think about what you want to do.

In fact, since this _GetRenderedBitmap() problem is separate from the original reason you opened the issue - the mismatched coloring - I'm going to create a new issue. This issue will remain focussed on the color byte ordering.

Spawned new issue #52 to handle the upstream issue. Implemented unit tests for utils::pixels::* color conversion functions. Scheduled for inclusion in crate version 0.7.21.

You can experiment with my changes to PdfPageImageObject by adding pdfium-render as a git dependency in your project's Cargo.toml.

My sample code is basically the same as yours, but for completeness, here it is:

use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    let bindings =
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("./"))
            .or_else(|_| Pdfium::bind_to_system_library())?;

    let pdfium = Pdfium::new(bindings);

    let document = pdfium.load_pdf_from_file("image-test.pdf", None)?;

    for (page_n, page) in document.pages().iter().enumerate() {
        for (object_n, object) in page.objects().iter().enumerate() {
            match &object {
                PdfPageObject::Text(_) => println!("Got Text object"),
                PdfPageObject::Path(_) => println!("Got Path object"),
                PdfPageObject::Image(image_obj) => {
                    println!("Got Image object");

                    let image = image_obj.get_processed_image(&document)?;
                    image
                        .save(format!(
                            "{page_n:0>3}_{object_n:0>3}-processed.jpeg"
                        ))
                        .map_err(|_| PdfiumError::ImageError)?;
                }
                PdfPageObject::Shading(_) => println!("Got Shading object"),
                PdfPageObject::FormFragment(_) => println!("Got FormFragment object"),
                PdfPageObject::Unsupported(_) => println!("Got Unsupported object"),
            }
        }
    }

    Ok(())
}

Thanks for the quick work! Using the latest master, the color problem has definitely been fixed. I'm able to extract many more images now as well, though I still run into an error later on in my own PDF (image-test.pdf works however). The biggest problem I think is that the images extracted in this way are quite small for some PDFs, much smaller than the highest possible resolution as far as I can tell.

Ok, that's good news about the colors. Pdfium returning the wrong image size may be related to #52; in any case, it is definitely a problem in Pdfium, rather than pdfium-render. I propose we mark this issue - which was with the color conversion (or lack thereof) between RGBA and BGRA, and which definitely was a bug in pdfium-render, not Pdfium - as complete, and let's discuss options for working around upstream problems in Pdfium as part of #52.

As there have been no further comments on this, and I believe the original problem has been fixed, I am closing this issue. Let's continue discussion about your upstream problems in #52.

Fix for reversed byte order released as part of 0.7.21. Published to crates.io.