danfickle/openhtmltopdf

'java.lang.OutOfMemoryError' when using a Base64 encoded, embedded JPEG image

skjardenCode opened this issue · 8 comments

Hello,

I recently experienced an OutOfMemory error while using the OpenHtmlToPDF framework. Our requirements are rather normal, that is, generating a PDF file out of a simple HTML file which contains only basic CSS 2.0 and XHTML - mainly tables, text and up to three images.

We ran a stress test because the framework should be integrated in our server component, which needs to convert HTML to PDF for our clients. I used a fairly simple for-loop to iterate over HTML files and for each HTML-content, we used OpenHtmlToPDF to generate a PDF file. After about 6000 iterations, the test stopped and a "java.lang.OutOfMemoryError" was shown.

I then took apart all the components, stripped away code step by step to reproduce the OutOfMemoryError with minimal test-code and the result was this simple test case:

@Test
public void test_stressPdfRendererBuilder() throws Exception
{
    int count = 10000;

    String html = FileUtils.readFileToString( new File( "html-with-embedded-jpg.html" ), Charsets.UTF_8 );

    for ( int i = 0; i < count; i++ )
    {
        System.err.println( "i: " + i );

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.withHtmlContent( html, null );
        builder.toStream( byteArrayOutputStream );

        builder.run();
    }
}

The file html-with-embedded-jpg.html is a simple HTML file with an img tag containing an embedded JPEG image (Base64 encoded). Any browser can display that HTML file together with the image.

Running the above test, one can see in the Windows Task Manager, how the occupied memory grows rapidly (interestingly, the Java heap space is doing "ok"). In iteration 5000, it was at ~ 1,6 GB.

After about iteration 6000, the "java.lang.OutOfMemoryError" occurs, with the following stack trace:

java.lang.OutOfMemoryError: Initializing Reader
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.initJPEGImageReader(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.<init>(Unknown Source)
    at com.sun.imageio.plugins.jpeg.JPEGImageReaderSpi.createReaderInstance(Unknown Source)
    at javax.imageio.spi.ImageReaderSpi.createReaderInstance(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.readJPEG(JPEGFactory.java:103)
    at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.createFromStream(JPEGFactory.java:78)
    at com.openhtmltopdf.pdfboxout.PdfBoxOutputDevice.realizeImage(PdfBoxOutputDevice.java:688)
    at com.openhtmltopdf.pdfboxout.PdfBoxUserAgent.getImageResource(PdfBoxUserAgent.java:81)
    at com.openhtmltopdf.pdfboxout.PdfBoxReplacedElementFactory.createReplacedElement(PdfBoxReplacedElementFactory.java:58)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1524)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidthInlineChildren(BlockBox.java:1684)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1567)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.recalcColumn(TableBox.java:1240)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.fullRecalc(TableBox.java:1214)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.calcMinMaxWidth(TableBox.java:1509)
    at com.openhtmltopdf.newtable.TableBox.calcMinMaxWidth(TableBox.java:158)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:221)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:431)
    at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:54)
    ...
    ...

I dug into the code and ended up in PdfBoxOutputDevice.realizeImage(PdfBoxImage), where I found the following lines:

if (img.isJpeg()) {
    xobject = JPEGFactory.createFromStream(_writer,
            new ByteArrayInputStream(img.getBytes()));
} else {
    BufferedImage buffered = ImageIO.read(new ByteArrayInputStream(
            img.getBytes()));

    xobject = LosslessFactory.createFromImage(_writer, buffered);
}

So there is a condition: JPEGFactory.createFromStream is used if the image is a JPEG, otherwise ImageIO.read is used.

So I changed my test HTML file to embed a PNG instead of a JPEG image, and the OutOfMemory error was gone. The Java heap is doing fine, and the Windows Task Manager shows only ~ 80 MB memory usage for the Java process no matter how many iterations I run.

Doing a simple search I came across this:

Maybe there is a problem in PDFBox, or in the way the PDFBox API is used to integrate a JPEG image into a PDDocument; I'm not sure.

So, the workaround for me is to not use the JPEG image format when embedding an image into the HTML code, but instead using PNG.
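
A minimal sketch of that workaround, assuming the conversion is done once ahead of time (not on every render) and that Java 8 or later is available for java.util.Base64. The class name PngDataUri and the method fromImageBytes are hypothetical helpers, not part of OpenHtmlToPDF or PDFBox:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper: re-encodes image bytes (e.g. a JPEG) as a PNG data URI
// that can be placed in the src attribute of an img tag.
final class PngDataUri {
    static String fromImageBytes(byte[] imageBytes) throws IOException {
        // ImageIO.read returns null when no registered reader understands the data.
        BufferedImage decoded = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (decoded == null) {
            throw new IOException("Unsupported or corrupt image data");
        }
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer for the format is available.
        if (!ImageIO.write(decoded, "png", png)) {
            throw new IOException("No PNG writer available");
        }
        return "data:image/png;base64,"
                + Base64.getEncoder().encodeToString(png.toByteArray());
    }
}
```

The returned string would replace the data:image/jpeg;base64,... value in the HTML. Note that PNG is lossless, so the embedded data will usually be larger than the original JPEG.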

I wanted to post this issue here first. I'm sure you know how to debug the code better than me, but I hope I could help a bit with the above information.

I hope to hear from you and that there is an easy fix for it. Or maybe this turns out to be a bug in PDFBox after all.

Thanks a lot!

Hi @skjardenCode
Thanks for the detailed work up. It does seem it may be a pdfbox bug, but I'll try to reproduce it with raw pdfbox code to make sure.

The pdfbox code in question is at the link below. The only thing I can see is that all the readers returned by the iterator may not have dispose called on them. It is also surprising that they always decompress the entire image even though I believe they just need metadata.

https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/graphics/image/JPEGFactory.java
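
To illustrate both points with plain javax.imageio (no PDFBox involved), here is a sketch that reads only the image dimensions, without decoding pixel data, and disposes the reader it obtained from the iterator. The class name JpegDimensions is hypothetical, and this is not the PDFBox implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical helper: returns {width, height} of the first image in the data.
final class JpegDimensions {
    static int[] read(byte[] imageBytes) throws IOException {
        ImageInputStream iis =
                new MemoryCacheImageInputStream(new ByteArrayInputStream(imageBytes));
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                throw new IOException("No ImageReader found for this data");
            }
            // Each call to next() creates a new reader instance, so next() is
            // called exactly once here and that instance is always disposed.
            ImageReader reader = readers.next();
            try {
                // ignoreMetadata = true: only the header is needed for the size.
                reader.setInput(iis, false, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        } finally {
            iis.close();
        }
    }
}
```

Whether header-only access is enough for what JPEGFactory needs is an open question here; the sketch only shows that the dimensions are available without a full decode.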

I'll comment again here when I have debugged further.

Unfortunately, I can't replicate this on Mac (even with -Xmx30m). Possibly a Windows-specific issue? If you get a moment, could you please run the following code? If it crashes too, it will tell us definitively that the bug is in PDFBox or the JRE.

import java.io.ByteArrayInputStream;
import java.util.Base64;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory;


public class TestUsage {
    public static void main(String...args) throws Exception {
        String jpeg = 
                "/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIs" +
                "IxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy" + 
                "MjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAABAAEDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAA" + 
                "AAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAk" + 
                "M2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKT" + 
                "lJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QA" +
                "HwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdh" + 
                "cRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hp" + 
                "anN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk" + 
                "5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigD//2Q==";

        byte[] jpegBytes = Base64.getDecoder().decode(jpeg);

        for (int i = 0; i < 10000000; i++) {
            PDDocument doc = new PDDocument();

            try {
                JPEGFactory.createFromStream(doc, new ByteArrayInputStream(jpegBytes));
            } finally {
                doc.close();
            }
        }
    }
}

Thanks,
Daniel.

Hey Daniel,

thanks a lot for your reply. I'm currently on a trip, I'll test your code as soon as I get back home in a few hours.

The only thing I can see is that all the readers returned by the iterator may not have dispose called on them.

Are you referring to the method private static BufferedImage readJPEG(InputStream stream) throws IOException? At the end of it, there is a

// ....
finally
{
    if (iis != null)
    {
        iis.close();
    }
    reader.dispose();
}

which disposes of the reader that was used. Or do you see another location where a reader does not get disposed of properly?

~ Timo

Hey Daniel,

I'm sorry for the late answer, had a lot of work to be finished first.

I did the following 3 things, with these results:

  1. Running your example code
    --> No OutOfMemory error; Windows Task Manager shows normal and stable memory usage even with thousands of iterations

  2. Next I replaced your JPEG data with the one I used inside my test HTML file, which is bigger than the one in your example
    --> Again no OutOfMemory error; Windows Task Manager shows normal and stable memory usage even with thousands of iterations

  3. Third I wrapped the JPEG data in very simple HTML and extended the test to use PdfRendererBuilder / builder.run() again instead of just JPEGFactory.createFromStream(...).
    --> There it is again: an OutOfMemory error after a few thousand iterations, and the Windows Task Manager shows memory usage for javaw.exe of over 1,5 GB after just a few seconds:

[Screenshot: Windows Task Manager showing the memory usage of javaw.exe]

Test-code:

import java.io.ByteArrayOutputStream;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;

public class OpenHtmlToPdfOutOfMemoryTest2
{
    public static void main( String... args ) throws Exception
    {
        String jpeg =
                    "/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIs" +
                    "IxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy" + 
                    "MjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAABAAEDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAA" + 
                    "AAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAk" + 
                    "M2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKT" + 
                    "lJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QA" +
                    "HwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdh" + 
                    "cRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hp" + 
                    "anN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk" + 
                    "5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigD//2Q==";
        
        String html = 
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"de\" lang=\"de\">" +
                    "<body>" +
                        "<img src=\"data:image/jpeg;base64," + jpeg + "\" />" +
                    "</body>" +
                "</html>";

        for ( int i = 0; i < 10000000; i++ )
        {
            System.out.println( i );
            
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.withHtmlContent( html, null );
            builder.toStream( new ByteArrayOutputStream() );
            
            builder.run();
        }
    }
}

(The code above still uses your JPEG data, because mine is too large to post.)

It seems that this is a system / OS dependent problem, maybe with ImageIO native code of some sort. Unfortunately, I'm no expert on how the JVM manages memory. The heap is OK during the test, but the memory usage shown in the Windows Task Manager is just "exploding". I know that those values are "virtual memory usage" and not directly related to JVM heap usage, but the accumulating memory usage and the OutOfMemory error at the end clearly indicate a problem.
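
To compare the JVM's own view with the Task Manager figure, the stress loop could print a snapshot like the one sketched below every few hundred iterations. The class name MemorySnapshot is hypothetical. As far as I know, memory allocated directly by native code (such as a native JPEG decoder) shows up in neither number, which would be consistent with a healthy heap next to a growing process size:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper: reports the memory the JVM itself accounts for.
final class MemorySnapshot {
    private static final long MB = 1024L * 1024L;

    static String describe() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        // Heap: Java objects. Non-heap: JVM-managed areas such as the code cache.
        long heapMb = bean.getHeapMemoryUsage().getUsed() / MB;
        long nonHeapMb = bean.getNonHeapMemoryUsage().getUsed() / MB;
        return "heap=" + heapMb + " MB, non-heap=" + nonHeapMb + " MB";
    }
}
```

For example, adding if (i % 500 == 0) System.out.println(MemorySnapshot.describe()); inside the loop gives a series that can be set against the process size shown by the operating system.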

As a side note, maybe important: I'm still using Java 6 ("1.6.0_45", SUN JDK) due to project restrictions at the moment.

Please let me know if I can be of any help, running some more tests etc.

Embarrassingly, it turns out I wasn't calling dispose! I've added it and done a release 0.0.1-RC8 so you could try your stress test again.

Sorry to butt into this thread but shouldn't reader.dispose() be inside a finally clause?

Sorry to butt into this thread but shouldn't reader.dispose() be inside a finally clause?

Yes, I think this should be the case. Just declare reader outside of the try..catch-block and use the existing finally block where you already close the stream.
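
A sketch of that pattern using only javax.imageio, with the reader declared before the try block so that the finally block can dispose of it even when reading throws. The class name SafeImageRead is hypothetical, and this is not the actual PdfBoxImage code:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical helper: decodes an image and always releases the reader.
final class SafeImageRead {
    static BufferedImage read(byte[] imageBytes) throws IOException {
        ImageInputStream iis = null;
        ImageReader reader = null; // declared outside try so finally can see it
        try {
            iis = new MemoryCacheImageInputStream(new ByteArrayInputStream(imageBytes));
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                throw new IOException("No ImageReader found for this data");
            }
            reader = readers.next();
            reader.setInput(iis);
            return reader.read(0);
        } finally {
            // Runs on success and on failure alike.
            if (reader != null) {
                reader.dispose();
            }
            if (iis != null) {
                iis.close();
            }
        }
    }
}
```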

I've added it and done a release 0.0.1-RC8 so you could try your stress test again.

Thanks a lot, I'll try it tomorrow and report back.

Hey Daniel,

I've added it and done a release 0.0.1-RC8 so you could try your stress test again.

I tested it again and it works - no more memory leak, I can do thousands of iterations, the heap and Windows Task Manager both stay below ~ 60 MB memory usage.

I've also done a re-check by commenting out the line reader.dispose(); in PdfBoxImage and the OutOfMemory error occurs again.

As @MartyMcMartface mentioned, you should do the disposal inside of a finally-block and everything should be fine.

Thanks for your support!