danfickle/openhtmltopdf

Large HTML File conversion to PDF hangs.

rajaningle opened this issue · 25 comments

Hi,

I am trying to convert a large HTML file (approximately 600 pages) to PDF, but the conversion never completes; it hangs.

Here is what I observed while debugging the core.
The PdfRendererBuilder class makes the following method calls:

  1. renderer.layout(); // This takes significant time but completes.
  2. renderer.createPDF(); // This never completes and hangs the process.

When I looked into it, renderer.createPDF() builds the entire PDF document in memory and only starts writing to the OutputStream after it completes.

Can we write it directly to the OutputStream page by page? I think this might solve the problem.

Here is my code snippet; please check whether I am doing anything wrong.

public void exportToPdf(List<Map<String, Object>> data, String template, Map<String, Object> exportData,
            Configuration cfg, String pdfURL) throws Exception   {
        File exportedPdfFile = null;
        File exportedPdfFileTemp = null;
        FileChannel pdfFileSrcIS = null;
        FileChannel pdfFileDestOS = null;
        FileOutputStream tmpFileOS = null;
        BufferedOutputStream tmpFileBOS = null;
        FileInputStream pdfSrcIS = null;
        FileOutputStream exportedPdfFileOs = null;
        try {
            // Create New File
            exportedPdfFileTemp = new File(pdfURL + "_" + TEMP);
            LOGGER.info("### Temp PDF File Name After Creation :::" + pdfURL + "_" + TEMP);
            
            tmpFileOS = new FileOutputStream(exportedPdfFileTemp);
            tmpFileBOS = new BufferedOutputStream(tmpFileOS);
            // Create Builder
            PdfRendererBuilder builder = new PdfRendererBuilder();
            addFonts(builder);
            // Generate HTML Template String
            String htmlTemplateString = generateHtmlFromTemplate(data, template, exportData, cfg);
            // Generate Doc from the HTML string
            Document doc = html5ParseDocument(htmlTemplateString, PDF_GENERATION_TIMEOUT);// builder.withUri(url);
            builder.withW3cDocument(doc, null);
            // Write the PDF to file
            builder.toStream(tmpFileBOS);
            builder.run();

            LOGGER.info("::: PDF Generation Successful with :::");

                exportedPdfFile = new File(pdfURL);
                
                if (exportedPdfFileTemp.renameTo(exportedPdfFile)) {
                    LOGGER.info("### Temp FILE Renamed To ::: " + pdfURL );
                } else {
                    LOGGER.info("### Temp FILE Rename Failed Creating New File ::: " + pdfURL);
                    pdfSrcIS = new FileInputStream(exportedPdfFileTemp);
                    pdfFileSrcIS = pdfSrcIS.getChannel();
                    exportedPdfFileOs = new FileOutputStream(exportedPdfFile);  
                    pdfFileDestOS = exportedPdfFileOs.getChannel();
                    LOGGER.info("### Starting Copy Operation ::: " + pdfURL);
                    pdfFileDestOS.transferFrom(pdfFileSrcIS, 0, pdfFileSrcIS.size());
                    LOGGER.info("### Copy Operation Completed ::: " + pdfURL);
                }

                LOGGER.info("### *** File Created *** ::: " + pdfURL);
                LOGGER.info("### PDF Created successfully!");

        } catch (Exception e) {
            LOGGER.error("Error generating PDF :" + e.getMessage(), e);
            throw e;
        } finally {
            // Runs on both success and failure, so do not report success here
            LOGGER.info("### Closing streams and cleaning up for ::: " + pdfURL);
            // Close all streams

            if (exportedPdfFileOs != null) {
                org.apache.commons.io.IOUtils.closeQuietly(exportedPdfFileOs);
            }
            if (pdfSrcIS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(pdfSrcIS);
            }
            
            if (tmpFileBOS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(tmpFileBOS);
            }
            if (tmpFileOS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(tmpFileOS);
            }
            // Clean after creation
            try {
                if (exportedPdfFileTemp.isFile()) {
                    if (exportedPdfFileTemp.delete()) {
                        LOGGER.info("### Temp PDF File :::" + exportedPdfFileTemp.getName() + " is deleted!");
                    } else {
                        LOGGER.error("Temp File Delete operation is failed.");
                    }
                }

            } catch (Exception deleteException) {
                LOGGER.error("Error Deleting Temp File!  Name :::" + pdfURL + deleteException.getMessage());
            }
        }
    }

In the above code snippet, builder.run() never completes and the process hangs.

Please help me with the solution.

Thanks in advance.
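On the "am I doing anything wrong" part of the question: one thing that stands out in the snippet is that the temp file is renamed while its output streams are still open. A minimal sketch of the same temp-file-then-move pattern using only java.nio is shown here. The class `TempFileExport`, the `StreamWriter` callback and the `demo()` helper are made up for illustration; the callback merely stands in for the builder calls and is not part of the library. This does not address the hang itself.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileExport {
    /** Hypothetical callback standing in for the builder.toStream(...) / builder.run() pair. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Writes to "<target>_TEMP" first, then moves the finished file over the target. */
    static void export(Path target, StreamWriter writer) throws IOException {
        Path temp = target.resolveSibling(target.getFileName() + "_TEMP");
        try {
            // try-with-resources flushes and closes the stream before the move
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(temp))) {
                writer.writeTo(out);
            }
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; removes the partial file on failure
            Files.deleteIfExists(temp);
        }
    }

    /** Small self-check: exports five bytes into a scratch directory and reads them back. */
    static String demo() {
        try {
            Path target = Files.createTempDirectory("export-demo").resolve("out.pdf");
            export(target, out -> out.write("hello".getBytes(StandardCharsets.UTF_8)));
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real use the lambda passed to `export` would wrap the PDF generation, so that the stream is always closed before the move and the temp file is removed if generation fails.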

Sounds silly to ask, but how much memory are you allocating to your JVM? Try setting a higher limit with -Xmx

When there is not enough RAM, the generator will hang while eating all of your CPU time.

How long does it hang for? Could it be that it is hitting disk to use the swap space? As @dilworks asks, how much memory are you allocating to Java and how much physical memory is available to the machine?

Hanging is obviously unacceptable, so I'm keen to get to the bottom of this one. I'll also investigate the memory/disk options of PDF-BOX (currently it is constructed completely in memory) and reply here.
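To answer the heap question with certainty, it can help to print the limits the running JVM actually applies, since a mistyped or ignored option is easy to miss. This is a small sketch using only `java.lang.Runtime`; the class name `HeapInfo` is made up for illustration.

```java
/** Prints the heap limits the running JVM actually applies (e.g. after -Xmx). */
class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Heap currently in use (total minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap MiB:  " + maxHeapMiB());
        System.out.println("used heap MiB: " + usedHeapMiB());
    }
}
```

Calling `HeapInfo.maxHeapMiB()` from inside the application that generates the PDF shows whether the configured maximum really took effect there.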

Hi Thanks for reply @dilworks and @danfickle

We have hosted it on an AWS t2.micro instance, where it never completes (hangs indefinitely). We have set the following options:

Initial JVM heap size: 256m
JVM command line options: blank
Maximum JVM heap size: 256m
Maximum JVM permanent generation size: 64m

On my local machine it hangs for more than 20 minutes and eats all the CPU. About 4GB of physical memory is free locally, and the heap size is 256m.

I will try increasing the heap as @dilworks suggested.

But I feel it will be better to directly construct it on the disk instead of memory which will give better performance.

@danfickle Please investigate and implement the solution. Meanwhile, I am also investigating PDF-BOX options to construct it on disk and will post if I find something useful.

Hi @dilworks, I have tried assigning -xmx 2048, and it did not resolve the problem; it still hangs.

@danfickle It is hitting the disk for swap space. Please check below.

image

Thanks @rajaningle

I added a builder method to pass in your own PDDocument which can be configured in the constructor with a MemoryUsageSetting to control how much memory/disk is used by PDFBOX.

However, with my simple testing of a large document, this didn't fix the problem so I am now profiling with VisualVM to find CPU/Memory hogs. I've already found a major CPU hog as discussed in #170

Thanks for your patience and hopefully we can get this fixed.

Thanks @danfickle
I will check with new fix whether it improves performance in my project.

I was checking PDFBox options and came across the doc.saveIncremental(outStream) method.

Link: https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#saveIncremental(java.io.OutputStream)

Please check if we can use it and whether this method resolves our problem.

Thanks.

Hi, today we hit the same issue as @rajaningle while trying to convert an HTML document of about 400 pages with version 0.0.1-RC12. After reading this issue, we tried the SNAPSHOT version, using MemoryUsageSetting.setupTempFileOnly() when building a PDDocument, and this works fine for us. (We have not done any performance testing yet.)

Are you planning to do a new release?

Thanks!

Hi @danfickle, I tried with MemoryUsageSetting.setupTempFileOnly() and it did not solve the problem; it is still hogging the CPU/memory.

OK, I generated a large (inline only) document with this code:

	private static void createLargeInlineDoc() throws IOException {
		OutputStream os2 = new FileOutputStream("/Users/me/Documents/pdf-issues/issue-180.htm");
		
		PrintWriter pw = new PrintWriter(os2);
		
		pw.println("<html>");
		pw.println("<head>");
		pw.println("</head>");
		pw.println("<body>");
		
		for (int i = 0; i < 100000; i++) {
			pw.println("Normal <strong>Bold</strong> <i>Italic</i>");
		}

		pw.println("</body>");
		pw.println("</html>");
		
		pw.close();
		os2.close();
	}

After fixing the two BIDI performance bugs it is down to 11 seconds on my machine, from a staggering 400 seconds before!

Next up, in improving performance according to the profiler, is this monstrosity (finally one that's not mine), from com.openhtmltopdf.layout.WhitespaceStripper:

    private static String collapseWhitespace(InlineBox iB, IdentValue whitespace, String text, boolean collapseLeading) {
        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            text = linefeed_space_collapse.matcher(text).replaceAll(EOL);
        } else if (whitespace == IdentValue.PRE) {
            text = space_before_linefeed_collapse.matcher(text).replaceAll(EOL);
        }

        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            text = linefeed_to_space.matcher(text).replaceAll(SPACE);
            text = tab_to_space.matcher(text).replaceAll(SPACE);
            text = space_collapse.matcher(text).replaceAll(SPACE);
        } else if (whitespace == IdentValue.PRE || whitespace == IdentValue.PRE_WRAP) {
            int tabSize = (int) iB.getStyle().asFloat(CSSName.TAB_SIZE);
            char[] tabs = new char[tabSize];
            Arrays.fill(tabs, ' ');
            text = tab_to_space.matcher(text).replaceAll(new String(tabs));
        } else if (whitespace == IdentValue.PRE_LINE) {
            text = tab_to_space.matcher(text).replaceAll(SPACE);
            text = space_collapse.matcher(text).replaceAll(SPACE);
        }

        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            // collapse first space against prev inline
            if (text.startsWith(SPACE) &&
                    collapseLeading) {
                text = text.substring(1, text.length());
            }
        }

        return text;
    }

Note that text in normal mode goes through four regular expression replaces and a substring. Unless someone else provides a replacement without regular expressions, I'll work on it tomorrow, and then do the release.
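For anyone who wants to try a regex-free replacement, here is a minimal single-pass sketch for the normal/nowrap case. It is not the library's code: the class name `WhitespaceCollapse` is made up, and it assumes that spaces, tabs, line feeds and carriage returns are all collapsible, which may differ in detail from what the existing patterns match.

```java
/** Single-pass whitespace collapsing, a sketch for white-space: normal / nowrap. */
class WhitespaceCollapse {
    /** True for the characters this sketch treats as collapsible (an assumption). */
    private static boolean isCollapsible(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /**
     * Replaces every run of spaces, tabs and line breaks with one space.
     * If collapseLeading is true, a leading space is dropped as well.
     */
    static String collapse(String text, boolean collapseLeading) {
        StringBuilder out = new StringBuilder(text.length());
        // Pretend a space precedes the text when the leading space must go
        boolean lastWasSpace = collapseLeading;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCollapsible(c)) {
                if (!lastWasSpace) {
                    out.append(' ');
                    lastWasSpace = true;
                }
            } else {
                out.append(c);
                lastWasSpace = false;
            }
        }
        return out.toString();
    }
}
```

The loop touches each character once and allocates a single buffer, instead of running four regular expression passes that each build a new string.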

I've decided to follow your steps and profile everything on my setup... using one of my RAM-eating please-have-mercy testcases: a rather simple table-based report (complete with headers and footers) that easily gets into the thousands of pages (it's a transaction log report for an entire year, and for a mid-sized customer it goes over 5000 pages) - this was the reason why I was forced to fiddle with -Xmx (apparently this flaw was inherited from FS). This report in particular is rather CPU-bound... until it's time to generate the PDF, when my JSF-generated XHTML brings FS/OH down to its knees, now massively eating RAM this time.

What I found was... this:
profile_oh_hotpaths
profile_oh_loggingwhat
profile_oh_xrlog_2
profile_oh_xrlog_jboss

A logging statement on this:


...is causing the JBoss/WildFly logging subsystem to go insane and drain a significant slice of CPU time! Leaving my own code aside, this single logging call ends up eating almost half of the CPU time.

(And if you were wondering: no, I never got my 5000+ page PDF - profiling makes everything go much slower, plus I was testing with some real data that easily ate the 3GB limit I had set)

Thanks @dilworks

The only thing I could think of causing a slow down is the fact that it was logging as SEVERE. Could Wildfly be set up to do something special with SEVERE log messages? Anyway, I have downgraded it to WARNING to be consistent with other CSS warnings.

I also released RC-13, so we'll make the next release focused on performance and memory. Much work is needed to get 5000+ page documents running smoothly!

Loving that couple of fixes - after some quick tests, performance is now on par with FS, and even beats it a few times with the same 18-page test doc I had attached. But then, that's just the beginning

Thank you very much for the improvements @danfickle !

Thanks @danfickle, there is some performance improvement with the current fixes, but it still hangs while generating huge documents of 5000+ pages. I am hoping for a performance improvement there. I have this open defect and need to resolve it ASAP, because all the documents we export are huge and the functionality breaks while generating a huge PDF. Please see if you can find a solution to this hang issue.

@rajaningle This may not solve your problem, but for those 5000+ pages you have many DOM nodes in memory, and therefore need tons of memory alone for your DOM nodes.

=> You could try eXist-db to solve this memory problem. eXist-db allows you to store large amounts of XML in a persistent file. It also allows you to query it very fast using XQuery (this is what I used eXist-db for in another project ten years ago...). And all the nodes also implement org.w3c.dom.

Something like this could work:

If I understand the documentation correctly, XMLResource.getContentAsDOM() gets you the content as org.w3c.dom, which is then lazily loaded from the database, so that only the nodes needed at a given time are in memory.

You could then feed the DOM into the PdfRendererBuilder using withW3cDocument(). I cannot guarantee that this will work correctly and really reduce the memory pressure, but it is at least something you could try.

I've just created a testcase for this problem, see #194. On my MacBook Pro 2014 (16 GB RAM) with JDK 1.7.0_52, it takes 5m 49s to create an HTML file of 18.5 MB and a resulting PDF of 232 MB with 12694 pages.

I.e. I cannot reproduce the problem; it works for me. @rajaningle what JDK are you using on what OS? Please look at the testcase, it only contains text and tables. What other stuff are you using in your report?

Managed to find some time to test one of my huge files - I'm attaching a sample (a couple pages with test data - the real production report only has longer strings and bigger numbers but nothing else) so you guys can check the layout - it's rather simple as I've already said, yet it can grow in size easily since the report is a transaction log - with one of my datasets it generates a ~4300-page PDF. I'm testing with -Xmx3g (my laptop only has 6GB RAM, thankfully our production setups either never have enough data to push things to the limits, or have at least 8GB dedicated for our app).

  • On FS it takes about 6 minutes on this ancient Penryn laptop (Core 2 Duo P8600), but I eventually get my 4300-page document.
  • With OH it takes over 10 minutes... but I get no document - instead it gives up with an OutOfMemoryError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.getHashName(PdfBoxFontResolver.java:365)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:342)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:301)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:70)
	at com.openhtmltopdf.layout.SharedContext.getFont(SharedContext.java:356)
	at com.openhtmltopdf.layout.LayoutContext.getFont(LayoutContext.java:336)
	at com.openhtmltopdf.render.InlineBox.getTextWidth(InlineBox.java:168)
	at com.openhtmltopdf.render.InlineBox.calcMinWidthFromWordLength(InlineBox.java:255)
	at com.openhtmltopdf.render.InlineBox.calcMinMaxWidth(InlineBox.java:378)
	at com.openhtmltopdf.render.BlockBox.calcMinMaxWidthInlineChildren(BlockBox.java:1688)
	at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1562)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.recalcColumn(TableBox.java:1247)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.fullRecalc(TableBox.java:1221)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.calcMinMaxWidth(TableBox.java:1516)
	at com.openhtmltopdf.newtable.TableBox.calcMinMaxWidth(TableBox.java:158)
	at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:221)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
	at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
	at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:985)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:865)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:794)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
	at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
	at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:985)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:865)

Test case here:
mp_test.tar.gz
Just copypaste the <table id=main> a few hundred times and you'll get a realistic test load similar to my case.

...thankfully that's the only heavyweight report on my app (and the least used one, yet... due to $REGULATIONS our customers have to generate it at least a couple times per year): the first run with an empty transaction log easily goes over 100 pages (one per department).

I've integrated parts of your sample in #194 and just did a big freemarker loop around it. Something strange is going on here in the Bidi-splitter stuff. There is one ParagraphSplitter$Paragraph object with 2.2 million entries in the textRuns hash map ... this seems to be the root object? At least it seems strange to me that one Paragraph object can have so many entries ...

@danfickle you should be able to investigate that in my #194 pull request.

Did you disable logging with XRLog.setLoggingEnabled(false)? Logging causes some overhead even if the logger does not write the log messages anywhere, because the messages are generated anyway.
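The general effect (paying for a message that is never written) can be illustrated with plain java.util.logging. This sketch says nothing about how XRLog is implemented internally; the class `LazyLogging` and its helper methods are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

class LazyLogging {
    /** Counts how often a log message string was actually built. */
    static final AtomicInteger messagesBuilt = new AtomicInteger();

    /** Stand-in for a message that is costly to assemble. */
    static String expensiveMessage(int boxCount) {
        messagesBuilt.incrementAndGet();
        return "Laid out " + boxCount + " boxes";
    }

    /** Logs three times with logging switched off and returns the build count. */
    static int demo() {
        Logger log = Logger.getLogger("lazy.logging.demo");
        log.setLevel(Level.OFF);
        messagesBuilt.set(0);

        log.info(expensiveMessage(1));        // eager: string is built, then discarded
        log.info(() -> expensiveMessage(2));  // lazy: supplier is never invoked
        if (log.isLoggable(Level.INFO)) {     // explicit guard: body is skipped
            log.info(expensiveMessage(3));
        }
        return messagesBuilt.get();
    }
}
```

Only the eager call builds its message, which is why switching logging off entirely, or guarding the call, removes the overhead while merely discarding the output does not.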

@rototor

Regarding the bidi splitter, it defines a paragraph as a block element. It should define a paragraph as anything block-like, for example a table cell. I meant to make this trivial fix in RC-13 but somehow forgot.

@dilworks
Will do some more performance work tomorrow based on your sample.

I’ve been thinking about the painting side. The core algorithm is:

For-each page:
    For-each layer:
         For-each top-level box such as line box:
             Output if on this page.

This leads to a method call count of page-count x layer-count x box-count. For an 1800 page document with one layer and 50 something lines per page, that is about 180 million iterations, which I’ve observed in the profiler. This is essentially O(n^2). But we have a sorted list of pages, so we should be able to binary search and get down at least to O(n log n), or about a million iterations (a 200 fold decrease) for the 1800 page document. That would really speed everything up.
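As a rough check of the numbers: 1800 pages at 50 lines each is 90,000 boxes, so the nested loops visit 1800 x 90,000 = 162 million combinations, while looking up the page for each box by binary search costs about 90,000 x log2(1800), roughly 90,000 x 11, or about a million steps. A minimal sketch of such a lookup follows. It assumes each page is represented only by its top y-coordinate in a sorted array; `PageFinder` and `pageIndexFor` are illustrative names, not library API.

```java
class PageFinder {
    /**
     * pageTops holds the top y-coordinate of each page in ascending order.
     * Returns the index of the page containing y, or -1 if y lies above the first page.
     */
    static int pageIndexFor(int[] pageTops, int y) {
        int lo = 0;
        int hi = pageTops.length - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (pageTops[mid] <= y) {
                found = mid;   // candidate page; a later page may still start at or before y
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }
}
```

With this, each box is tested against only the page (or few pages) it can actually appear on, instead of being offered to every page in turn.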

After testing the latest commits with my 4300+ page test data (and ensuring logging is disabled!)... well, I still get no document. Either the entire JVM gets stuck forever (even trying to attach JConsole or a profiler will stall! I have to manually kill the JVM in such cases), or it dies after several minutes with an OutOfMemoryError because it has spent too much time on GC. But this time, the stacktrace now looks different:


java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.openhtmltopdf.render.InlineLayoutBox.getBorderEdge(InlineLayoutBox.java:318)
	at com.openhtmltopdf.render.Box.getPaintingBorderEdge(Box.java:292)
	at com.openhtmltopdf.render.Box.getPaintingClipEdge(Box.java:300)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:793)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.render.Box.calcChildPaintingInfo(Box.java:808)
	at com.openhtmltopdf.render.BlockBox.calcChildPaintingInfo(BlockBox.java:1824)
	at com.openhtmltopdf.render.Box.calcPaintingInfo(Box.java:796)
	at com.openhtmltopdf.layout.Layer.calcPaintingDimension(Layer.java:776)
	at com.openhtmltopdf.layout.Layer.getPaintingDimension(Layer.java:312)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:302)

I'll play with the MemoryUsageSetting stuff as a fallback for those "huge document" cases to see how things improve... although I think that in this case it might be of no help since it seems we're not even yet talking to PDFBox at all...

For those of you building from source, I've just pushed a major commit that reworks the renderer to use a display list and replaces the page-count squared algorithm with just a page count algorithm. In other words the rendering part (not the layout) for a 10000 page document will be thousands of times faster.

In my testing with a simple 1800 page document, total time went from ~100 seconds to slightly under 40 seconds.
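To give a feel for why a display list removes the page-count squared behaviour, here is a simplified sketch and not the renderer's actual code. It assumes all pages share one fixed height and that a box is just a vertical range; `DisplayListSketch`, `Box` and `build` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class DisplayListSketch {
    /** A box occupying the vertical range [top, bottom) in document coordinates. */
    static final class Box {
        final String name;
        final int top;
        final int bottom;

        Box(String name, int top, int bottom) {
            this.name = name;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** One pass over the boxes: each box is appended to every page it overlaps. */
    static List<List<Box>> build(List<Box> boxes, int pageHeight, int pageCount) {
        List<List<Box>> pages = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            pages.add(new ArrayList<>());
        }
        for (Box b : boxes) {
            if (b.bottom <= b.top || b.bottom <= 0) {
                continue; // empty box, or entirely above the first page
            }
            int first = Math.max(0, b.top / pageHeight);
            int last = Math.min(pageCount - 1, (b.bottom - 1) / pageHeight);
            for (int p = first; p <= last; p++) {
                pages.get(p).add(b);
            }
        }
        return pages;
    }
}
```

Painting then walks each page's own list, so the total work grows with the number of boxes plus the number of pages rather than with their product.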

Of course it is still very unstable and buggy, but you could try it out. See code below:

OutputStream os = new FileOutputStream("/Users/me/Documents/pdf-issues/output/mytest-180.pdf");
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withUri("file:///Users/me/Documents/pdf-issues/issue-180.htm");
builder.toStream(os);

PdfBoxRenderer renderer = builder.buildPdfRenderer();
			
renderer.layout();

// Call createPdfFast instead of createPDF.
renderer.createPdfFast(true);
			
os.close();

Decided to give the fast renderer a try. Dozens of commits ago, my test file rendered OK, except that the small box at the top of each page (the one that says "CODIGO PRESUPUESTARIO") was rendering... ugly. No antialiasing, but heavy pixellation there. Today I revisited that part after pulling the most recent commits, and my document now looks as intended :)

Also enabled logging, and noticed a bunch of those:

13:30:54,095 SEVERE [com.openhtmltopdf.render] getClip MUST not be used by the fast renderer. Please consider reporting this bug.

Anyway, keep up the good job! I've already switched to this library on my production releases, and so far no user has reported regressions or badly rendered files. All I'm waiting now is for the fast renderer to mature.

Dear @danfickle, how can I generate one PDF file with 2 pages from 2 HTML templates?

Yeah, RC18 is finally released with a usable fast renderer.

I am now using version 1.0.2, but the PDF build still hangs.
The size of the HTML is 13241929.
I have tried many times and increased the heap size to 4G.
My machine is an i5 4460 with 16G RAM.

Attached is the test HTML:
test.txt

My code for PDF generation is as follows:

    public byte[] generateFromHtml(String html) throws Exception {
        try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFont(getFont(PMingLiU), "PMingLiU");
            builder.useFont(getFont(PMingLiUExtB), "PMingLiU-ExtB");
            builder.useFont(getFont(seguiemj), "Segoe UI Emoji");
            builder.withHtmlContent(html, null);
            builder.useFastMode();
            builder.toStream(byteArrayOutputStream);
            builder.run();
            return byteArrayOutputStream.toByteArray();
        }
    }