danfickle/openhtmltopdf

Creating PDF/A-3 document raises NPE

ivanbogicevickg opened this issue · 10 comments

If I try to generate PDF/A-3 document from html I get the following exception:

Exception in thread "main" java.lang.NullPointerException
	at org.apache.pdfbox.cos.COSArray.add(COSArray.java:62)
	at com.openhtmltopdf.pdfboxout.PdfBoxAccessibilityHelper.finishNumberTree(PdfBoxAccessibilityHelper.java:744)
	at com.openhtmltopdf.pdfboxout.PdfBoxFastOutputDevice.finish(PdfBoxFastOutputDevice.java:875)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.writePDFFast(PdfBoxRenderer.java:661)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPdfFast(PdfBoxRenderer.java:550)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:468)
	at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDFWithoutClosing(PdfBoxRenderer.java:395)
	at com.dm.reviscan.emails.EmailToPDF.main(EmailToPDF.java:90)

This is the code snippet I'm using to generate the PDF:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.usePDDocument(pdfDoc);
builder.withW3cDocument(new W3CDom().fromJsoup(htmlDoc), outFile.toURI().toURL().toString());
builder.useFastMode();
builder.useDefaultPageSize(210, 297, BaseRendererBuilder.PageSizeUnits.MM);
builder.useHttpStreamImplementation(new OkHttpStreamFactory());
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
builder.usePdfVersion(1.5f);
builder.usePdfUaAccessbility(false);
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "ArialMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Arial-BoldMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Roman");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Bold");
try (InputStream colorProfile = EmailToPDF.class.getResourceAsStream("/sRGB.icc")) {
  byte[] colorProfileBytes = IOUtils.toByteArray(colorProfile);
  builder.useColorProfile(colorProfileBytes);
}

If I comment out builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A); the document is generated, but that is not what I want.

Hi @ivanbogicevickg,

It seems that it is trying to create a structure element without a parent. This is a bug, but is very hard to track down without sample HTML (hopefully minimal).

It should be something not used in the sample document. I hope that helps to narrow down the offending tag.

Hi @danfickle,

Here is the smallest HTML sample.
email-attach-1.txt
This only happens when I turn on PDF/A conformance; otherwise the PDF is generated without a problem.

Hi @ivanbogicevickg,

I couldn't replicate this, even going back to the version 1 release. Here is the code I used:

    public static void main(String... args) throws Exception {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        File inFile = new File("/Users/me/Documents/pdf-issues/issue-401.htm");
        org.jsoup.nodes.Document doc = Jsoup.parse(inFile, "UTF-8");
        builder.withW3cDocument(new W3CDom().fromJsoup(doc), inFile.toURI().toURL().toString());
        // DON'T DO THIS (not closing stream): Throw-away code ahead!
        builder.toStream(new FileOutputStream("/Users/me/Documents/pdf-issues/output/issue-401.pdf"));
        builder.useFastMode();
        builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
        builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
        builder.usePdfVersion(1.5f);
        builder.useFont(new File("/Users/me/Documents/pdf-issues/fonts/JustAnotherHand.ttf"), "default");
        builder.run();
    }

and the only change I made to the HTML was to add: style="font-family: 'default';" to the body element. Are you perhaps using a stylesheet that is changing things?

swarl commented

Hi @danfickle

I can reproduce the error with this piece of HTML, which is derived from our production server. It is generated by Domino and spiced up on our side with some styles:

<html>
  <head>
    <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
    <style>
      * {
        font-family: 'Liberation Sans';
      }
    </style>
  </head>
  <body>
    <ul><b>To:</b></ul>
  </body>
</html>

Noto Fonts: https://www.google.com/get/noto/

builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(htmlContent)), "");
[...]
builder.useFont(() -> PdfRenderer.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
            "Liberation Sans");

Problem: the ul tag comes without nested li tags. The NPE occurs with the combination of any registered font and this (IMHO invalid) HTML. I have no clue what's wrong with this combination.

Is this fixable?

Thanks and Greetings

swarl commented

Workaround: turn off fast mode

//builder.useFastMode();

With this change, the PDF is generated correctly.

Hi @swarl ,

Thanks for the reproducible example, however, I'm not sure what to do with this one. As you suggest, the HTML is incorrect and a pretty explicit log message is provided:

com.openhtmltopdf.general WARNING:: Trying to add incompatible child to parent item: child type=GenericStructualElement, parent type=ListStructualElement, expected child type=ListItemStructualElement. Document will not be PDF/UA compliant.

The question in my mind is: given the resources working on this project (i.e. not much), is it reasonable to try to produce a document for any number of combinations of invalid input?

In this case, we could produce a visually valid PDF, but it would not be valid PDF/A-3a, as it would not have the structure tree required by screen readers and the standard.

P.S-1 Adding the font causes the NPE to trigger as without the font, no text can be output. Remember, the PDF/A standards disable the built-in fonts.

P.S-2 Turning off fast mode means the document will not be PDF/A compliant as it is only implemented in the newer fast renderer.

swarl commented

Hi @danfickle

One option would be to remove the faulty tag, but not its content so that I would end up with

<body>
    <b>To:</b>
</body>

In my eyes that would be a better solution than not rendering anything.
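
Until something like this exists in the library, the same idea could be applied on the caller's side as a pre-pass over the W3C document. The following is only a minimal sketch using the JDK's org.w3c.dom API. ListUnwrapper and its methods are hypothetical names, not part of openhtmltopdf, and the sketch only handles lists that contain no li child at all (a list mixing li and other children is left untouched).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of openhtmltopdf.
final class ListUnwrapper {

    // True if the list element has at least one direct <li> child.
    static boolean hasListItem(Element list) {
        for (Node n = list.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "li".equalsIgnoreCase(n.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Replaces each <ul>/<ol> that has no <li> child with its own children.
    // Returns the number of lists that were unwrapped.
    static int unwrapListsWithoutItems(Document doc) {
        // Collect first: getElementsByTagName returns a live list.
        List<Element> offenders = new ArrayList<>();
        for (String tag : new String[] {"ul", "ol"}) {
            NodeList lists = doc.getElementsByTagName(tag);
            for (int i = 0; i < lists.getLength(); i++) {
                Element list = (Element) lists.item(i);
                if (!hasListItem(list)) {
                    offenders.add(list);
                }
            }
        }
        for (Element list : offenders) {
            Node parent = list.getParentNode();
            while (list.getFirstChild() != null) {
                // insertBefore moves the child out of the list element.
                parent.insertBefore(list.getFirstChild(), list);
            }
            parent.removeChild(list);
        }
        return offenders.size();
    }

    // Demo-only parser for well-formed markup.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><ul><b>To:</b></ul></body></html>");
        System.out.println("unwrapped lists: " + unwrapListsWithoutItems(doc));
        Node body = doc.getElementsByTagName("body").item(0);
        System.out.println("first child of body: " + body.getFirstChild().getNodeName());
    }
}
```

In the snippets earlier in this thread, such a pre-pass would run on the result of new W3CDom().fromJsoup(...) before it is handed to builder.withW3cDocument(...). Whether the resulting structure tree then satisfies PDF/A validation is not something this sketch checks.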

One question about turning off fast mode: why does Acrobat Reader still tell me that it's a PDF/A when it is not?

Thanks for your work. I appreciate it very much.
Joe

swarl commented

Another "solution" could be to change the parent tag into <div...>, which can take whatever content...
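
A rough caller-side sketch of that variant, again using only the JDK's org.w3c.dom API: each ul/ol without any li child is swapped for a div that keeps the original attributes and children. ListToDiv and its methods are hypothetical names, not openhtmltopdf API. Note the assumption that losing the list's default styling (indentation, markers) is acceptable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of openhtmltopdf.
final class ListToDiv {

    // True if the list element has at least one direct <li> child.
    static boolean hasListItem(Element list) {
        for (Node n = list.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "li".equalsIgnoreCase(n.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Swaps one element for a <div> carrying the same attributes and children.
    static Element replaceWithDiv(Element list) {
        Document doc = list.getOwnerDocument();
        Element div = doc.createElementNS(list.getNamespaceURI(), "div");
        NamedNodeMap attrs = list.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            // Clone so the attribute is not owned by two elements at once.
            div.setAttributeNode((Attr) attrs.item(i).cloneNode(true));
        }
        while (list.getFirstChild() != null) {
            div.appendChild(list.getFirstChild()); // moves the child
        }
        list.getParentNode().replaceChild(div, list);
        return div;
    }

    // Converts every <ul>/<ol> without an <li> child; returns how many were converted.
    static int convertListsWithoutItems(Document doc) {
        List<Element> offenders = new ArrayList<>();
        for (String tag : new String[] {"ul", "ol"}) {
            NodeList lists = doc.getElementsByTagName(tag);
            for (int i = 0; i < lists.getLength(); i++) {
                Element list = (Element) lists.item(i);
                if (!hasListItem(list)) {
                    offenders.add(list);
                }
            }
        }
        for (Element list : offenders) {
            replaceWithDiv(list);
        }
        return offenders.size();
    }

    // Demo-only parser for well-formed markup.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><ul class=\"to\"><b>To:</b></ul></body></html>");
        System.out.println("converted lists: " + convertListsWithoutItems(doc));
    }
}
```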

swarl commented

Hi @danfickle
For me, logging a message when the logic will later throw an NPE is not really transparent behavior.

What about a straightforward solution:

        @Override
        void addChild(AbstractTreeItem child) {
            if (child instanceof ListItemStructualElement) {
                listItems.add((ListItemStructualElement) child);
            } else {
                ListItemStructualElement listItemStructualElement = new ListItemStructualElement();
                listItemStructualElement.addChild(child);
                listItems.add(listItemStructualElement);
                logIncompatibleChild(this, child, ListItemStructualElement.class);
            }
        }

Greetings
Joe

swarl commented

OR: you don't care about broken HTML and just throw a meaningful exception.
You could of course add some advice. I tried a few options for repairing the broken HTML and found a barely maintained project on GitHub: https://github.com/jtidy/jtidy

        <dependency>
            <groupId>com.github.jtidy</groupId>
            <artifactId>jtidy</artifactId>
            <version>1.0.2</version>
        </dependency>

        try (InputStream targetStream = new ByteArrayInputStream(htmlContent.getBytes(StandardCharsets.UTF_8));
             ByteArrayOutputStream destinationStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_2_A);
            builder.useFont(() -> AppTest.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
                    "Liberation Sans");

            Tidy tidy = new Tidy();
            tidy.setDropProprietaryTags(false);
            tidy.setInputEncoding(StandardCharsets.UTF_8.name());

            tidy.parse(targetStream, destinationStream);
            String cleanedHtml = destinationStream.toString(StandardCharsets.UTF_8.name());

            builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(cleanedHtml)), "");
            // ... set the output stream and run the builder here
        }

Adding this to the documentation and referencing it in the exception's message would be a valid solution to me.

So, finally, I would have four options:

  1. Throw a meaningful exception, that HTML is broken
  2. Option 1 with some advice how to fix in documentation
  3. Fix the broken HTML by trying to add the missing parent tag manually (#401 (comment))
  4. Option 3 with a switch to enable this behavior if wished
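
To illustrate what option 1 might look like, here is a minimal caller-side sketch that checks the W3C document before rendering and fails with a descriptive message instead of an NPE. ListStructureCheck and its methods are hypothetical names, not openhtmltopdf API, and the rule is simplified: any element other than li, or any non-whitespace text, directly inside a ul/ol is reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of openhtmltopdf.
final class ListStructureCheck {

    // Describes every direct child of a <ul>/<ol> that is not an <li>.
    static List<String> findInvalidListChildren(Document doc) {
        List<String> problems = new ArrayList<>();
        for (String tag : new String[] {"ul", "ol"}) {
            NodeList lists = doc.getElementsByTagName(tag);
            for (int i = 0; i < lists.getLength(); i++) {
                Node list = lists.item(i);
                for (Node c = list.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c.getNodeType() == Node.ELEMENT_NODE
                            && !"li".equalsIgnoreCase(c.getNodeName())) {
                        problems.add("<" + c.getNodeName() + "> directly inside <" + tag + ">");
                    } else if (c.getNodeType() == Node.TEXT_NODE
                            && !c.getNodeValue().trim().isEmpty()) {
                        problems.add("text directly inside <" + tag + ">");
                    }
                }
            }
        }
        return problems;
    }

    // Throws a descriptive exception if any list has a non-<li> child.
    static void requireValidLists(Document doc) {
        List<String> problems = findInvalidListChildren(doc);
        if (!problems.isEmpty()) {
            throw new IllegalArgumentException(
                    "Invalid HTML for tagged PDF output, list children must be <li>: "
                            + String.join("; ", problems));
        }
    }

    // Demo-only parser for well-formed markup.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><ul><b>To:</b></ul></body></html>");
        try {
            requireValidLists(doc);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

For option 2, the exception message is the natural place to point to documentation on repairing the input.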

Happy Easter :-)
Joe