Creating PDF/A-3 document raises NPE
ivanbogicevickg opened this issue · 10 comments
If I try to generate a PDF/A-3 document from HTML, I get the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.pdfbox.cos.COSArray.add(COSArray.java:62)
at com.openhtmltopdf.pdfboxout.PdfBoxAccessibilityHelper.finishNumberTree(PdfBoxAccessibilityHelper.java:744)
at com.openhtmltopdf.pdfboxout.PdfBoxFastOutputDevice.finish(PdfBoxFastOutputDevice.java:875)
at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.writePDFFast(PdfBoxRenderer.java:661)
at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPdfFast(PdfBoxRenderer.java:550)
at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:468)
at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDFWithoutClosing(PdfBoxRenderer.java:395)
at com.dm.reviscan.emails.EmailToPDF.main(EmailToPDF.java:90)
This is the code snippet I'm using to generate the PDF:
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.usePDDocument(pdfDoc);
builder.withW3cDocument(new W3CDom().fromJsoup(htmlDoc), outFile.toURI().toURL().toString());
builder.useFastMode();
builder.useDefaultPageSize(210, 297, BaseRendererBuilder.PageSizeUnits.MM);
builder.useHttpStreamImplementation(new OkHttpStreamFactory());
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
builder.usePdfVersion(1.5f);
builder.usePdfUaAccessbility(false);
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "ArialMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Arial-BoldMT");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Roman");
builder.useFont(new File("c:\\Windows\\Fonts\\DejaVuSans.ttf"), "Times-Bold");
try (InputStream colorProfile = EmailToPDF.class.getResourceAsStream("/sRGB.icc")) {
    byte[] colorProfileBytes = IOUtils.toByteArray(colorProfile);
    builder.useColorProfile(colorProfileBytes);
}
If I comment out builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
the document is generated, but that is not what I want.
Hi @ivanbogicevickg,
It seems that it is trying to create a structure element without a parent. This is a bug, but it is very hard to track down without sample HTML (hopefully minimal).
The cause should be something that is not used in the sample document. I hope that helps to narrow down the offending tag.
Hi @danfickle,
Here is the smallest HTML sample:
email-attach-1.txt
This only happens when I turn on PDF/A conformance; otherwise the PDF is generated without a problem.
Hi @ivanbogicevickg,
I couldn't replicate this, even going back to the version 1 release. Here is the code I used:
public static void main(String... args) throws Exception {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    File inFile = new File("/Users/me/Documents/pdf-issues/issue-401.htm");
    org.jsoup.nodes.Document doc = Jsoup.parse(inFile, "UTF-8");
    builder.withW3cDocument(new W3CDom().fromJsoup(doc), inFile.toURI().toURL().toString());
    // DON'T DO THIS (not closing stream): Throw-away code ahead!
    builder.toStream(new FileOutputStream("/Users/me/Documents/pdf-issues/output/issue-401.pdf"));
    builder.useFastMode();
    builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
    builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
    builder.usePdfVersion(1.5f);
    builder.useFont(new File("/Users/me/Documents/pdf-issues/fonts/JustAnotherHand.ttf"), "default");
    builder.run();
}
and the only change I made to the HTML was to add style="font-family: 'default';" to the body element. Are you perhaps using a stylesheet that is changing things?
Hi @danfickle
I can reproduce the error with this piece of HTML, which is derived from our production server. It is generated by Domino and spiced up on our side with some styles:
<html>
<head>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
<style>
* {
font-family: 'Liberation Sans';
}
</style>
</head>
<body>
<ul><b>To:</b></ul>
</body>
</html>
Noto Fonts: https://www.google.com/get/noto/
builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(htmlContent)), "");
[...]
builder.useFont(() -> PdfRenderer.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
"Liberation Sans");
Problem: the ul tag comes without nested li tags. The NPE occurs when this (IMHO invalid) HTML is combined with any registered font. I have no clue what is wrong with this combination.
Is this fixable?
Thanks and Greetings
Workaround: turn off fast mode
//builder.useFastMode();
With fast mode turned off, the PDF is generated correctly.
Hi @swarl ,
Thanks for the reproducible example. However, I'm not sure what to do with this one. As you suggest, the HTML is incorrect, and a pretty explicit log message is provided:
com.openhtmltopdf.general WARNING:: Trying to add incompatible child to parent item: child type=GenericStructualElement, parent type=ListStructualElement, expected child type=ListItemStructualElement. Document will not be PDF/UA compliant.
The question in my mind is: given the resources working on this project (i.e. not much), is it reasonable to try to produce a document given any number of combinations of invalid input?
In this case, we could produce a visually valid PDF, but it would not be valid PDF/A-3a, as it would not have the structure tree required by screen readers and the standard.
P.S. 1: Adding the font causes the NPE to trigger because, without the font, no text can be output. Remember, the PDF/A standards disable the built-in fonts.
P.S. 2: Turning off fast mode means the document will not be PDF/A compliant, as PDF/A support is only implemented in the newer fast renderer.
Hi @danfickle
One option would be to remove the faulty tag but keep its content, so that I would end up with
<body>
<b>To:</b>
</body>
In my eyes, that would be a better solution than not rendering anything.
One question about turning off fast mode: why does Acrobat Reader still tell me that it's a PDF/A when it is not?
Thanks for your work. I appreciate it very much.
Joe
Another "solution" could be to change the parent tag into <div...>, which can take any content.
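Both ideas (dropping the faulty list tag but keeping its content, or turning it into a div) could be tried on the caller's side before the document reaches the renderer. The following is only a sketch using the JDK's org.w3c.dom API; the class ItemlessListRepair and its Mode enum are hypothetical names and not part of openhtmltopdf. It assumes that only ul/ol elements without any li child should be touched, as in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical caller-side helper; not part of openhtmltopdf. */
final class ItemlessListRepair {

    enum Mode { UNWRAP, CONVERT_TO_DIV }

    /** Repairs every ul/ol that has children but no li child; returns how many were changed. */
    static int repair(Document doc, Mode mode) {
        List<Element> lists = new ArrayList<>();
        collectItemlessLists(doc, lists);
        for (Element list : lists) {
            if (mode == Mode.UNWRAP) {
                unwrap(list);
            } else {
                convertToDiv(list);
            }
        }
        return lists.size();
    }

    // Collect first and mutate afterwards, so the tree is not changed while it is walked.
    private static void collectItemlessLists(Node node, List<Element> out) {
        if (isList(node) && node.hasChildNodes() && !hasListItem(node)) {
            out.add((Element) node);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectItemlessLists(c, out);
        }
    }

    /** Moves the children in front of the list, then removes the now empty list. */
    private static void unwrap(Element list) {
        Node parent = list.getParentNode();
        while (list.getFirstChild() != null) {
            parent.insertBefore(list.getFirstChild(), list);
        }
        parent.removeChild(list);
    }

    /** Replaces the list with a div that carries the same attributes and children. */
    private static void convertToDiv(Element list) {
        Element div = list.getOwnerDocument().createElementNS(list.getNamespaceURI(), "div");
        NamedNodeMap attrs = list.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            div.setAttribute(a.getName(), a.getValue());
        }
        while (list.getFirstChild() != null) {
            div.appendChild(list.getFirstChild());
        }
        list.getParentNode().replaceChild(div, list);
    }

    private static boolean hasListItem(Node list) {
        for (Node c = list.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isNamed(c, "li")) {
                return true;
            }
        }
        return false;
    }

    private static boolean isList(Node node) {
        return isNamed(node, "ul") || isNamed(node, "ol");
    }

    private static boolean isNamed(Node node, String name) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        // getLocalName() is null for nodes created without namespace support.
        String local = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
        return local.toLowerCase(Locale.ROOT).equals(name);
    }
}
```

With the builder code from this thread, the helper would be called on the result of new W3CDom().fromJsoup(...) before passing that document to builder.withW3cDocument(...). Whether the resulting PDF then validates as PDF/A is not something this sketch can guarantee.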
Hi @danfickle
For me, logging a message when the logic will later throw an NPE is not really transparent behavior.
What about a straightforward solution:
@Override
void addChild(AbstractTreeItem child) {
    if (child instanceof ListItemStructualElement) {
        listItems.add((ListItemStructualElement) child);
    } else {
        ListItemStructualElement listItemStructualElement = new ListItemStructualElement();
        listItemStructualElement.addChild(child);
        listItems.add(listItemStructualElement);
        logIncompatibleChild(this, child, ListItemStructualElement.class);
    }
}
Greetings
Joe
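Until something like this exists in the library, a similar repair could be attempted on the input DOM instead of the structure tree. This is a rough sketch, not the project's implementation: StrayListChildWrapper is a hypothetical helper built on the JDK's org.w3c.dom API, and it assumes that wrapping each stray child of a ul/ol in its own li is acceptable for the document at hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical caller-side helper; not part of openhtmltopdf. */
final class StrayListChildWrapper {

    /**
     * Wraps every direct child of a ul/ol that is not an li in a new li.
     * Returns the number of li wrappers that were inserted.
     */
    static int wrapStrayListChildren(Node node) {
        // Snapshot the children first, because the loop below mutates the tree.
        List<Node> children = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        boolean isList = isNamed(node, "ul") || isNamed(node, "ol");
        int wrapped = 0;
        for (Node child : children) {
            if (isList && needsWrapper(child)) {
                Element li = node.getOwnerDocument().createElementNS(node.getNamespaceURI(), "li");
                node.replaceChild(li, child); // put the li where the stray child was
                li.appendChild(child);        // then move the stray child into it
                wrapped++;
            }
            wrapped += wrapStrayListChildren(child);
        }
        return wrapped;
    }

    private static boolean needsWrapper(Node child) {
        if (child.getNodeType() == Node.ELEMENT_NODE) {
            return !isNamed(child, "li");
        }
        if (child.getNodeType() == Node.TEXT_NODE) {
            // Whitespace between tags is not content and needs no list item.
            return !child.getNodeValue().trim().isEmpty();
        }
        return false; // comments and similar nodes are left alone
    }

    private static boolean isNamed(Node node, String name) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        // getLocalName() is null for nodes created without namespace support.
        String local = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
        return local.toLowerCase(Locale.ROOT).equals(name);
    }
}
```

Calling StrayListChildWrapper.wrapStrayListChildren(w3cDoc) on the document returned by new W3CDom().fromJsoup(...) would turn <ul><b>To:</b></ul> into <ul><li><b>To:</b></li></ul> before rendering. Note that adjacent stray nodes each become a separate list item, which may or may not be the intended reading of the broken markup.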
OR: you don't care about broken HTML and just throw a meaningful exception.
You could of course try to add some advice. I just tried some options to repair the broken HTML and found a (barely) maintained project on GitHub: https://github.com/jtidy/jtidy
<dependency>
    <groupId>com.github.jtidy</groupId>
    <artifactId>jtidy</artifactId>
    <version>1.0.2</version>
</dependency>
try (InputStream targetStream = new ByteArrayInputStream(htmlContent.getBytes(StandardCharsets.UTF_8));
     ByteArrayOutputStream destinationStream = new ByteArrayOutputStream()) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_2_A);
    builder.useFont(() -> AppTest.class.getClassLoader().getResourceAsStream("org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"),
            "Liberation Sans");
    // Let JTidy repair the markup before it is handed to the renderer.
    Tidy tidy = new Tidy();
    tidy.setDropProprietaryTags(false);
    tidy.setInputEncoding(StandardCharsets.UTF_8.name());
    tidy.parse(targetStream, destinationStream);
    String cleanedHtml = destinationStream.toString(StandardCharsets.UTF_8.name());
    builder.withW3cDocument(new W3CDom().fromJsoup(Jsoup.parse(cleanedHtml)), "");
    [...]
}
Adding this to the documentation and referencing it in the exception message would be a valid solution for me.
So, finally, I see four options:
- Throw a meaningful exception stating that the HTML is broken
- Option 1, plus some advice in the documentation on how to fix it
- Fix the broken HTML by trying to add the missing parent tag manually (#401 (comment))
- Option 3, with a switch to enable this behavior if desired
Happy Easter :-)
Joe
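As an illustration of the first two options, a pre-flight check on the caller's side might look like the sketch below. ListStructureCheck is a hypothetical helper using only the JDK's org.w3c.dom API, and the wording of the exception message is an assumption, not something openhtmltopdf emits.

```java
import java.util.Locale;
import org.w3c.dom.Node;

/** Hypothetical pre-flight check; not part of openhtmltopdf. */
final class ListStructureCheck {

    /**
     * Walks the tree and throws a descriptive exception for the first ul/ol
     * that has a direct child other than li (or non-blank text).
     */
    static void requireValidLists(Node node) {
        boolean isList = isNamed(node, "ul") || isNamed(node, "ol");
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isList) {
                String offender = describeOffender(c);
                if (offender != null) {
                    throw new IllegalArgumentException(
                            "Invalid list markup: <" + name(node) + "> has " + offender
                            + " as a direct child, but only <li> is allowed there."
                            + " A tagged PDF needs a valid list structure;"
                            + " repair the HTML before rendering.");
                }
            }
            requireValidLists(c);
        }
    }

    /** Returns a short description of an invalid list child, or null if the child is fine. */
    private static String describeOffender(Node child) {
        if (child.getNodeType() == Node.ELEMENT_NODE && !isNamed(child, "li")) {
            return "<" + name(child) + ">";
        }
        if (child.getNodeType() == Node.TEXT_NODE && !child.getNodeValue().trim().isEmpty()) {
            return "text";
        }
        return null;
    }

    private static boolean isNamed(Node node, String expected) {
        return node.getNodeType() == Node.ELEMENT_NODE && name(node).equals(expected);
    }

    private static String name(Node node) {
        // getLocalName() is null for nodes created without namespace support.
        String local = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
        return local.toLowerCase(Locale.ROOT);
    }
}
```

Run against the W3C document built from the failing sample, this would report the <ul> with its <b> child up front instead of letting the render fail later with an NPE.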