danfickle/openhtmltopdf

Using usePdfAConformance resulting in missing fonts and other attributes

mattstjean opened this issue · 8 comments

Before reading this, I only now saw the disclaimer of "Note: This is pre-release documentation. PDF/UA support will be released with RC-18."...So if this isn't supported yet, I'm sorry for the issue. --- Is there a timeline for RC-18? The project I'm working on requires the PDFs be compliant.

Summary
I am having an issue where when I add the line:
builder.usePdfAConformance(PdfAConformance.PDFA_1_A);
it causes my PDF to render blank, which I am assuming is due to the now-missing fonts.

I am also having issues when trying to include the line:
builder.useFastMode();
it causes the PDF to lose the Author attribute (only when used with PdfAConformance).

Let me know if there is anything additional I can provide to get help with this.

Background

  • useFastMode() without usePdfAConformance

    • PDF is not tagged
    • Contains title, author, subject, description, and fonts
    • PDF does not contain language
    • Content is properly displayed
  • usePdfAConformance(PdfAConformance.PDFA_1_A) without useFastMode()

    • PDF is tagged
    • Displays a compliance notice when opened with Adobe Reader
    • Contains title, subject, description
    • PDF does not contain author, language, or fonts
    • Content is not displayed
  • useFastMode() and usePdfAConformance(PdfAConformance.PDFA_1_A)

    • PDF is not tagged (unexpected, PDF should be tagged)
    • Does not display a compliance notice when opened with Adobe Reader (unexpected, PDF should be claiming compliance)
    • Contains title, author, subject, description
    • Does not contain fonts or language
    • Content is not displayed
  • Neither useFastMode() nor usePdfAConformance(PdfAConformance.PDFA_1_A)

    • PDF is not tagged
    • Contains title, author, subject, description, fonts
    • No language
    • Content is displayed

Application Info

  • Generate HTML using freemarker to merge data with HTML template (resulting HTML is a string and not a file)
  • Generate PDF, I have tried this two ways based on examples I've found. I return a byte array because this is part of a webservice that receives JSON data and returns a PDF representation of the data.

Everything has been working perfectly, I've only run into issues when trying to make the application

Implementation 1

public byte[] generatePdf(final String html) throws Exception {
        System.out.println("in generate pdf");
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.useFastMode();
        builder.usePdfAConformance(PdfAConformance.PDFA_1_A);

        Map<String, String> fonts = FontHelper.getFonts(true);
        fonts.forEach( (k, v) -> {
            if (k.contains("Bold") && k.contains("Italic")) {
                builder.useFont(new File(k), v, 700, FontStyle.ITALIC, true);
            } else if (k.contains("Bold")) {
                builder.useFont(new File(k), v, 700, FontStyle.NORMAL, true);
            } else if (k.contains("Italic")) {
                builder.useFont(new File(k), v, 400, FontStyle.ITALIC, true);
            } else if (k.contains("Regular")) {
                builder.useFont(new File(k), v, 400, FontStyle.NORMAL, true);
            }
        });

        builder.withHtmlContent(html, "/");
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        builder.toStream(outputStream);

        builder.run();

        outputStream.close();
        return outputStream.toByteArray();
    }

Implementation 2

public byte[] generatePdf(final String html) throws IOException {
        System.out.println("in generate pdf");
        PdfRendererBuilder builder = new PdfRendererBuilder();

        builder.withHtmlContent(html, "/");
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        builder.toStream(outputStream);
        try (PdfBoxRenderer pdfBoxRenderer = builder.buildPdfRenderer()) {
            pdfBoxRenderer.layout();
            pdfBoxRenderer.createPDF();
            // try-with-resources closes the renderer; no explicit close() is needed
        }
        outputStream.close();
        return outputStream.toByteArray();
    }

I have been doing this with a simplified HTML template until I get it to work before I switch back to my real template:

<html lang="EN-US">
    <head>
        <title>Example Title</title>
        <meta name="subject" content="Example Subject" />
        <meta name="author" content="Example Author" />
        <meta name="description" content="Example Description"/>

        <bookmarks>
            <bookmark name="First" href="#first" />
            <bookmark name="Second" href="#second" />
            <bookmark name="Third" href="#third" />
            <bookmark name="Fourth" href="#fourth" />
        </bookmarks>

        <style>
            .noto {
                font-family: "Noto Sans";
            }
            body {
                font-family: "Noto Sans";
            }
        </style>
    </head>
    <body>
        <h1> Title </h1>
        <h2> Subtitle </h2>
        <h3 id="first">Section 1 - First</h3>
        <div>asdoasok</div>
        <h3 id="second">Section 2 - Second</h3>
        <div>asodaokasd</div>
        <h3 id="third">Section 3 - Third</h3>
        <div>asdaklsdpkasd</div>
        <h3 id="fourth">Section 4 - Fourth</h3>
        <div>asodjasojaosdj</div>
    </body>
</html>

In case you're interested in my Font strategy...it's essentially a copy-and-paste of one of the examples.

My fonts are located in my classpath: "project-root/src/main/resources/fonts"

public static Map<String, String> getFonts(boolean showErrors) {

        Map<String, String> fonts = new HashMap<String, String>();

        File fod = new File("src/main/resources/fonts");
        
        List<File> fontFiles = new ArrayList<File>();

        if (fod.isDirectory()) {
            fontFiles.addAll(Arrays.asList(fod.listFiles(new FilenameFilter(){
                public boolean accept(File file, String s) {
                    return s.endsWith(".ttf");
                }
            })));
        } else {
            fontFiles.add(fod);
        }

        System.out.println("Font files: " + fontFiles);

        List<String> errors = new ArrayList<String>();
        for (Iterator<File> fit = fontFiles.iterator(); fit.hasNext();) {
            File f = (File) fit.next();
            Font awtf = null;
            try {
                awtf = Font.createFont(Font.TRUETYPE_FONT, f);
            } catch (FontFormatException e) {
                log.error("Trying to load font via AWT: " + e.getMessage());
            } catch (IOException e) {
                log.error("Trying to load font via AWT: " + e.getMessage());
            }
            try {
                log.info("Font located at " + f.getPath() + "\n" +
                         " family name (reported by AWT): " + awtf.getFamily());
                fonts.put(f.getPath(), awtf.getFamily());
            } catch (RuntimeException e) {
                if (e.getMessage().contains("not a valid TTF or OTF file.")) {
                    errors.add(e.getMessage());
                } else if (e.getMessage().contains("Table 'OS/2' does not exist")) {
                    errors.add(e.getMessage());
                } else if (e.getMessage().contains("licensing restrictions.")) {
                    errors.add(e.getMessage());
                } else {
                    throw e;
                }
            }
        }
        if (errors.size() > 0) {
            if (showErrors) {
                log.error("Errors were reported on reading some font files.");
                for (Iterator<String> eit = errors.iterator(); eit.hasNext();) {
                    log.error(eit.next());
                }
            } else {
                log.error("Errors were reported on reading some font files. Pass true as an argument to show them, and re-call");
            }
        }

        return fonts;
    }

Hi @mattstjean,

Thanks for the detailed write-up!

With regard to fonts, I think you're running into #324. Either the font is not registered under that name (Noto Sans), or an exception is thrown when PDFBox loads it and is silently discarded.

You could put something like this in a main method to check whether it is throwing:

PDDocument doc = new PDDocument();
try {
     PDType0Font.load(doc, new File("/path/to/font.ttf"));
} catch (Exception e) {
     e.printStackTrace();
}

As to the rest, I've just added a PDF/A testing module using VeraPDF. I used the following code to create the PDF:

        byte[] pdfBytes;
        
        try (PDDocument doc = new PDDocument()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.usePDDocument(doc);
            builder.useFastMode();
            //builder.testMode(true);
            builder.usePdfAConformance(conform);
            builder.useFont(new File("target/test/artefacts/Karla-Bold.ttf"), "TestFont");
            builder.withHtmlContent(html, PdfATester.class.getResource("/html/").toString());
    
            try (PdfBoxRenderer renderer = builder.buildPdfRenderer()) {
                renderer.createPDFWithoutClosing();
            }
    
            try (InputStream colorProfile = PdfATester.class.getResourceAsStream("/colorspaces/sRGB.icc")) {
                PDOutputIntent oi = new PDOutputIntent(doc, colorProfile); 
                oi.setInfo("sRGB IEC61966-2.1"); 
                oi.setOutputCondition("sRGB IEC61966-2.1"); 
                oi.setOutputConditionIdentifier("sRGB IEC61966-2.1"); 
                oi.setRegistryName("http://www.color.org"); 
                doc.getDocumentCatalog().addOutputIntent(oi);
            }
        
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            doc.save(baos);
            pdfBytes = baos.toByteArray();
        }

Note: I got the color space file from:
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/resources/org/apache/pdfbox/resources/pdfa/

The test reports the following problems:

DISTINCT ERRORS(all-in-one--1a) (4): [
    An annotation dictionary shall contain the F key. The F key’s Print flag bit shall be set to 1 and its Hidden, Invisible and NoView flag bits shall be set to 0
    root/document[0]/pages[0](9 0 obj PDPage)/annots[1](17 0 obj PDAnnot)
    If a document information dictionary does appear at a document, then all of its entries that have analogous properties in predefined XMP schemas, shall also be embedded in the file in XMP form with equivalent values.
    root
    If an Image dictionary contains the Interpolate key, its value shall be false
    root/document[0]/pages[0](9 0 obj PDPage)/contentStream[0](14 0 obj PDContentStream)/operators[203]/xObject[0](23 0 obj PDXImage)
    An XObject dictionary shall not contain the SMask key
    root/document[0]/pages[0](9 0 obj PDPage)/contentStream[0](14 0 obj PDContentStream)/operators[203]/xObject[0](23 0 obj PDXImage)
]

The XMP issue is probably why the author is going missing. These all appear simple to fix, except for the SMask issue: SMask is used to implement transparency in images. I guess for now, we could advise people not to use transparent PNGs?
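If you want to spot such images ahead of time, here is a minimal sketch using only the JDK's ImageIO. The class and method names (`AlphaCheck`, `hasAlpha`) are hypothetical and not part of openhtmltopdf. It only reports whether the decoded image has an alpha channel, so treat it as a rough pre-check that may not catch every form of transparency.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of openhtmltopdf.
public final class AlphaCheck {
    private AlphaCheck() {}

    /** Returns true if the decoded image carries an alpha channel. */
    public static boolean hasAlpha(InputStream imageData) throws IOException {
        BufferedImage img = ImageIO.read(imageData);
        if (img == null) {
            // ImageIO.read returns null when no registered reader understands the data
            throw new IOException("Unsupported or unreadable image format");
        }
        return img.getColorModel().hasAlpha();
    }
}
```

Images for which this returns true are the ones I would expect to be candidates for the SMask error, though that link is an assumption on my part.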

In addition, PDF/A1a requires proper tagging. Fortunately, I've just implemented that for PDF/UA so that shouldn't be hard to get working.

UPDATE:

We are now compliant with PDF/A standards 1 and 2, except for PDF/A1a when using tables. This is because we are using the TFoot, TBody and THead structure types which were only introduced with PDF standard 1.5 (PDF/A1 is based on PDF 1.4).

So I'll have to find a way to factor out their use and then I can finally release RC-18.

Additionally, I forgot that we have a builder method to input the color profile, so updated code to use PDF/A standards is something like:

            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFastMode();
            //builder.testMode(true);
            builder.usePdfAConformance(conform);
            builder.useFont(new File("target/test/artefacts/Karla-Bold.ttf"), "TestFont");
            builder.withHtmlContent(html, PdfATester.class.getResource("/html/").toString());
    
            try (InputStream colorProfile = PdfATester.class.getResourceAsStream("/colorspaces/sRGB.icc")) {
                byte[] colorProfileBytes = IOUtils.toByteArray(colorProfile);
                builder.useColorProfile(colorProfileBytes);
            }
        
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            builder.toStream(baos);
            builder.run();

We're now PDF/A1a compliant as well. I've written up some guidelines for PDF/A compliance in the wiki.

Note in the example, the addition of this line:

builder.usePdfVersion(conform.getPart() == 1 ? 1.4f : 1.5f);
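
The same choice written as a standalone helper, purely as a sketch: PDF/A part 1 is based on PDF 1.4, and for later parts the example uses 1.5. The class and method names (`PdfVersionChooser`, `forPdfAPart`) are hypothetical, and `part` is assumed to be the integer returned by `conform.getPart()`.

```java
// Hypothetical helper mirroring the one-liner above; not part of openhtmltopdf.
public final class PdfVersionChooser {
    private PdfVersionChooser() {}

    /** Picks the PDF version to request for a given PDF/A part number. */
    public static float forPdfAPart(int part) {
        if (part < 1) {
            throw new IllegalArgumentException("PDF/A part must be 1 or greater: " + part);
        }
        // PDF/A-1 is based on PDF 1.4; this example uses 1.5 for later parts.
        return part == 1 ? 1.4f : 1.5f;
    }
}
```

It would then be called as `builder.usePdfVersion(PdfVersionChooser.forPdfAPart(conform.getPart()));`.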

I think we can now close this issue. I'll release RC18 this week. Please re-open if you find any more issues with PDF/A. Thanks @mattstjean.

@danfickle: The initial issue (missing fonts) still seems to appear if we use builder.usePdfUaAccessbility(true).
However, since this bug was about PDF/A compliance and not necessarily PDF/UA, would that be a separate bug?

Hi @mattstjean,

We can discuss here. Firstly, just making sure you know that src/main/resources will not be a directory when your project is compiled into a jar?
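
One way around that, as a minimal sketch using only the JDK: read the font as a classpath stream and copy it to a temporary file, which can then be passed to the File-based useFont overload. The class and method names (`ClasspathFonts`, `copyToTempFile`) are hypothetical, and the resource path in the usage note is an assumption based on the layout you described.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of openhtmltopdf.
public final class ClasspathFonts {
    private ClasspathFonts() {}

    /** Copies a stream (for example a classpath resource) to a temp file removed on JVM exit. */
    public static File copyToTempFile(InputStream in, String suffix) throws IOException {
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing
            throw new IOException("Resource not found on classpath");
        }
        Path tmp = Files.createTempFile("font-", suffix);
        tmp.toFile().deleteOnExit();
        try (InputStream src = in) {
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp.toFile();
    }
}
```

Usage would be along the lines of `File f = ClasspathFonts.copyToTempFile(getClass().getResourceAsStream("/fonts/NotoSans-Regular.ttf"), ".ttf");` followed by `builder.useFont(f, "Noto Sans", 400, FontStyle.NORMAL, true);`. Because it reads a stream rather than a directory, this works the same whether the fonts sit in src/main/resources or inside a jar.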

Thank you for all of the help, @danfickle . Sorry about the delay in responding, I've been very busy and wanted to try it out before responding.

I figured out my main issue with the fonts.

The first fix was to load them properly (I had been trying all different variants because I wasn't sure why it wasn't working). I landed on:

ClassLoader classLoader = getClass().getClassLoader();
File regFile = new File(classLoader.getResource("fonts/NotoSans-Regular.ttf").getFile());
builder.useFont(regFile, "noto", 400, FontStyle.NORMAL, true);

Then I hit a snag and needed a second fix that wasn't as obvious to me. It was actually caused by the way I had the page counter set up. It's not in my initial example above because I added it after the fonts worked. The way I had it was:

@bottom-right {
    content: 'Page ' counter(page) ' of ' counter(pages);
}

I had it like that and then like:

@bottom-right {
    content: 'Page ' counter(page) ' of ' counter(pages);
    font-family: 'noto', sans-serif;
    font-size: 12;
}

Neither of those worked, and I was getting a lot of errors saying "Font list empty" or something similar. When I changed it to

@bottom-right {
    font-family: 'noto', sans-serif;
    font-size: 12;
    content: 'Page ' counter(page) ' of ' counter(pages);
}

it worked. Your HTML examples have it that way too, so I finally figured it out while doing a manual diff between my HTML and yours.


I am now having an issue where the document language isn't getting set. When I run the Adobe Acrobat Pro DC accessibility full check, it reports 2 failures:

Primary language | Failed | Text language is specified
Title | Failed | Document title is showing in title bar

The title I'm not super worried about because when I look at the document properties it does have a value for title. The thing that I'm trying to figure out is why language is not getting set. I get the same 2 fails when I run it on a PDF generated from your all-in-one.html test file.

<html lang="EN-US">
    <head>
        <title>Summary</title>
        <meta name="subject" content="Summary" />
        <meta name="author" content="Business" />
        <meta name="description" content="Request Summary"/>

Let me know if I should open a different issue.

Closing in favor of #347. The order of properties situation is bizarre. Not sure what is happening.