danfickle/openhtmltopdf

Bookmarks are not working with JSoup

Milchreis opened this issue · 2 comments

If you generate a PDF from a document parsed with JSoup, the bookmarks are not included in the output and no error is reported.

Example

String html = "<html>\n" +
        "<head>\n" +
        "<style>\n" +
        "div {\n" +
        "\tpage-break-after: always;\n" +
        "}\n" +
        "#toc {\n" +
        "    width: 100%;\n" +
        "    border-collapse: collapse;\n" +
        "}\n" +
        "#toc .page-number::after {\n" +
        "  /* SPECIAL STUFF HERE! */\n" +
        "  content: target-counter(attr(href), page);\n" +
        "  width: 30px;\n" +
        "}\n" +
        "</style>\n" +
        "<bookmarks>\n" +
        "  <bookmark name=\"Title of element on page 1\" href=\"#page-1\"/>\n" +
        "  <bookmark name=\"Title of element on page 2\" href=\"#page-2\"/>\n" +
        "  <bookmark name=\"Title of element on page 3\" href=\"#page-3\"/>\n" +
        "  <bookmark name=\"Title of element on page 4\" href=\"#page-4\"/>\n" +
        "</bookmarks>\n" +
        "</head>\n" +
        "<body>\n" +
        "<h1>Bookmarks and TOC example</h1>\n" +
        "\n" +
        "<h2>TOC</h2>\n" +
        "<table id=\"toc\">\n" +
        "  <tr><td><a href=\"#page-1\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-1\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-2\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-2\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-3\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-3\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-4\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-4\"></td></tr>\n" +
        "</table>\n" +
        "\n" +
        "<div id=\"page-1\">Page 1</div>\n" +
        "<div id=\"page-2\">Page 2</div>\n" +
        "<div id=\"page-3\">Page 3</div>\n" +
        "<div id=\"page-4\">Page 4</div>\n" +
        "\n" +
        "</body>\n" +
        "</html>";

// Parse with JSoup (HTML5 rules), then convert to a W3C DOM document.
org.jsoup.nodes.Document doc = Jsoup.parse(html);

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(new W3CDom().fromJsoup(doc), null);
// builder.withHtmlContent(html, null);  <- this works with bookmarks

// `output` is an OutputStream opened elsewhere.
builder.toStream(output);
builder.run();

syjer commented

Hi @Milchreis,

This is caused by the HTML5 parsing rules.

If you call doc.outerHtml(), you will get the following representation:

<html>
 <head> 
  <style>
div {
	page-break-after: always;
}
#toc {
    width: 100%;
    border-collapse: collapse;
}
#toc .page-number::after {
  /* SPECIAL STUFF HERE! */
  content: target-counter(attr(href), page);
  width: 30px;
}
</style> 
 </head>
 <body>
  <bookmarks> 
   <bookmark name="Title of element on page 1" href="#page-1" /> 
   <bookmark name="Title of element on page 2" href="#page-2" /> 
   <bookmark name="Title of element on page 3" href="#page-3" /> 
   <bookmark name="Title of element on page 4" href="#page-4" /> 
  </bookmarks>   
  <h1>Bookmarks and TOC example</h1> 
  <h2>TOC</h2> 
  <table id="toc"> 
   <tbody>
    <tr>
     <td><a href="#page-1">Title of element on page</a></td>
     <td class="page-number" href="#page-1"></td>
    </tr> 
    <tr>
     <td><a href="#page-2">Title of element on page</a></td>
     <td class="page-number" href="#page-2"></td>
    </tr> 
    <tr>
     <td><a href="#page-3">Title of element on page</a></td>
     <td class="page-number" href="#page-3"></td>
    </tr> 
    <tr>
     <td><a href="#page-4">Title of element on page</a></td>
     <td class="page-number" href="#page-4"></td>
    </tr> 
   </tbody>
  </table> 
  <div id="page-1">
   Page 1
  </div> 
  <div id="page-2">
   Page 2
  </div> 
  <div id="page-3">
   Page 3
  </div> 
  <div id="page-4">
   Page 4
  </div>  
 </body>
</html>

Notice how the bookmarks element has been moved from the head to the body: under the HTML5 rules, an unknown element such as <bookmarks> inside the head implicitly closes the head and starts the body. The code only looks for the bookmarks in the head ( DOMUtil.getChild(head, "bookmarks") ), so they are not found.
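
As a possible workaround until the lookup changes, the bookmarks element could be moved back into the head on the W3C document before it is handed to the builder. Below is a minimal sketch that uses only the JDK's org.w3c.dom API. BookmarkRelocator and its methods are hypothetical names, not part of openhtmltopdf or JSoup, and the sketch assumes the converted document exposes unprefixed element names (html, head, body, bookmarks).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper; not part of openhtmltopdf or JSoup. */
final class BookmarkRelocator {

    /** Returns the first direct child element of parent with the given name, or null. */
    static Element firstChildElement(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /**
     * Moves a bookmarks element that is a direct child of body into head.
     * Returns true if an element was moved, false if nothing had to be done.
     */
    static boolean moveBookmarksToHead(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return false;
        }
        Element head = firstChildElement(root, "head");
        Element body = firstChildElement(root, "body");
        if (head == null || body == null) {
            return false;
        }
        if (firstChildElement(head, "bookmarks") != null) {
            return false; // already where the renderer looks for it
        }
        Element bookmarks = firstChildElement(body, "bookmarks");
        if (bookmarks == null) {
            return false;
        }
        // appendChild detaches the node from its old parent before re-inserting it.
        head.appendChild(bookmarks);
        return true;
    }

    /** Builds a tiny document shaped like the JSoup output: bookmarks inside body. */
    static Document sampleDocument() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element body = doc.createElement("body");
        Element bookmarks = doc.createElement("bookmarks");
        Element bookmark = doc.createElement("bookmark");
        bookmark.setAttribute("name", "Title of element on page 1");
        bookmark.setAttribute("href", "#page-1");
        bookmarks.appendChild(bookmark);
        body.appendChild(bookmarks);
        html.appendChild(head);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }
}
```

In the example from the issue, moveBookmarksToHead would be called on the result of new W3CDom().fromJsoup(doc) before passing it to builder.withW3cDocument. I have not verified this against openhtmltopdf itself.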

Note: I've tried with my HTML5 parser (https://github.com/digitalfondue/jfiveparse) and the output is a little different (note how the self-closing "bookmark" elements are interpreted):

<html><head>
<style>
div {
	page-break-after: always;
}
#toc {
    width: 100%;
    border-collapse: collapse;
}
#toc .page-number::after {
  /* SPECIAL STUFF HERE! */
  content: target-counter(attr(href), page);
  width: 30px;
}

</style>
</head><body><bookmarks>
  <bookmark name="Title of element on page 1" href="#page-1">
  <bookmark name="Title of element on page 2" href="#page-2">
  <bookmark name="Title of element on page 3" href="#page-3">
  <bookmark name="Title of element on page 4" href="#page-4">
</bookmark></bookmark></bookmark></bookmark></bookmarks>


<h1>Bookmarks and TOC example</h1>

<h2>TOC</h2>
<table id="toc">
  <tbody><tr><td><a href="#page-1">Title of element on page</a></td><td class="page-number" href="#page-1"></td></tr>
  <tr><td><a href="#page-2">Title of element on page</a></td><td class="page-number" href="#page-2"></td></tr>
  <tr><td><a href="#page-3">Title of element on page</a></td><td class="page-number" href="#page-3"></td></tr>
  <tr><td><a href="#page-4">Title of element on page</a></td><td class="page-number" href="#page-4"></td></tr>
</tbody></table>


<div id="page-1">Page 1</div>
<div id="page-2">Page 2</div>
<div id="page-3">Page 3</div>
<div id="page-4">Page 4</div>


</body></html>

This is arguably more correct, since Chrome interprets the HTML the same way:
[Screenshot from 2019-11-15 14-55-55]

So I guess that DOMUtil.getChild(head, "bookmarks") should also look in the body as a fallback.
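
To illustrate the fallback idea, here is a rough sketch of a lookup that checks the head first and only then the body. It is not the actual openhtmltopdf code: BookmarkLookup, child and findBookmarks are hypothetical names, and child only stands in for what DOMUtil.getChild appears to do (return the first direct child element with a given name).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch of a head-then-body lookup; not the library's implementation. */
final class BookmarkLookup {

    /** Stand-in for a "first direct child element named X" helper; returns null if absent. */
    static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Looks for bookmarks under head first, then under body; returns null if neither has it. */
    static Element findBookmarks(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return null;
        }
        for (String container : new String[] {"head", "body"}) {
            Element parent = child(root, container);
            if (parent != null) {
                Element bookmarks = child(parent, "bookmarks");
                if (bookmarks != null) {
                    return bookmarks;
                }
            }
        }
        return null;
    }

    /**
     * Test fixture: html with head and body, and a bookmarks element placed
     * under the named container ("head" or "body"); any other value adds none.
     */
    static Document documentWithBookmarksIn(String container) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element html = doc.createElement("html");
        doc.appendChild(html);
        for (String name : new String[] {"head", "body"}) {
            Element section = doc.createElement(name);
            if (name.equals(container)) {
                section.appendChild(doc.createElement("bookmarks"));
            }
            html.appendChild(section);
        }
        return doc;
    }
}
```

Checking the head first keeps the current behaviour for XML-parsed input, where the element stays in the head, and the body is only consulted when an HTML5 parser has relocated it.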

I think I can provide a PR for that. @danfickle, what do you think?

Thank you guys. Waiting for the next release 😊