r-lib/downlit

Truncated HTML file after `downlit_html_path()` processing

cderv opened this issue · 3 comments

cderv commented

This was first reported at quarto-dev/quarto-cli#6128, and the investigation led to the `downlit::downlit_html_path()` processing step.

I put the full file that Quarto is post-processing with downlit in this gist.

If we process it with downlit, the output is truncated:

input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0

content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>

downlit::downlit_html_path(html_file, html_file)

content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </div>
#> </div></section></section></main>
#> </div>
#> </body>
#> </html>

unlink(html_file)

We see that we lose the `<aside>` element.

This document is special because it contains a Plotly output with very long data in a `<script>` tag, so this is possibly an issue with xml2 directly.

cderv commented

> So possibly an issue with xml2

It seems this is an issue with `xml2::read_html()`: we can see below that parsing stops at the cdata part, which is the JSON data for the Plotly graph.

input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0

content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>

structure_file <- tempfile(fileext = ".html")
xml2::read_html(html_file) |> xml2::xml_structure(file = structure_file)
content <- xfun::read_utf8(structure_file)
xfun::raw_string(tail(content[nzchar(content)], 10))
#>                           {text}
#>                         {text}
#>                 {text}
#>               <div [id, class]>
#>                 {text}
#>                 <figure>
#>                   <div [class, id, style]>
#>                   {text}
#>                   <script [type, data-for]>
#>                     {cdata}

unlink(c(html_file, structure_file))

Created on 2023-07-07 with reprex v2.0.2

cderv commented

OK, this is in fact a matter of libxml2 options. Passing `"HUGE"` in `options` triggers the `XML_PARSE_HUGE` parser option in libxml2, which relaxes hardcoded limits in the parser; one of those limits is what is causing the issue here.

input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0

content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>

html <- xml2::read_html(html_file, encoding = "UTF-8", options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
downlit::downlit_html_node(html, classes = downlit::classes_pandoc())
xml2::write_html(html, html_file, format = FALSE)

content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> <li id="fn7"><p>I used slightly different variable names, and I tried to simplify JMS’s original model.<a href="#fnref7" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
#> </ol></aside></main><!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>

unlink(html_file)

Would it be interesting for downlit to set this option by default? Or to trigger the option only when one of the HTML lines exceeds the limit?

Because indeed, in the document above, the line with the data is very long!

input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0

content <- xfun::read_utf8(html_file)

max(nchar(content))
#> [1] 12950084

unlink(html_file)
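One way to implement the conditional idea is to check line lengths before parsing. A minimal sketch, assuming the relevant threshold is libxml2's default text-node limit (`XML_MAX_TEXT_LENGTH`, 10,000,000 bytes); the helper names here are hypothetical, not downlit's actual internals:

```r
# Hypothetical helper: enable "HUGE" only when some line exceeds libxml2's
# default text-node limit (XML_MAX_TEXT_LENGTH, 10,000,000 bytes by default).
needs_huge <- function(path, limit = 1e7) {
  lines <- readLines(path, warn = FALSE)
  any(nchar(lines, type = "bytes") > limit)
}

read_html_maybe_huge <- function(path) {
  opts <- c("RECOVER", "NOERROR", "NOBLANKS")
  if (needs_huge(path)) opts <- c(opts, "HUGE")
  xml2::read_html(path, encoding = "UTF-8", options = opts)
}
```

Note this is only an approximation: libxml2 counts the bytes of a single text node, not of a physical line, so the two can differ for minified files.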
hadley commented

I think we could turn HUGE parsing on by default in downlit.
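For reference, turning it on by default could look something like the wrapper below. This is a sketch only; `downlit_parse_options()` and the exact option set are assumptions, not downlit's actual internals.

```r
# Hypothetical sketch: always include "HUGE" in the parse options passed
# to xml2, so oversized text nodes no longer truncate the parsed document.
downlit_parse_options <- function(huge = TRUE) {
  opts <- c("RECOVER", "NOERROR", "NOBLANKS")
  if (huge) opts <- c(opts, "HUGE")
  opts
}

read_html_for_downlit <- function(path) {
  xml2::read_html(path, encoding = "UTF-8",
                  options = downlit_parse_options())
}
```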