Truncated HTML file after `downlit_html_path()` processing
cderv opened this issue · 3 comments
This was first reported at quarto-dev/quarto-cli#6128, and the investigation led to the `downlit::downlit_html_path()`
processing step.
I put the full file that Quarto post-processes with downlit in this gist.
When we process it, the output is truncated:
input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0
content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>
downlit::downlit_html_path(html_file, html_file)
content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </div>
#> </div></section></section></main>
#> </div>
#> </body>
#> </html>
unlink(html_file)
We can see that we lose the `<aside>` element.
This document is special because it contains a Plotly output with very long data in a `<script>` tag, so this may be an issue with xml2 directly.
It seems this is an issue with `xml2::read_html()`: we can see below that parsing stops at the CDATA part, which is the JSON data for the Plotly graph.
input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0
content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>
structure_file <- tempfile(fileext = ".html")
xml2::read_html(html_file) |> xml2::xml_structure(file = structure_file)
content <- xfun::read_utf8(structure_file)
xfun::raw_string(tail(content[nzchar(content)], 10))
#> {text}
#> {text}
#> {text}
#> <div [id, class]>
#> {text}
#> <figure>
#> <div [class, id, style]>
#> {text}
#> <script [type, data-for]>
#> {cdata}
unlink(c(html_file, structure_file))
Created on 2023-07-07 with reprex v2.0.2
OK, this is in fact a matter of libxml2 options. Passing `"HUGE"` enables the `XML_PARSE_HUGE` parser option in libxml2, which relaxes the parser's hardcoded limits. Those limits are what caused the issue here.
input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0
content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> </aside>
#> </main> <!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>
html <- xml2::read_html(html_file, encoding = "UTF-8", options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
downlit::downlit_html_node(html, classes = downlit::classes_pandoc())
xml2::write_html(html, html_file, format = FALSE)
content <- xfun::read_utf8(html_file)
xfun::raw_string(tail(content[nzchar(content)], 5))
#> <li id="fn7"><p>I used slightly different variable names, and I tried to simplify JMS’s original model.<a href="#fnref7" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
#> </ol></aside></main><!-- /main -->
#> </div> <!-- /content -->
#> </body>
#> </html>
unlink(html_file)
Would setting this option by default be interesting for downlit? Or should the option be triggered only when one of the HTML lines exceeds the limit?
Because indeed, in the document above, the line holding the data is very long!
input <- "https://gist.githubusercontent.com/cderv/bfa2597c003065409e3473f2e36a79e3/raw/554c22e9b7fe054f7581e3a0d69e6780b284ac27/long-plotly-data.html"
html_file <- tempfile(fileext = ".html")
xfun::download_file(input, output = html_file)
#> [1] 0
content <- xfun::read_utf8(html_file)
max(nchar(content))
#> [1] 12950084
unlink(html_file)
I think we could turn HUGE parsing on by default in downlit.
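A minimal sketch of what that could look like, based on the workaround shown above. This is not downlit's actual implementation; the wrapper name is hypothetical, and it simply reads with relaxed libxml2 limits before handing the document to `downlit::downlit_html_node()`:

```r
# Hypothetical wrapper: like downlit_html_path(), but parse with
# XML_PARSE_HUGE so very long text/CDATA nodes (e.g. Plotly data)
# are not silently truncated by libxml2's default limits.
downlit_html_path_huge <- function(in_path, out_path,
                                   classes = downlit::classes_pandoc()) {
  html <- xml2::read_html(
    in_path,
    encoding = "UTF-8",
    options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
  )
  downlit::downlit_html_node(html, classes = classes)
  xml2::write_html(html, out_path, format = FALSE)
  invisible(out_path)
}
```

Since `XML_PARSE_HUGE` only lifts limits (it does not change parsing behavior for ordinary documents), enabling it unconditionally seems simpler than scanning line lengths first.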