chrome_read_html

Question

Closed this issue 4 years ago · 6 comments

Thanks for this great package. Is there a function for reading HTML (extracting the page source), like chrome_read_html() in the decapitated package?

Answer 1 · 2021-03-11T14:38:33.000Z

Got it, thanks. Here is my code:

  z <- b$Runtime$evaluate('document.documentElement.outerHTML')
  mydf <- z$result$value

One last question: can we use rvest on this? The result does not seem to be XML, so the following does not work:
mydf %>% rvest::html_nodes("[id$='_hcontainer']")

Answer 2 · 2021-03-11T14:40:58.000Z

For now, crrri is rather low-level, and you need to write the recipe yourself.
I believe chrome_read_html() is equivalent to the dumpDOM() function we gave as an example in the README: https://github.com/RLesur/crrri#transpose-chrome-remote-interface-js-scripts-dump-the-dom

It evaluates the same expression you found.

The result should be HTML, so rvest or xml2 can be used on it. A reproducible example would make it easier to see the issue.
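
One possible cause, as a guess without seeing the full code: the value returned by the evaluation is a plain character string, and rvest selectors work on a parsed document, so the string has to go through read_html() first. Here is a minimal sketch of that step. The markup in html_string is made up for illustration and only stands in for whatever the evaluated expression returns.

```r
library(rvest)

# Hypothetical stand-in for the string returned by Runtime$evaluate();
# the markup below is invented for this example.
html_string <- paste0(
  "<html><head><title>Demo</title></head><body>",
  "<div id='a_hcontainer'>first</div>",
  "<div id='b_hcontainer'>second</div>",
  "<div id='other'>skip</div>",
  "</body></html>"
)

# Parse the string into an xml_document before applying any selector
doc <- read_html(html_string)

# Select every element whose id ends with "_hcontainer"
nodes <- html_nodes(doc, "[id$='_hcontainer']")
matched_text <- html_text(nodes)
print(matched_text)
```

If the selector still returns nothing on the real page, the elements may be added by JavaScript after the load event, which would be a separate issue.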

Answer 3 · 2021-03-11T14:43:15.000Z

To clarify my thoughts: having these functions in crrri directly does not feel like the best option, because we want to keep this package centered around the Chrome Remote Interface.
We did have the idea of creating a package that would contain recipes like dumpDOM(), but we have not found the time to start it yet.

Answer 4 · 2021-03-11T14:49:33.000Z

This works fine with rvest. Here is an example:

library(promises)
library(crrri)

dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n", file = file)
    }) 
  })
}

dump_DOM(url = "http://www.ardata.fr/post/", file = "test.html")
#> Running "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-xbpnjxhj" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox

library(rvest)
#> Loading required package: xml2
html <- read_html("test.html")
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)

Answer 5 · 2021-03-11T15:21:50.000Z

Thanks, this is great. Is it possible to do this without saving an HTML file such as "test.html"?

Answer 6 · 2021-03-11T16:27:04.000Z

You can choose the return value you want in the recipe.
Example:

library(promises)
library(crrri)

dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      # Parse the HTML string in memory instead of writing it to a file
      rvest::read_html(html)
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

You could also return the text directly and parse it afterwards:

library(promises)
library(crrri)

dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      result$result$value
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
read_html(html) %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "
