RLesur/crrri

chrome_read_html


Thanks for this great package. Is there a function for reading HTML (extracting the page source), like chrome_read_html() in the decapitated package?

Got it. Thanks. Please find my code below -

  z <- b$Runtime$evaluate('document.documentElement.outerHTML')
  mydf <- z$result$value

One last question: can we use rvest on this? The result does not seem to be XML, so the following does not work:
mydf %>% rvest::html_nodes("[id$='_hcontainer']")

cderv commented

For now, crrri is rather low-level, so you need to write the recipe yourself.
I believe chrome_read_html() is equivalent to the dumpDOM() function we gave as an example in the README: https://github.com/RLesur/crrri#transpose-chrome-remote-interface-js-scripts-dump-the-dom

It evaluates the same expression you found.

The result should be HTML, so rvest or xml2 can be used on it. With an example, it would be easier to see the issue.
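
One likely cause, assuming `mydf` is the character string returned by `Runtime$evaluate()`: `html_nodes()` expects a parsed document rather than raw text, so the string has to go through `read_html()` first. A minimal sketch, using a made-up HTML string as a stand-in for the real page:

```r
library(rvest)

# Hypothetical markup standing in for the value of z$result$value
mydf <- '<html><body><div id="quote_hcontainer">42</div></body></html>'

# Parse the string into an xml2 document before querying it
doc <- read_html(mydf)

# Same CSS selector as in the question: ids ending in "_hcontainer"
nodes <- html_nodes(doc, "[id$='_hcontainer']")
html_text(nodes)
```

If the selector still returns nothing on the real page, the element may not exist yet when the expression is evaluated, which is a separate problem from parsing.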

cderv commented

To clarify my thoughts: having these functions in crrri directly does not feel like the best option if we want to keep this package centered around the Chrome Remote Interface.
We did have the idea of creating a package that would contain recipes like dumpDOM(), but we have not found the time to start it yet.

cderv commented

This works ok with rvest. Here is an example:

library(promises)
library(crrri)

dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n", file = file)
    }) 
  })
}

# dump_DOM() is called for its side effect here: it writes the page to test.html
dump_DOM(url = "http://www.ardata.fr/post/", file = "test.html")
#> Running "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-xbpnjxhj" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox

library(rvest)
#> Loading required package: xml2
html <- read_html("test.html")
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)
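
The recipe above is a chain of promises: each `%...>%` step waits for the previous one to resolve and receives its value. A minimal sketch of that mechanism using only the promises and later packages (the numbers and the `result` variable are purely illustrative, and draining the event loop by hand is something `perform_with_chrome()` is assumed to handle for you in the recipe):

```r
library(promises)

# Start from a promise that is already resolved with the value 2
p <- promise_resolve(2) %...>%
  (function(x) x * 10) %...>%   # receives 2, resolves to 20
  (function(x) x + 1)           # receives 20, resolves to 21

# Capture the final value once the whole chain has settled
result <- NULL
p %...>% (function(value) result <<- value)

# Promise callbacks run on the 'later' event loop; drain it manually here
while (!later::loop_empty()) later::run_now()

result
```

This is why the last function in the chain decides what the recipe returns: whatever it evaluates to becomes the resolved value.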

Thanks, this is great. Is it possible to do this without saving the page to an HTML file ("test.html")?

cderv commented

You could decide the return value you want in the recipe.
Example:

library(promises)
library(crrri)

dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      # Parse the HTML string and return the document
      rvest::read_html(html)
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)

You could also return the text directly:

library(promises)
library(crrri)

dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      result$result$value
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
read_html(html) %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)