chrome_read_html
Closed this issue · 6 comments
Thanks for this great package. Is there a function for reading HTML (extracting the page source), like chrome_read_html() in the decapitated package?
Got it. Thanks. Please find my code below -
z <- b$Runtime$evaluate('document.documentElement.outerHTML')
mydf <- z$result$value
Last question: can we use rvest on this? It seems the result is not XML, hence it is not working.
mydf %>% rvest::html_nodes("[id$='_hcontainer']")
For now, crrri is rather low level, and you need to create the recipe yourself.
I believe chrome_read_html() is equivalent to the dumpDOM() function we gave as an example in the README: https://github.com/RLesur/crrri#transpose-chrome-remote-interface-js-scripts-dump-the-dom
It evaluates the same expression you found.
The result should be HTML, so rvest or xml2 can be used on it. With an example, it would be easier to see the issue.
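One possible cause, sketched here under assumptions: the value returned by Runtime$evaluate() is a plain character string, and rvest selectors work on a parsed document rather than on a string. The snippet below uses a made-up HTML string (`html_string` and its contents are hypothetical stand-ins for `z$result$value`) and parses it with read_html() before applying the selector from the question.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the string found in z$result$value
html_string <- paste0(
  "<html><head><title>Demo</title></head><body>",
  "<div id=\"a_hcontainer\">first</div>",
  "<div id=\"other\">second</div>",
  "</body></html>"
)

# Parse the string into a document first; selectors need a parsed document
doc <- read_html(html_string)

# Same CSS selector as in the question: ids ending in "_hcontainer"
nodes <- html_nodes(doc, "[id$='_hcontainer']")
html_text(nodes)
```

If this is indeed the issue, calling read_html() on `mydf` before html_nodes() should be enough.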
To clarify my thoughts: having these functions in crrri directly does not feel like the best option if we want to keep this package centered around the Chrome Remote Interface.
We had the idea of creating a package that would contain recipes like dumpDOM(), but we have not found the time to start it yet.
This works ok with rvest. Here is an example:
library(promises)
library(crrri)
# Load a page in headless Chrome and write its rendered HTML to `file`
# (the default file = "" makes cat() print to the console)
dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% {
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      # wait for the page's load event before reading the DOM
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n", file = file)
    })
  })
}
html <- dump_DOM(url = "http://www.ardata.fr/post/", "test.html")
#> Running "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" \
#> --no-first-run --headless \
#> "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-xbpnjxhj" \
#> "--remote-debugging-port=9222" --disable-gpu --no-sandbox
library(rvest)
#> Loading required package: xml2
html <- read_html("test.html")
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "
Created on 2021-03-11 by the reprex package (v1.0.0.9002)
Thanks. This is great. Is it possible to do this without saving the HTML to a file ("test.html")?
You could decide the return value you want in the recipe.
Example:
library(promises)
library(crrri)
# Load a page in headless Chrome and return its rendered HTML as a parsed document
dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% {
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      # wait for the page's load event before reading the DOM
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      # parse the HTML string in memory instead of writing it to a file
      rvest::read_html(html)
    })
  })
}
html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "
Created on 2021-03-11 by the reprex package (v1.0.0.9002)
You could also return the text directly:
library(promises)
library(crrri)
# Load a page in headless Chrome and return its rendered HTML as a character string
dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% {
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      # wait for the page's load event before reading the DOM
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      # return the raw HTML string; the caller parses it with read_html()
      result$result$value
    })
  })
}
html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
read_html(html) %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "
Created on 2021-03-11 by the reprex package (v1.0.0.9002)
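The variants above differ only in what the last step does with the HTML string. As a rough sketch that runs without Chrome, the snippet below compares the file round trip with in-memory parsing. The `html_string` value is a made-up stand-in for `result$result$value`, and `tmp` is a temporary file used only for illustration.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the string returned by Runtime$evaluate()
html_string <- "<html><head><title>Blog | Demo</title></head><body><p>Hello</p></body></html>"

# File round trip: write the string to disk, then parse the file
tmp <- tempfile(fileext = ".html")
cat(html_string, "\n", file = tmp)
from_file <- read_html(tmp)

# In-memory parsing: pass the string straight to read_html()
from_string <- read_html(html_string)

title_file <- html_text(html_node(from_file, "title"))
title_string <- html_text(html_node(from_string, "title"))

# Both routes should give the same title
identical(title_file, title_string)
```

Returning the string, as in the last variant, leaves the choice of parser (rvest, xml2, or something else) to the caller.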