read_html_live() practical implementation
Closed this issue · 1 comment
rcepka commented
Hello,
thank you for this excellent package and for the newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works and I am trying to figure out how to fit it into my scraping workflow.
So if it isn't outside the scope of your regular user support, I would appreciate your advice on these topics.
What I currently expect from my web scraping solution is mainly this:
- the ability to rotate user agents
- a proxy setup to rotate IPs
- the ability to re-run the HTML request in case it fails for some reason
Below is a simplified version of how I am doing it now:
library(httr)
library(logger)

scrape_page <- function(link, usr_agent, scraping_repeat, ...) {
  # Random delay between requests (bounds are set in my global config)
  Sys.sleep(runif(1, sys_sleep_time_from, sys_sleep_time_to))

  # Initial values; response_code starts at 0 so the loop runs at least once
  response <- NULL
  response_code <- 0
  attempts <- 1

  # Main loop: retry until we get a 200 or run out of attempts
  while (response_code != 200 && attempts <= scraping_repeat) {
    # Pick a fresh proxy and user agent before each GET
    proxy_number <- get_proxy_number(proxies_list = proxies_list,
                                     proxy_selection = proxy_selection)
    usr_agent <- sample(user_agents_list, 1)
    tryCatch({
      response <- GET(
        link,
        user_agent(usr_agent),
        use_proxy(
          url = proxies_list$address[proxy_number],
          port = as.numeric(proxies_list$port[proxy_number]),
          username = proxies_list$username[proxy_number],
          password = proxies_list$pass[proxy_number]
        )
      )
      response_code <- response$status_code
    },
    # Error handling: log and fall through to the retry logic below
    error = function(e) {
      logger::log_error("Fun scrape_page: The page could not be scraped, link: {link}")
    })

    # Back off and retry if the request did not succeed
    if (response_code != 200) {
      attempts <- attempts + 1
      Sys.sleep(scraping_repeat_wait_time * attempts)
    }
  }

  return(response)
}
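For reference, the function relies on several globals defined elsewhere in my script (get_proxy_number() is my own proxy-rotation helper, stubbed out here); a minimal call with purely illustrative values would look like this:

# Illustrative configuration only; the real values live elsewhere in my script
sys_sleep_time_from <- 1
sys_sleep_time_to <- 3
scraping_repeat_wait_time <- 5
proxy_selection <- "random"
user_agents_list <- c("Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
                      "Mozilla/5.0 (X11; Linux x86_64)")
proxies_list <- data.frame(address = "203.0.113.10", port = "8080",
                           username = "user", pass = "secret")
# Stub standing in for my real proxy-rotation helper
get_proxy_number <- function(proxies_list, proxy_selection) 1

response <- scrape_page("https://example.com", usr_agent = NULL,
                        scraping_repeat = 3)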
My questions:
- How can I implement read_html_live() with the proxy and user agent features?
- How do I interact with a site using read_html_live() plus $click(), $scroll_to(), etc.? Sorry, I am a newbie here... (a rough sketch of what I mean follows below)
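To make the second question concrete, here is roughly what I imagine, pieced together from the rvest documentation (the URL and CSS selectors are made up):

library(rvest)

# Made-up URL and selectors, just to illustrate the kind of interaction I mean
live <- read_html_live("https://example.com/products")

# Scroll down to trigger lazy loading, then click a "load more" button
live$scroll_to(top = 2000)
live$click(".load-more")

# After the page updates, the usual rvest verbs work on the live object
live |> html_elements(".product-title") |> html_text2()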
Many thanks in advance for any advice, hints, or opinions...
hadley commented
- Ability to change user agents is tracked in #388
- It looks like using a proxy requires setting some command-line flags. That's going to require quite a lot of plumbing, so it's unlikely to be something I tackle until a few people have requested it. One possible interim workaround is sketched below.
- I'm currently not sure how we'll expose browser errors to R.
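In the meantime, something like this might work, since read_html_live() drives Chrome through chromote (untested sketch; the proxy address is a placeholder):

library(chromote)
library(rvest)

# Untested sketch: add Chromium's --proxy-server flag to the arguments
# chromote uses when launching Chrome. This must run before the first
# live session starts; the proxy address below is a placeholder.
set_chrome_args(c(default_chrome_args(),
                  "--proxy-server=http://203.0.113.10:8080"))

live <- read_html_live("https://example.com")

Note that this flag only covers unauthenticated proxies; proxies that need a username and password would require more plumbing.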