tidyverse/rvest

read_html_live() practical implementation

Closed · 1 comment

rcepka commented

Hello,
thank you for this excellent package and for the newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works yet, and I am trying to figure out how to fit it into my scraping workflow.
So if it isn't outside the scope of your regular user support, I would appreciate your advice on these topics.

What I currently expect from my web scraping solution is mainly this:

  • the ability to rotate user agents
  • a proxy implementation to rotate IPs
  • the ability to retry the HTML request if it fails for some reason

Below is a simplified version of the code I am using now:

library(httr)
library(logger)

scrape_page <- function(link, usr_agent, scraping_repeat, ...) {

  # Random pause between requests; the bounds are defined elsewhere
  Sys.sleep(runif(1, sys_sleep_time_from, sys_sleep_time_to))

  # Set initial values
  response <- NULL
  response_code <- 0L
  attempts <- 1

  # Main loop: retry until we get a 200 or run out of attempts
  while (response_code != 200 && attempts <= scraping_repeat) {

    # Pick a fresh proxy and user agent before each GET
    proxy_number <- get_proxy_number(
      proxies_list    = proxies_list,
      proxy_selection = proxy_selection
    )
    usr_agent <- sample(user_agents_list, 1)

    tryCatch({
      response <- GET(
        link,
        user_agent(usr_agent),
        use_proxy(
          url      = proxies_list$address[proxy_number],
          port     = as.numeric(proxies_list$port[proxy_number]),
          username = proxies_list$username[proxy_number],
          password = proxies_list$pass[proxy_number]
        )
      )
      response_code <- response$status_code
    },
    # Error handling: log and fall through to the retry logic below
    error = function(e) {
      logger::log_error("Fun scrape_page: the page could not be scraped, link: {link}")
    })

    # Back off and retry if the request failed
    if (response_code != 200) {
      attempts <- attempts + 1
      Sys.sleep(scraping_repeat_wait_time * attempts)
    }
  }

  response
}
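
For context, a hypothetical call, assuming the configuration objects referenced above (user_agents_list, the sleep bounds, and so on) are defined:

# Hypothetical example call: retries up to 3 times with an increasing back-off
page <- scrape_page(
  link            = "https://example.com",
  usr_agent       = sample(user_agents_list, 1),
  scraping_repeat = 3
)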

My questions:

  • how can I implement read_html_live() with the proxy and user-agent features?
  • how do I interact with a site using read_html_live() plus $click(), $scroll_to(), and so on (see the sketch below)? Sorry, I am a newbie here...

Many thanks in advance for any advice, hints, or opinions...
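
For the interaction question, the LiveHTML object returned by read_html_live() has documented methods such as $click() and $scroll_to(). A minimal sketch, where the URL and CSS selectors are hypothetical placeholders:

library(rvest)

page <- read_html_live("https://example.com/products")

# page$view()                    # optionally watch the live browser while debugging
page$click("button.load-more")   # click a "load more" button once
page$scroll_to(top = 2000)       # scroll down to trigger lazy loading
Sys.sleep(1)                     # give the page a moment to render new content

# The usual rvest verbs work on the live page object
page |> html_elements(".product") |> html_text2()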

  1. The ability to change user agents is tracked in #388.
  2. It looks like using a proxy requires setting some command-line flags. That's going to require quite a lot of plumbing, so it's unlikely to be something I tackle until a few people have requested it. (A possible workaround is sketched after this list.)
  3. I'm currently not sure how we'll expose browser errors to R.
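
On point 2, one possible interim workaround is to launch Chrome with a --proxy-server flag through chromote. This is an untested sketch that assumes read_html_live() picks up chromote's default Chromote object; the proxy address is a placeholder, and authenticated proxies would need extra work:

library(chromote)
library(rvest)

# Append a proxy flag to chromote's default Chrome flags; note that
# username/password credentials cannot be passed via this flag
set_chrome_args(c(default_chrome_args(), "--proxy-server=http://203.0.113.10:8080"))

# Make a freshly launched browser the default so read_html_live() uses it
set_default_chromote_object(Chromote$new())

page <- read_html_live("https://example.com")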