salimk/Rcrawler

Avoid big websites


Hi Salim,

Thank you for this wonderful library! It's really powerful.

I am trying to understand whether there is a way to skip websites that are too big to be crawled.
In the following example I crawl a list of websites and extract the pages that contain an email address, but I obviously don't want the last website (The Sydney Morning Herald newspaper) to be crawled in full.
Is there a way to avoid this, or to make the function skip a site if the crawl takes too long or the number of pages is above a given threshold?

Thank you,
Ahmed

library(Rcrawler)
library(stringr)   # for str_extract()
library(magrittr)  # for %>%

sample_web <- c("www.parentskills.com.au",
                "www.huggies.com.au",
                "www.babies2infinity.com.au",
                "www.smh.com.au")

results <- lapply(sample_web, function(x) {
  Rcrawler(Website = x, MaxDepth = 1, saveOnDisk = FALSE,
           KeywordsFilter = c("mail", "contact", "about"),
           KeywordsAccuracy = 70)
  # Rcrawler writes INDEX to the global environment; keep only well-formed URLs
  urls <- INDEX$Url %>%
    str_extract("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
  na.omit(urls)
})
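
For what it's worth, the only workaround I could think of is wrapping each call in a timeout with R.utils::withTimeout (nothing built into Rcrawler, as far as I can tell), so an oversized site is skipped instead of blocking the whole loop. A rough sketch, assuming the R.utils package and an arbitrary 60-second limit per site:

library(Rcrawler)
library(R.utils)

crawl_with_timeout <- function(site, seconds = 60) {
  tryCatch(
    withTimeout({
      Rcrawler(Website = site, MaxDepth = 1, saveOnDisk = FALSE,
               KeywordsFilter = c("mail", "contact", "about"),
               KeywordsAccuracy = 70)
      INDEX$Url  # INDEX is written to the global environment by Rcrawler
    }, timeout = seconds),
    TimeoutException = function(e) {
      # Skip this site and move on to the next one in the list
      message("Skipped ", site, ": crawl exceeded ", seconds, " seconds")
      NULL
    }
  )
}

results <- lapply(sample_web, crawl_with_timeout)

This feels fragile (withTimeout may not reliably interrupt long downloads running in compiled code), so a built-in option such as a maximum-pages threshold would be much cleaner.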