Avoid big websites
athammad commented
Hi Salim,
thank you for this wonderful library! It's really powerful.
I am trying to understand whether there is a way to skip websites that are too big to be crawled.
In the following example I crawl a list of websites and extract the pages that contain an email address, but I obviously don't want the last website (The Sydney Morning Herald newspaper) to be crawled in full.
Is there a way to avoid this, or to make the function skip a site if the crawl is taking too long or the number of pages exceeds a given threshold?
Thank you,
Ahmed
library(Rcrawler)
library(stringr)   # for str_extract()
library(magrittr)  # for the %>% pipe

sample_web <- c("www.parentskills.com.au",
                "www.huggies.com.au",
                "www.babies2infinity.com.au",
                "www.smh.com.au")

results <- lapply(sample_web, function(x) {
  # Crawl one site; Rcrawler writes its INDEX data frame to the global environment
  Rcrawler(Website = x, MaxDepth = 1, saveOnDisk = FALSE,
           KeywordsFilter = c("mail", "contact", "about"),
           KeywordsAccuracy = 70)
  # Keep only well-formed URLs from the crawled index and drop the rest
  INDEX$Url %>%
    str_extract("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+") %>%
    na.omit()
})
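For reference, this is roughly the kind of guard I have in mind: wrapping each crawl in R.utils::withTimeout() so that a site is skipped once a time budget is exceeded. This is only a sketch, not something specific to Rcrawler's internals; crawl_with_timeout and the 300-second budget are placeholder names/values, and timeouts based on setTimeLimit() may not always interrupt long-running low-level code.

library(Rcrawler)
library(R.utils)

# Hypothetical helper: crawl one site but give up after max_seconds.
crawl_with_timeout <- function(site, max_seconds = 300) {
  tryCatch(
    withTimeout({
      Rcrawler(Website = site, MaxDepth = 1, saveOnDisk = FALSE,
               KeywordsFilter = c("mail", "contact", "about"),
               KeywordsAccuracy = 70)
      INDEX$Url  # return the crawled URLs if the crawl finished in time
    }, timeout = max_seconds, onTimeout = "error"),
    TimeoutException = function(e) {
      message("Skipping ", site, ": crawl exceeded ", max_seconds, " seconds")
      NULL  # a NULL entry marks a skipped site
    }
  )
}

results <- lapply(sample_web, crawl_with_timeout)

A page-count threshold would still be nicer, since it would stop the crawl early rather than throwing away work after the time limit, but I could not find such an option in the documentation.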