salimk/Rcrawler

Rcrawler skips some links for no reason

yusuzech opened this issue · 3 comments

I used Rcrawler to scrape this site: "http://www.thegreenbook.com/". I set it to crawl only one level, extracting links with a CSS pattern and a URL regular-expression filter, but it skips some links for no apparent reason.

I used rvest and stringr to double-check and found that 7 links were omitted.

Below is the code I used to double-check the results.

library(Rcrawler)
library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.thegreenbook.com/"
css <- "#classificationIndex a"
filter_string <- "products/search"

# using Rcrawler ------------------------------------------------------
Rcrawler(Website = url, 
         no_cores = 4, 
         no_conn = 4,
         ExtractCSSPat = c(css),
         MaxDepth = 1,
         urlregexfilter = c(filter_string))


# number of level-1 URLs collected by Rcrawler (INDEX is created in the global environment)
length_Rcrawler <- nrow(INDEX[INDEX$Level == 1, ])

# using rvest  --------------------------------------------------------
# get hrefs using the same CSS selector
hrefs <- html_session(url) %>% 
    html_nodes(css) %>% 
    html_attr("href")

hrefs_filtered <- hrefs[str_detect(hrefs, filter_string)] # apply the same filter as `urlregexfilter`

length_rvest <- length(hrefs_filtered)

The numbers of links retrieved by Rcrawler and rvest are:

> length_Rcrawler
[1] 28
> length_rvest
[1] 35

Below are the links that Rcrawler omitted:

> setdiff(hrefs_filtered,INDEX[INDEX$Level==1,]$Url)
[1] "http://www.thegreenbook.com/products/search/electrical-guides/"               
[2] "http://www.thegreenbook.com/products/search/pharmaceutical-guides/"           
[3] "http://www.thegreenbook.com/products/search/office-equipment-supplies-guides/"
[4] "http://www.thegreenbook.com/products/search/garment-textile-guides/"          
[5] "http://www.thegreenbook.com/products/search/pregnancy-parenting-guides/"      
[6] "http://www.thegreenbook.com/products/search/beauty-care-guides"               
[7] "http://www.thegreenbook.com/products/search/golden-year-guides/"   

I don't know what could possibly cause this issue, as the response codes are all 200 and the Stats column shows every page as finished. Also, ExtractCSSPat and urlregexfilter are correct, since I double-checked them with rvest. So my conclusion is that these links are simply ignored.
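
For reference, here is a minimal sketch of how I checked those two things; the INDEX column names "Http Resp" and "Stats" are assumed from the Rcrawler documentation and may differ between versions:

# sketch: summarise the HTTP codes and crawl status recorded in INDEX
table(INDEX[["Http Resp"]])   # HTTP status code for each crawled page (all 200 here)
table(INDEX[["Stats"]])       # crawl status for each page (all "finished" here)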

Did I do something wrong while using Rcrawler, or is it a bug? Any help is appreciated, thanks!

Hello, we are investigating the issue. So far we are sure that all URLs can be crawled if you use fewer than 3 concurrent requests (no_cores = 2, no_conn = 2). In fact, concerning this website http://www.thegreenbook.com, we have noticed that when we send more than 3 requests at once, one request is returned with a 403 error (and will not be collected); this may be due to a server configuration that guards against bots and overload.
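
For example, a sketch of the same crawl with the reduced concurrency settings suggested above (the other parameters are copied from the original report):

# same crawl as in the report, but limited to 2 cores / 2 connections,
# which keeps the server from returning the intermittent 403 responses
Rcrawler(Website = "http://www.thegreenbook.com/",
         no_cores = 2,
         no_conn = 2,
         ExtractCSSPat = c("#classificationIndex a"),
         MaxDepth = 1,
         urlregexfilter = c("products/search"))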


Following this issue, we have added a warning message that appears if many sequential 403 errors occur.

Update: version 0.1.9 is released, enjoy!
Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s