salimk/Rcrawler

Rcrawler skips some links for no reason

yusuzech opened this issue · 3 comments

I used Rcrawler to scrape this site: "http://www.thegreenbook.com/". I set it to crawl only one level, extracting links with a CSS pattern and a URL regular-expression filter, but it skips some links for no apparent reason.

I used rvest and stringr to double-check and found that 7 links were omitted.

Below is the code I used to double-check the results.

library(Rcrawler)
library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.thegreenbook.com/"
css <- "#classificationIndex a"
filter_string <- "products/search"

# using Rcrawler ------------------------------------------------------
Rcrawler(Website = url, 
         no_cores = 4, 
         no_conn = 4,
         ExtractCSSPat = c(css),
         MaxDepth = 1,
         urlregexfilter = c(filter_string))


# number of level-1 URLs collected by Rcrawler (INDEX is created in the global environment)
length_Rcrawler <- nrow(INDEX[INDEX$Level == 1, ])

# using rvest  --------------------------------------------------------
# get hrefs using the same CSS selector
hrefs <- html_session(url) %>% 
    html_nodes(css) %>% 
    html_attr("href")

hrefs_filtered <- hrefs[str_detect(hrefs, filter_string)] # apply the same filter as `urlregexfilter`

length_rvest <- length(hrefs_filtered)

The numbers of links retrieved by Rcrawler and rvest are:

> length_Rcrawler
[1] 28
> length_rvest
[1] 35

Below are the links that Rcrawler omitted:

> setdiff(hrefs_filtered,INDEX[INDEX$Level==1,]$Url)
[1] "http://www.thegreenbook.com/products/search/electrical-guides/"               
[2] "http://www.thegreenbook.com/products/search/pharmaceutical-guides/"           
[3] "http://www.thegreenbook.com/products/search/office-equipment-supplies-guides/"
[4] "http://www.thegreenbook.com/products/search/garment-textile-guides/"          
[5] "http://www.thegreenbook.com/products/search/pregnancy-parenting-guides/"      
[6] "http://www.thegreenbook.com/products/search/beauty-care-guides"               
[7] "http://www.thegreenbook.com/products/search/golden-year-guides/"   

I don't know what could possibly cause this issue, as the response codes are all 200 and the Stats column shows every page as finished. Also, ExtractCSSPat and urlregexfilter are correct, since I double-checked them with rvest. So my conclusion is that these links are simply ignored.
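
For reference, here is a minimal sketch of how I checked those two things; the INDEX column names "Http Resp" and "Stats" are assumed from the Rcrawler documentation and may differ between versions:

# sketch: summarise the HTTP codes and crawl status recorded in INDEX
table(INDEX[["Http Resp"]])   # HTTP status code for each crawled page (all 200 here)
table(INDEX[["Stats"]])       # crawl status for each page (all "finished" here)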

Did I do something wrong while using Rcrawler, or is it a bug? Any help is appreciated, thanks!

Hello, we are investigating the issue. So far we are sure that all URLs can be crawled if you use fewer than 3 concurrent requests (no_cores = 2, no_conn = 2). In fact, concerning this website http://www.thegreenbook.com, we have noticed that when we send more than 3 requests at once, one request is returned with a 403 error (and will not be collected); this may be due to a server configuration that guards against bots and overload.
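
For example, a sketch of the same crawl with the reduced concurrency settings suggested above (the other parameters are copied from the original report):

# same crawl as in the report, but limited to 2 cores / 2 connections,
# which keeps the server from returning the intermittent 403 responses
Rcrawler(Website = "http://www.thegreenbook.com/",
         no_cores = 2,
         no_conn = 2,
         ExtractCSSPat = c("#classificationIndex a"),
         MaxDepth = 1,
         urlregexfilter = c("products/search"))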


Following this issue, we have added a warning message that appears if many sequential 403 errors occur.

Update: version 0.1.9 is released, enjoy!
Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s