salimk/Rcrawler

Result data omits links not matching the crawlUrlfilter

geotheory opened this issue · 0 comments

Thanks for this super useful package. I want to restrict the crawl to certain URL specifications, but capture all links on the crawled pages regardless of whether they match the filter. I can't get this to work in practice. An example:

Rcrawler(
  Website = "https://beta.companieshouse.gov.uk/company/02906991",
  no_cores = 4, no_conn = 4,
  NetworkData = TRUE, statslinks = TRUE,
  crawlUrlfilter = '02906991',
  saveOnDisk = FALSE
)

Page https://beta.companieshouse.gov.uk/company/02906991/officers (which is crawled) includes links such as
https://beta.companieshouse.gov.uk/officers/..., but these pages are not included in the results, e.g.:

NetwIndex %>% str_subset('uk/officers')
character(0)

Shouldn't these links be captured, since I have provided no dataUrlfilter argument? Or am I missing something here?
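For what it's worth, the behaviour above is consistent with crawlUrlfilter gating link *collection* as well as traversal. A minimal offline sketch of that filtering logic (the specific link URLs below are hypothetical stand-ins, not real crawl output) shows which links a filter of '02906991' would keep and drop:

```r
# Hypothetical links found on a crawled page of company 02906991
links <- c(
  "https://beta.companieshouse.gov.uk/company/02906991/filing-history",
  "https://beta.companieshouse.gov.uk/officers/abc123/appointments",
  "https://beta.companieshouse.gov.uk/company/02906991/officers"
)

# crawlUrlfilter = '02906991' keeps only URLs containing the company number
kept    <- links[grepl("02906991", links, fixed = TRUE)]
dropped <- links[!grepl("02906991", links, fixed = TRUE)]

kept     # the two /company/02906991/... links
dropped  # the /officers/... link is filtered out, matching what NetwIndex shows
```

If that is indeed what Rcrawler does internally, a workaround would be to keep the crawl restricted and re-extract links from each fetched page yourself (e.g. with the package's LinkExtractor function) rather than relying on NetwIndex.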