Result data omits links that do not match the crawlUrlfilter pattern
geotheory opened this issue · 0 comments
geotheory commented
Thanks for this super useful package. I want to restrict the crawl to certain URL specifications, but capture all links on the crawled pages regardless of whether they match the filter. I can't get this to work in practice. An example:
Rcrawler(
  Website = "https://beta.companieshouse.gov.uk/company/02906991",
  no_cores = 4, no_conn = 4,
  NetworkData = TRUE, statslinks = TRUE,
  crawlUrlfilter = "02906991",
  saveOnDisk = FALSE
)
The page https://beta.companieshouse.gov.uk/company/02906991/officers
(which is crawled) includes links such as
https://beta.companieshouse.gov.uk/officers/...
but these pages are not included in the results. For example:
NetwIndex %>% str_subset('uk/officers')
character(0)
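To double-check that those links really are in the page markup (and not, say, generated client-side), I can extract them directly with Rcrawler's LinkExtractor. A quick sketch only; I'm assuming $InternalLinks is the right slot for the in-domain links:

library(Rcrawler)
library(stringr)

# Pull the links straight from the officers page, independent of the crawl
pg <- LinkExtractor(url = "https://beta.companieshouse.gov.uk/company/02906991/officers")

# Which of them point at /officers/ pages? (pg$InternalLinks is my assumption)
str_subset(pg$InternalLinks, "uk/officers")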
Shouldn't these links be captured, since I have provided no dataUrlfilter
argument? Or am I missing something here?
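In case it's useful, the workaround I'm considering is to keep the restricted crawl and then re-extract every outgoing link from the pages listed in NetwIndex. This is a rough sketch, not a tested solution, and it assumes the $InternalLinks slot of LinkExtractor's return value plus one extra request per crawled page:

library(Rcrawler)
library(stringr)

# After the restricted crawl above, NetwIndex holds the crawled URLs.
# Harvest all outgoing in-domain links from each of those pages.
all_links <- unique(unlist(lapply(NetwIndex, function(u) {
  pg <- LinkExtractor(url = u)  # assumption: $InternalLinks lists in-domain links
  pg$InternalLinks
})))

# Now the /officers/ links should be visible even though they were never crawled
str_subset(all_links, "uk/officers")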