crawlUrlfilter
mustaszewski opened this issue · 0 comments
Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.
From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, crawlUrlfilter does exactly what I am looking for.
When the pattern passed to crawlUrlfilter contains only one level of the URL, as in the following code:
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")
I get the desired results, i.e. only those URLs that match the pattern "article", e.g.
https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example
However, when I want to filter URLs based on a pattern spanning two levels of the URL, such as:
https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup
the following code does not find any matches:
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")
Is this a bug, or am I getting something wrong?
Judging by the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", passing a pattern that contains several "/" should be no problem at all.
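For what it's worth, the pattern itself seems fine as a base-R regular expression: a quick check with grepl() (which is roughly what one would expect a URL filter to reduce to internally; I have not verified how Rcrawler applies it) matches the two-level example URLs above as intended:

```r
# Minimal check: does the pattern passed to crawlUrlfilter match the
# example URLs under plain base-R regex matching?
urls <- c(
  "https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers",
  "https://www.somewebsite.org/article/news/review-of-meetup",
  "https://www.somewebsite.org/article/sample-article-217"
)

# The same string used as the crawlUrlfilter argument above
grepl("/article/news", urls)
#> TRUE TRUE FALSE
```

So the first two URLs match and the single-level one does not, which is exactly the behavior I would expect from the crawler.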