postmodern/spidr

Skip processing of pages

darkcode85 opened this issue · 1 comments

The documentation says that it is possible to skip processing some pages, but I cannot find how to do it. I have tried ignore_links and ignore_pages, but nothing seems to work, e.g.:

spider = Spidr.site('.....', ignore_links: [%r{^/blog/}]) do |spider|
  spider.every_html_page do |page|
    # here I still get pages with the /blog url
  end
end

How can I ignore some pages based on the URL?

ignore_links/ignore_links_like matches against the full link (the String form of the URL), so your Regexp is anchored to the beginning of the whole URL, not to the beginning of the path. Probably something like spider.ignore_urls_like { |url| url.path.start_with?('/blog/') }. Hope that helps.
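
To illustrate why the anchored pattern never matches, here is a small sketch that uses only Ruby's standard `uri` library rather than Spidr itself. The example.com URL and the `skip_blog` name are made up for illustration; the point is only the difference between testing the full URL string and testing the path.

```ruby
require 'uri'

# Hypothetical URL, standing in for a link the spider would encounter.
url = URI('https://example.com/blog/first-post')

# An anchored pattern tested against the full URL string fails,
# because the string starts with "https://", not "/blog/".
full_match = url.to_s.match?(%r{^/blog/})

# Testing the path component alone behaves as intended.
path_match = url.path.start_with?('/blog/')

# A pattern that accounts for the scheme and host also matches the full string.
full_url_match = url.to_s.match?(%r{^https?://[^/]+/blog/})

# Hypothetical helper: a predicate on a URI object, shaped like the block
# suggested above for ignore_urls_like.
skip_blog = ->(u) { u.path.start_with?('/blog/') }

puts full_match      # false
puts path_match      # true
puts full_url_match  # true
```

Assuming ignore_urls_like accepts a block as in the suggestion above, the predicate could presumably be passed as spider.ignore_urls_like(&skip_blog).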