postmodern/spidr

`ignore_links` not working.

vwochnik opened this issue · 4 comments

Hello. I am loving this library! But I have an issue.

I am collecting the URLs of already-scraped pages in an array, and to continue the process later I pass that array to ignore_links so those pages are skipped.

However, it's not working. The URLs are collected via page.url and later fed into ignore_links as absolute URL strings. The page I am scraping references its content via relative links.

linkregs = [] # regexes, working fine
ignore = [] # read from file
Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem
      puts "Error!!"
    end
    ignore.push(page.url.to_s)
  end
end
# save ignore to file

I fixed it by dropping ignore_links and instead passing a single Proc as the links rule, letting my own logic decide whether or not to crawl each link. It should be documented somewhere, though, that once an accept rule is truthy, all reject rules are ignored.
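For reference, the workaround can be sketched as a single Proc that folds the accept regexes and the ignore list into one decision. This is a minimal, self-contained sketch; link_regexps and the sample URLs are assumed names, not from the issue:

require 'set'

# Assumed example data: accept patterns plus the previously-seen URLs.
link_regexps = [%r{\Ahttp://example\.com/articles/}]
ignore       = Set.new(['http://example.com/articles/old'])

# A Proc passed as the `links:` rule makes the crawl decision itself,
# so spidr's accept/reject short-circuit never comes into play.
links_rule = proc do |url|
  s = url.to_s
  link_regexps.any? { |re| re =~ s } && !ignore.include?(s)
end

links_rule.call('http://example.com/articles/new') # => true  (crawled)
links_rule.call('http://example.com/articles/old') # => false (skipped)

Because the Proc alone returns the final verdict, the "reject rules are ignored" behavior described above cannot bypass the ignore list.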

@vwochnik do you think spidr should check the reject rules as well, even when one of the accept rules matches?

I'm not sure how this should work. In the edge case where both an accept and a reject rule match, the correct result can be either true or false, depending on whether the user wants the accept or the reject rule to take priority. So either give one kind of rule fixed priority, or add a setting that chooses which one wins.
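One way such a setting could look, as a hedged sketch only (reject_first: is a hypothetical option, not part of spidr's API, and `===` stands in for spidr's test_data):

# Sketch: let a flag decide which rule set wins when both match.
def accept?(data, accept, reject, reject_first: false)
  matched_accept = accept.any? { |rule| rule === data }
  matched_reject = reject.any? { |rule| rule === data }

  if reject_first
    # A matching reject rule always vetoes the URL.
    !matched_reject && (accept.empty? || matched_accept)
  else
    # Current behavior: a matching accept rule always wins.
    matched_accept || (accept.empty? && !matched_reject)
  end
end

accept = [%r{\Ahttp://example\.com/}]
reject = ['http://example.com/seen']

accept?('http://example.com/seen', accept, reject)                     # => true
accept?('http://example.com/seen', accept, reject, reject_first: true) # => false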

@vwochnik the relevant code:

spidr/lib/spidr/rules.rb

Lines 41 to 47 in 44fa099

def accept?(data)
  unless @accept.empty?
    @accept.any? { |rule| test_data(data,rule) }
  else
    !@reject.any? { |rule| test_data(data,rule) }
  end
end
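The short-circuit is easy to see in isolation. Below is a minimal re-implementation of the accept? logic above as a standalone method (accept/reject passed as arguments, and `===` standing in for test_data) to show that the reject list is only consulted when no accept rules exist:

# Same structure as Rules#accept?: reject rules are checked only
# in the else-branch, i.e. when @accept is empty.
def accept?(data, accept, reject)
  unless accept.empty?
    accept.any? { |rule| rule === data }
  else
    !reject.any? { |rule| rule === data }
  end
end

accept = [%r{\Ahttp://example\.com/}]
reject = ['http://example.com/seen']

accept?('http://example.com/seen', accept, reject) # => true  (reject list never consulted)
accept?('http://example.com/seen', [],     reject) # => false (reject works only without accept rules)

This matches the behavior reported in the issue: with any links: regex supplied, ignore_links never gets a chance to exclude a URL.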