`ignore_links` not working.
vwochnik opened this issue · 4 comments
Hello. I am loving this library! But I have an issue.

I am collecting the URLs of already scraped pages in an array so that I can resume the process later, and I am passing that array to `ignore_links` to skip them.

However, it's not working. The URLs are collected via `page.url` and are fed into `ignore_links` later on as absolute URL strings. The page I am scraping references its content by relative links.
```ruby
linkregs = [] # regexes, working fine
ignore   = [] # read from file

Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem: this page should have been skipped
      puts "Error!!"
    end

    ignore.push(page.url.to_s)
  end
end

# save ignore to file
```
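For what it's worth, here is a minimal sketch of the mismatch being described, using only Ruby's standard library. It assumes the ignore rules end up being compared as plain strings against the link as it appears in the page, which is an assumption about Spidr's behaviour, not something confirmed in this thread. The example URLs are placeholders:

```ruby
require "uri"

# URL collected during a previous run via page.url:
seen = "http://example.com/articles/1"

# How the same page is referenced in the site's HTML:
link = "/articles/1"

# An exact string comparison (which is what a plain-string ignore rule
# amounts to) never matches the relative form:
seen == link                                          #=> false

# Resolving the link against the site's base URL first makes the two
# forms comparable:
URI.join("http://example.com/", link).to_s == seen    #=> true
```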
Fixed by not using `ignore_links` anymore and instead passing a single Proc as the `links` rule, so my own logic decides whether or not to crawl each link (a sketch of that approach is below). It should be mentioned somewhere, though, that once an `accept` rule is truthy, all `reject` rules are ignored.
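For anyone landing here, a minimal sketch of that workaround, assuming `links:` accepts Proc rules that receive the link and return true to crawl it. The regex, the `ignore.txt` file name, and whether the Proc receives a relative or an absolute link are placeholders/assumptions, so the URL normalization from the earlier sketch may still be needed:

```ruby
require "spidr"

linkregs = [%r{/articles/}]  # placeholder for the regexes that were "working fine"
ignore   = File.exist?("ignore.txt") ? File.readlines("ignore.txt", chomp: true) : []

# A single Proc as the links rule: it alone decides whether a link gets
# crawled, so the accept/reject interaction described above never applies.
link_rule = proc do |link|
  linkregs.any? { |re| re.match?(link) } && !ignore.include?(link)
end

Spidr.start_at("http://example.com", links: [link_rule]) do |spidr|
  spidr.every_page do |page|
    ignore.push(page.url.to_s)
  end
end

File.write("ignore.txt", ignore.join("\n"))
```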
@vwochnik do you think spidr should check the `reject` rules as well, even when one of the `accept` rules matches?
I don't know how this should work. In the edge case where both an accept and a reject rule apply, the result could reasonably be either true or false, depending on whether the user wants the accept rule or the reject rule to take priority. So either give the rules a fixed priority, or add a setting that lets the user choose whether accept or reject rules win.
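To make the two options concrete, here is an illustrative sketch (not Spidr's actual implementation) of the two priorities; `===` works here because Regexp, String, and Proc all implement it for matching:

```ruby
# Illustrative only, not Spidr's code: two ways to resolve a link that
# matches both an accept and a reject rule.

# Behaviour described above: a matching accept rule wins outright; reject
# rules are only consulted when no accept rule matches.
def accept_priority?(link, accepts, rejects)
  return true if accepts.any? { |rule| rule === link }
  rejects.none? { |rule| rule === link }
end

# Alternative: a matching reject rule always wins, even over an accept.
def reject_priority?(link, accepts, rejects)
  return false if rejects.any? { |rule| rule === link }
  accepts.any? { |rule| rule === link }
end
```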