unable to ignore links
vanegomez opened this issue · 4 comments
vanegomez commented
Cool gem!
I'm trying to ignore /partners and everything under it on my site (e.g. www.mysite.com/partners/resellers), but the spider is still visiting those links.
require 'spidr'
require 'colorize'  # provides String#red used below

root = args[:url]  # e.g. a Rake task argument

# Map each destination URL to the pages that link to it.
url_map = Hash.new { |hash, key| hash[key] = [] }

spider = Spidr.site(root, ignore_links_like: [%r{^/partners/}]) do |spider|
  spider.every_url { |url| puts url }
  spider.every_failed_url { |url| puts "Failed url #{url}" }
  spider.every_link do |origin, dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"
  url_map[url].each { |page| puts "  #{page}".red }
end
postmodern commented
In spidr, links are the String version of the full URL. You appear to want to ignore links based on the path. Maybe something like:
spider.ignore_urls_like { |url| url.path.start_with?('/partners/') }
I should probably add ignore_paths_like to cover that use-case.
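As a minimal sketch of how that block filter could slot into the snippet above (assuming ignore_urls_like yields each candidate URL as a URI, so url.path is available; the site URL is illustrative):

require 'spidr'

Spidr.site('http://www.mysite.com/') do |spider|
  # Skip any URL whose path falls under /partners/, regardless of
  # how the full URL string begins (scheme, host, etc.).
  spider.ignore_urls_like { |url| url.path.start_with?('/partners/') }

  spider.every_url { |url| puts url }
end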
vanegomez commented
@postmodern Thank you so much for answering.
Is it possible to follow external links and check if they are broken?
postmodern commented
You would have to explicitly call spider.get_page and check the responses, since the spider won't automatically follow off-site links.
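A rough sketch of that approach, assuming every_link also fires for off-site destinations the spider discovers but won't crawl, that Spidr::Agent#get_page returns a Spidr::Page (or nil when the request fails), and that Page#is_ok? checks for a 200; the www.mysite.com host check is illustrative:

require 'spidr'
require 'set'

external = Set.new

spider = Spidr.site('http://www.mysite.com/') do |spider|
  spider.every_link do |origin, dest|
    # Remember off-site destinations for checking after the crawl.
    external << dest if dest.host && dest.host != 'www.mysite.com'
  end
end

external.each do |url|
  page = spider.get_page(url)  # explicit GET; the spider never crawled these

  if page.nil? || !page.is_ok?
    puts "Broken external link: #{url}"
  end
end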
vanegomez commented
thank you!