postmodern/spidr

fetch_titles not following 301

audy opened this issue · 2 comments

audy commented

I'm using the following code to fetch the titles of every page on a site:

def fetch_titles(site)
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end


fetch_titles('http://site.tld').each do |title|
  p title
end

I'm getting a lot of "301 Moved Permanently" for page.title because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.

Is there any way to tell spider to append a / to the URI or follow 301s automatically?

every_html_page will match 301s, since they return the content-type text/html. You should check the response status as well.

Try instead:

spider.every_ok_page do |page|
  if page.html?
    enum.yield page.title
  end
end

or

spider.every_html_page do |page|
  if page.is_ok?
    enum.yield page.title
  end
end
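
To see why both checks matter, here is a minimal sketch that runs without Spidr or network access. FakePage is a hypothetical stand-in, not Spidr's page class; it only mimics the html? and is_ok? predicates used above, and it assumes is_ok? means a 200 status. The URLs and titles are illustrative.

```ruby
# Hypothetical stand-in for a crawled page (NOT Spidr's Page class).
FakePage = Struct.new(:url, :code, :content_type, :title) do
  # True when the Content-Type says HTML, whatever the status code is.
  def html?
    content_type.to_s.include?('text/html')
  end

  # Assumption for this sketch: "ok" means a 200 response.
  def is_ok?
    code == 200
  end
end

# Illustrative responses: a redirect served as HTML, the real page, an image.
PAGES = [
  FakePage.new('http://site.tld/~page',    301, 'text/html', '301 Moved Permanently'),
  FakePage.new('http://site.tld/~page/',   200, 'text/html', 'Page'),
  FakePage.new('http://site.tld/logo.png', 200, 'image/png', nil)
]

# Filtering on content type alone lets the redirect body through.
HTML_ONLY_TITLES = PAGES.select(&:html?).map(&:title)

# Requiring both HTML and an OK status drops the redirect.
OK_HTML_TITLES = PAGES.select { |page| page.is_ok? && page.html? }.map(&:title)

p HTML_ONLY_TITLES
p OK_HTML_TITLES
```

The first list still contains the redirect's title because the 301 response is served as text/html; the second list keeps only the page that returned successfully.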