fetch_titles not following 301

Question

audy opened this issue 10 years ago · 2 comments

I'm using the following code to fetch the title of every page on a site:

require 'spidr'

# Lazily yields the title of every HTML page found while crawling the site.
def fetch_titles(site)
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end


fetch_titles('http://site.tld').each do |title|
  p title
end

Many of the values I get back from page.title are "301 Moved Permanently", because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.

Is there any way to tell Spidr to append a / to the URI, or to follow 301s automatically?

Answer 1 · 2014-12-12T21:04:16.000Z

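every_html_page will match 301 responses as well, since they are served with the content-type text/html. The titles you are seeing come from the body of the redirect response itself. You should check the response status as well.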
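
To make the point concrete, here is a minimal sketch that does not use Spidr at all. FakePage, PAGES, html_titles and ok_html_titles are hypothetical names invented for this illustration; FakePage only mirrors the html? and is_ok? predicate names mentioned in this thread, and its data is made up. It shows that filtering on content type alone keeps the redirect, while also checking the status drops it.

```ruby
# Stand-in for a crawled page. This is NOT Spidr's page class; it is a
# hypothetical struct that mirrors the predicate names used in this thread.
FakePage = Struct.new(:code, :content_type, :title) do
  # True when the response was served as HTML, whatever its status code.
  def html?
    content_type.include?('text/html')
  end

  # True only for a 200 response.
  def is_ok?
    code == 200
  end
end

# Made-up sample responses: the 301 is served as text/html too.
PAGES = [
  FakePage.new(200, 'text/html', 'Home'),
  FakePage.new(301, 'text/html', '301 Moved Permanently'),
  FakePage.new(200, 'text/html', 'Page')
].freeze

# Content-type filter only: the redirect page slips through.
def html_titles(pages)
  pages.select(&:html?).map(&:title)
end

# Content-type filter plus status check: the redirect page is dropped.
def ok_html_titles(pages)
  pages.select { |page| page.html? && page.is_ok? }.map(&:title)
end

p html_titles(PAGES)    # => ["Home", "301 Moved Permanently", "Page"]
p ok_html_titles(PAGES) # => ["Home", "Page"]
```

The same two-condition check is what needs to happen inside the Spidr callback; this sketch says nothing about whether Spidr goes on to visit the redirect target.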

Answer 2 · 2022-01-29T02:37:34.000Z

Try this instead, where agent is the object that Spidr.site yields (spider in your code):

agent.every_ok_page do |page|
  if page.html?
    enum.yield page.title
  end
end

or

agent.every_html_page do |page|
  if page.is_ok?
    enum.yield page.title
  end
end