fetch_titles not following 301
audy opened this issue · 2 comments
audy commented
I'm using the following code to fetch titles on a site:
def fetch_titles site
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end

fetch_titles('http://site.tld').each do |site|
  p site
end
I'm getting a lot of "301 Moved Permanently" for page.title because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.
Is there any way to tell spider to append a / to the URI or follow 301s automatically?
postmodern commented
every_html_page will match 301s, since they return the content-type text/html. You should check the response status as well.
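To illustrate why the content type alone is not enough, here is a minimal sketch that does not use Spidr at all. `FakePage` is a hypothetical stand-in whose `html?` and `ok?` methods only approximate what the corresponding page checks are assumed to do (content type is text/html, status code is 200), and the URLs and titles are sample data.

```ruby
# Hypothetical stand-in for a crawled page; this is NOT Spidr's Page class.
FakePage = Struct.new(:url, :code, :content_type, :title) do
  # Assumed behaviour: true when the Content-Type is text/html.
  def html?
    content_type.start_with?('text/html')
  end

  # Assumed behaviour: true only for a 200 response.
  def ok?
    code == 200
  end
end

# Sample data: a redirect served with an HTML body, the real page, and an image.
pages = [
  FakePage.new('http://site.tld/~page',    301, 'text/html', '301 Moved Permanently'),
  FakePage.new('http://site.tld/~page/',   200, 'text/html; charset=utf-8', 'Page'),
  FakePage.new('http://site.tld/logo.png', 200, 'image/png', nil)
]

# Filtering on content type alone keeps the redirect's placeholder title.
html_only = pages.select(&:html?).map(&:title)

# Checking the status as well leaves only the real page.
html_and_ok = pages.select { |page| page.html? && page.ok? }.map(&:title)

p html_only
p html_and_ok
```

The redirect response carries its own small HTML body, so its title is the "301 Moved Permanently" text until the status is checked too.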
postmodern commented
Try instead:
agent.every_ok_page do |page|
  if page.html?
    enum.yield page.title
  end
end
or
agent.every_html_page do |page|
  if page.is_ok?
    enum.yield page.title
  end
end