postmodern/spidr

expected absolute path component: sites/ftp.apache.org/

ethicalhack3r opened this issue · 4 comments

I get this error while spidering http://apache.org.

The URL that breaks the spider seems to be this one:
www.mirrorservice.org/sites/ftp.apache.org/ (it looks as though there's a domain, when in fact it's a path)
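
To illustrate the point about the domain-looking path, here is a minimal sketch of how Ruby's standard `uri` library parses a link that has no scheme. It only assumes that the link appears in the page exactly as quoted above; the variable names are illustrative, and this is not a confirmed reproduction of the crash.

```ruby
require 'uri'

# A link with no scheme: URI treats the whole string as a relative path,
# so "www.mirrorservice.org" is just the first path segment.
link = URI.parse('www.mirrorservice.org/sites/ftp.apache.org/')

puts link.host.inspect   # nil, no host is recognised
puts link.path           # www.mirrorservice.org/sites/ftp.apache.org/
puts link.relative?      # true

# With an explicit scheme, the first segment becomes the host instead.
full = URI.parse('http://www.mirrorservice.org/sites/ftp.apache.org/')

puts full.host           # www.mirrorservice.org
puts full.path           # /sites/ftp.apache.org/
```

A spider therefore has to resolve such a link against the URL of the page it was found on, which is where the path handling in the trace below comes into play.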

/usr/lib/ruby/1.8/uri/generic.rb:475:in `check_path': bad component(expected absolute path component): sites/ftp.apache.org/ (URI::InvalidComponentError)
from /usr/lib/ruby/1.8/uri/generic.rb:495:in `path='
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:537:in `to_absolute'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `map'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:587:in `visit_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:513:in `get_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:678:in `prepare_request'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:507:in `get_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:573:in `visit_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:244:in `run'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:226:in `start_at'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:197:in `site'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:124:in `initialize'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `new'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `site'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/spidr.rb:96:in `site'

I will do some further investigating and see if I can come up with a fix.

ryan

So this is somehow related to URI.expand_path: a relative path is coming out of the merged URI. Can you find the exact page this bug is triggered on? I want to reproduce it in a spec test.

I successfully spidered all of apache.org, so this was probably fixed by 7b57cf0. I will quickly release 0.3.0, which should fix this.

Oops, spoke too soon. This is a Ruby 1.8 specific bug.

OK, this is a difference in behavior between URI::FTP#path= in 1.8.7 and 1.9.2:

1.9.2

url = URI('ftp://foo.bar/baz')
url.path
# => "baz"
url.path = "bax"
url.path
# => "bax"
url.path = "/quix"
url.path
# => "/quix"
url
# => #<URI::FTP:0x000000033a77f0 URL:ftp://foo.bar/%2Fquix> 

1.8.7

url = URI('ftp://foo.bar/baz')
url.path
# => "baz"
url.path = "bax"
URI::InvalidComponentError: bad component(expected absolute path component): bax
from /home/hal/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/generic.rb:475:in `check_path'
from /home/hal/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/generic.rb:495:in `path='
from (irb):5
url.path = "/quix"
url.path
# => "/quix"
url
# => #<URI::FTP:0x7f2e1e98a5f0 URL:ftp://foo.bar/quix>
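
One way to tolerate this difference is to retry a rejected relative path with a leading slash, since the 1.8.7 session above accepts "/quix" and renders it as ftp://foo.bar/quix. The sketch below is only an illustration under that assumption: `assign_path` is a hypothetical helper, not part of Spidr or the `uri` library, and it is not necessarily the fix Spidr adopted. On Ruby versions that reject relative paths for other schemes (such as http), the same retry applies.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr or the uri library).
# Assigns new_path to url; if the URI implementation rejects a relative
# path with URI::InvalidComponentError, retries once with a leading slash.
def assign_path(url, new_path)
  begin
    url.path = new_path
  rescue URI::InvalidComponentError
    # An absolute path that still fails is a genuine error: re-raise it.
    raise if new_path.start_with?('/')
    url.path = "/#{new_path}"
  end
  url
end

# A relative path on an http URL is rejected by check_path, so the
# helper falls back to the absolute form.
http = assign_path(URI.parse('http://foo.bar/baz'), 'bax')
puts http

# An absolute path is assigned directly, without hitting the rescue.
abs = assign_path(URI.parse('http://foo.bar/baz'), '/quix')
puts abs
```

Whether a leading slash preserves the intended meaning for FTP URLs on every Ruby version is an assumption worth checking against the target interpreter before relying on this.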