felipecsl/wombat

400 Bad Request on some websites.

Opened this issue · 2 comments

Hello,
I noticed some strange behaviour in Wombat. I want to crawl two websites. At first I was using Typhoeus and regular expressions, but one of the websites kept returning 302. Then I found Wombat, and it works perfectly for that website. When I try Wombat on the other website, though, I get this error:

/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for "THE_WEBSITE_URL" -- unhandled response (Mechanize::ResponseCodeError)

The URL is correct: I tried it in the browser and it worked. Can somebody help me with this? Also, I don't have puts in front of Wombat.crawl do ..., because I saw that mentioned as a possible cause as well.
Thank you in advance, and sorry for my English!

Can you share the exact URL that is causing the problem?
Under the hood, Wombat uses Mechanize to request the page, so it could be either a Mechanize bug or a misconfiguration.

So here is the full response:

/Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for *the_url* -- unhandled response (Mechanize::ResponseCodeError)
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:976:in `response_redirect'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:300:in `fetch'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:47:in `parser_for'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:33:in `parse'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/crawler.rb:30:in `crawl'
        from websites/net-a-porter/link_crawler.rb:78:in `<main>'

And here is my code:

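require 'wombat'

class LinksCrawler
  include Wombat::Crawler
  # website_base_url and category_path stand in for the real values, which are withheld
  base_url website_base_url
  path category_path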

  links({:xpath => '//div[@class="description"]/a[contains(@href, "product")]/@href'}, :list)
end

link_crawler = LinksCrawler.new
link_crawler.crawl

I don't want to share the exact URL for security reasons, but I can tell you that it definitely works if you paste it into the browser.
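
One detail in the backtrace may help narrow this down: the failing fetch is called from response_redirect, which suggests the 400 came back from a redirect target rather than from the URL that was originally requested. Below is a minimal sketch for checking that without Wombat or Mechanize, using only Ruby's standard library. The method name trace_redirects and the example URL and header values are hypothetical, chosen for illustration; it follows redirects by hand and records the status code of every hop, so the hop that returns 400 becomes visible.

```ruby
require 'net/http'
require 'uri'

# Follow redirects by hand and record [status code, URL] for every hop.
# `headers` lets you send browser-like headers (for example User-Agent),
# `limit` caps the number of requests so a redirect loop cannot run forever.
def trace_redirects(url, headers = {}, limit = 5)
  hops = []
  uri = URI(url)
  limit.times do
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request(Net::HTTP::Get.new(uri, headers))
    end
    hops << [response.code, uri.to_s]
    # Stop at the first response that is not a redirect with a Location header.
    break unless response.is_a?(Net::HTTPRedirection) && response['location']
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, response['location'])
  end
  hops
end

# Hypothetical usage; replace the URL with the one that fails:
#   trace_redirects('http://example.com/category',
#                   'User-Agent' => 'Mozilla/5.0').each do |code, hop_url|
#     puts "#{code} #{hop_url}"
#   end
```

If the chain ends in a 400 only when no browser-like headers are sent, the server is probably rejecting the request based on its headers or cookies, which would point to configuration rather than a bug in Wombat or Mechanize. This sketch does not carry cookies between hops, so a site that sets a cookie on the redirect and expects it back may still behave differently here than in a browser.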