postmodern/spidr

Is there a way to set Accept-Encoding headers?

robfuller opened this issue · 9 comments

I have a site to spider, https://www.logility.com, but it's failing on:
ruby/2.2.0/net/http/response.rb:377:in `inflate': incorrect header check (Zlib::DataError)

If I set Accept-Encoding: plain it should apparently work (with that header it works via open-uri, anyway).

What I did for now is modify agent.prepare_request to look at the host_headers passed in. If an entry is a Hash, its keys and values are set as request headers; if it's not, the current functionality is left as is.

unless @host_headers.empty?
    @host_headers.each do |name,header|
      if host.match(name)
        if header.is_a?(Hash)
          # Hash entry: set each header for the matching host
          header.each do |header_name, header_value|
            headers[header_name] = header_value
          end
        else
          # Existing behavior: the value is the Host header
          headers['Host'] = header
        end
      end
    end
end

I think the polite thing for me to do is propose this as a pull request? I'm not sure how to do that (it took me a few minutes to figure out how to create and test a local gem to make sure this worked), so let me know if you would like me to figure out how to submit this as a change.

Going to need a little more info to isolate the root cause. I'm not sure whether it's the site or Spidr::Agent which is not following HTTP/1.1.

I've run my code on ~85 sites and only had the issue on https://www.logility.com/

Patching Agent.prepare_request with the above code, and then passing in "Accept-Encoding: plain", did work.

So my guess is that the server is doing something wrong with its deflate/gzip, so telling it not to encode works. Since I can't change the server, I needed to address it on the scanning side.

I tried a number of them and they all triggered it (I can't say for sure it's every URL, but I didn't come across any that didn't).

On Wed, Nov 18, 2015 at 8:18 PM, Postmodern notifications@github.com
wrote:

Is it every URL on logility.com or a specific URL that triggers it?


Reply to this email directly or view it on GitHub
#43 (comment).

Probably is related to ruby's default headers:

Accept-Encoding: gzip;q=1.0,deflate;q=0.6,identity;q=0.3
Accept: */*
User-Agent: Ruby

Right, by default ruby accepts gzip and deflate. It's the decompression that is failing.

Setting the accepted encoding to only plain means no compression.

The problem was that there is no way in spidr to change that header (no way to override the ruby default). The code I shared exposes the ability to set that request header manually.
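
For reference, here is a minimal sketch of that behavior using only Net::HTTP from the standard library, with no Spidr involved. The URL is a placeholder and no request is actually sent. It assumes the standard `identity` token as the "no compression" value ("plain" is not a registered content coding, though any explicit value has the same effect on Net::HTTP's side).

```ruby
require 'net/http'
require 'uri'

# Placeholder URL; the request is only built, never sent.
uri = URI('https://www.example.com/')

# With no explicit header, Net::HTTP adds its own Accept-Encoding
# and marks the request so the response body is inflated automatically.
default_request = Net::HTTP::Get.new(uri)
puts default_request['Accept-Encoding']
puts default_request.decode_content

# Setting the header explicitly switches automatic decoding off,
# so Net::HTTP no longer tries to inflate the response body.
plain_request = Net::HTTP::Get.new(uri)
plain_request['Accept-Encoding'] = 'identity'
puts plain_request['Accept-Encoding']
puts plain_request.decode_content
```

Once the header is set by the caller, Net::HTTP leaves the body alone, which would explain why the inflate error goes away.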

I wonder if this is a bug in ruby. I may be open to adding another callback to allow setting custom headers, although I don't want to change too much just to work around a bug that may be in Ruby or www.logility.com.

In searching for that error, I found it has occurred with other libraries as well, so it's sort of a bug in ruby I guess. That said, exposing a request header makes sense, as there could be a lot of reasons you want to force a specific set.

Spidr 0.6.0 added Agent#default_headers which is just a Hash of headers that gets added to every request. Setting agent.default_headers['Accept-Encoding'] = 'plain' or passing in default_headers: {'Accept-Encoding' => 'plain'} would fix this.
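
As a rough illustration of the idea (a Hash of headers applied to every request), here is a standalone sketch built on Net::HTTP. It is not Spidr's implementation: `DEFAULT_HEADERS` and `build_request` are hypothetical names used only for this example, and `identity` is assumed as the "no compression" value.

```ruby
require 'net/http'
require 'uri'

# Hypothetical stand-in for an agent-wide header Hash (not Spidr's internals).
DEFAULT_HEADERS = { 'Accept-Encoding' => 'identity' }.freeze

# Hypothetical helper: builds a GET request and applies the defaults,
# letting per-request headers override them.
def build_request(url, headers = {})
  request = Net::HTTP::Get.new(URI(url))
  DEFAULT_HEADERS.merge(headers).each { |name, value| request[name] = value }
  request
end

# Placeholder URL; the request is only built, never sent.
request = build_request('https://www.example.com/', 'User-Agent' => 'my-spider')
puts request['Accept-Encoding']
puts request['User-Agent']
```

Merging the per-request headers last means a single call can still opt back into compression without touching the shared defaults.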