parseURL doesn't send Accept-Encoding and fails to handle alternate Content-Encoding (e.g. gzip)
I know of a feed hosted on a server that always returns gzipped data, both when given no Accept-Encoding header and when given a header such as Accept-Encoding: identity, which should disable gzip compression:
% curl -s https://server/ | file -
/dev/stdin: gzip compressed data
% curl -s https://server/ -H 'Accept-Encoding: identity' | file -
/dev/stdin: gzip compressed data
% curl -s https://server/ -H 'Accept-Encoding: identity, *;q=0' | file -
/dev/stdin: gzip compressed data
This can be simulated using:
#!/usr/bin/env python3
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from io import BytesIO
import requests
from socketserver import ThreadingMixIn

response = requests.get("https://news.ycombinator.com/rss")
rss_content = response.content
rss_content_type = response.headers.get('Content-Type', 'text/html; charset=utf-8')


# Define request handler class
class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/normal':
            self.handle_normal()
        elif self.path == '/gzip':
            self.handle_gzip()
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'Not Found')

    def handle_normal(self):
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Length', str(len(rss_content)))
        self.end_headers()
        self.wfile.write(rss_content)

    def handle_gzip(self):
        # Gzip the content
        buffer = BytesIO()
        with gzip.GzipFile(fileobj=buffer, mode='wb') as gz_file:
            gz_file.write(rss_content)
        gzipped_content = buffer.getvalue()
        # Send gzipped response
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Encoding', 'gzip')
        self.send_header('Content-Length', str(len(gzipped_content)))
        self.end_headers()
        self.wfile.write(gzipped_content)


# Define a threading mixin to handle multiple requests simultaneously
class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    pass


# Define the server entry point
def run(server_class=ThreadedHTTPServer, handler_class=SimpleHandler, port=4444):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting server on port {port}...')
    httpd.serve_forever()


if __name__ == '__main__':
    run()
(Hacker News isn't the guilty server; I'm just using its RSS data as a proof of concept.)
As a demo, curl will emit the gzipped data unless given --compressed (which appears to cause Accept-Encoding to be sent and auto-decompresses the response):
% curl http://127.0.0.1:4444/gzip
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
% curl -s http://127.0.0.1:4444/gzip | file -
/dev/stdin: gzip compressed data, last modified: Thu Oct 10 21:05:21 2024, max compression
% curl http://127.0.0.1:4444/gzip --compressed
<rss version="2.0"> [... SNIP ...]
But fetch() will auto-decompress the data:
> await (await fetch("http://172.18.0.1:4444/gzip")).text()
'<rss version="2.0"> [... SNIP ...]
Browsing to the URL in a browser will also auto-decompress the data.
And so it seems as though user agents should automatically decompress responses if the Content-Encoding response header is sent. curl is a bit of an outlier, though, in that it doesn't do so by default.
As a server, it appears to be legal to return content using gzip (or any Content-Encoding) if the client does not give an Accept-Encoding header. RFC 7231 (HTTP/1.1), section 5.3.4, says:
A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.
It may not be legal to return data in contradiction with an Accept-Encoding header if one is given. For example, if the client gives Accept-Encoding: identity or Accept-Encoding: identity, *;q=0 then it may not be legal to use gzip. And so the server that's giving me grief might not be compliant. Regardless, rss-parser doesn't send Accept-Encoding in my testing.
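To make the Accept-Encoding reasoning above concrete, here is a rough sketch of how a server might decide whether a content-coding is acceptable, based on my reading of RFC 7231 section 5.3.4. The function name `isCodingAcceptable` is made up for illustration, and the parsing is deliberately simplified (it does not validate q-values strictly).

```javascript
// Hypothetical helper for illustration only; not part of any library.
// Returns true if `coding` (e.g. "gzip") is acceptable for a request whose
// Accept-Encoding header value is `acceptEncoding` (undefined if absent).
function isCodingAcceptable(acceptEncoding, coding) {
  // No header at all: the client states no preference, so any coding is allowed.
  if (acceptEncoding === undefined || acceptEncoding === null) return true;

  const wanted = coding.toLowerCase();
  let explicitQ; // q-value of an entry naming the coding directly
  let wildcardQ; // q-value of a "*" entry, if any

  for (const item of acceptEncoding.split(",")) {
    const [name, ...params] = item.split(";").map((s) => s.trim().toLowerCase());
    if (!name) continue;
    let q = 1; // default weight when no q parameter is given
    for (const p of params) {
      const m = /^q\s*=\s*([0-9.]+)$/.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    if (name === wanted) explicitQ = q;
    else if (name === "*") wildcardQ = q;
  }

  if (explicitQ !== undefined) return explicitQ > 0;
  if (wildcardQ !== undefined) return wildcardQ > 0;
  // Codings that are not listed are unacceptable, except identity, which
  // stays acceptable unless it is explicitly excluded.
  return wanted === "identity";
}

console.log(isCodingAcceptable(undefined, "gzip"));
console.log(isCodingAcceptable("identity", "gzip"));
console.log(isCodingAcceptable("identity, *;q=0", "gzip"));
```

Under this reading, gzip is only acceptable in the first of the three cases, which matches the three curl invocations at the top of this report.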
rss-parser's parseURL() breaks for me when a server sends gzip-compressed data:
> (await parser.parseURL("http://172.18.0.1:4444/normal")).title;
'Hacker News'
> (await parser.parseURL("http://172.18.0.1:4444/gzip")).title;
Uncaught Error: Non-whitespace before first tag.
Line: 0
Column: 1
Char:
at error (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:658:10)
at strictFail (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:684:7)
at beginWhiteSpace (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:958:7)
at SAXParser.write (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:1013:11)
at exports.Parser.Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:327:31)
at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:5:59)
at /tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:33:22
at new Promise (<anonymous>)
at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:32:16)
I think it's trying to parse the gzip-compressed data as rss+xml and is failing.
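In the meantime, one possible caller-side workaround is to let fetch() download (and transparently decompress) the feed and then hand the text to parseString(), which appears in the stack trace above. This is only a sketch: `parseFeedViaFetch` is a name I made up, and it assumes a runtime with a global fetch() (Node 18+ or a browser).

```javascript
// Hypothetical helper, not part of rss-parser. `parser` is expected to be an
// rss-parser instance (anything with a parseString(xml) method works).
// `fetchImpl` defaults to the global fetch(); it is a parameter so that a
// different implementation can be swapped in.
async function parseFeedViaFetch(parser, url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // fetch() has already undone any Content-Encoding by this point.
  const xml = await response.text();
  return parser.parseString(xml);
}
```

With the simulation server above running, calling this with an rss-parser instance and http://172.18.0.1:4444/gzip should return the parsed feed rather than failing in the XML parser, although I have only reasoned about that from the fetch() behaviour shown earlier.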
I think rss-parser should either:
- Send Accept-Encoding: identity or Accept-Encoding: identity, *;q=0 to discourage the server from sending compressed responses. This won't help me, given I'm dealing with a server that always sends gzip-compressed data regardless of the header, but it'd be the easiest way to be more compliant with what is legal, which is for servers to send responses using any encoding if none was requested.
- Handle responses using Content-Encoding similarly to how fetch() does. That is, decompress compressed responses before parsing as rss+xml. This would cover the case in which servers do not honour an Accept-Encoding that discourages anything but an uncompressed response.
The latter change would work better for me, and may work better for others depending on the prevalence of servers that always compress responses regardless of a client's Accept-Encoding.
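As a minimal sketch of what the second option could look like, the raw response body could be passed through a decoding step keyed on the Content-Encoding response header before it reaches the XML parser. This is not rss-parser's actual code; `decodeBody` is a hypothetical name, and only Node's built-in zlib module is used.

```javascript
const zlib = require("node:zlib");

// Hypothetical helper (not part of rss-parser): undo the Content-Encoding
// of a response body so the result can be handed to an XML parser.
// `body` is a Buffer; `contentEncoding` is the header value, if any.
function decodeBody(body, contentEncoding) {
  // Content-Encoding may list several codings in the order they were
  // applied, so they are undone in reverse order.
  const codings = (contentEncoding || "")
    .split(",")
    .map((c) => c.trim().toLowerCase())
    .filter((c) => c && c !== "identity");

  let decoded = body;
  for (const coding of codings.reverse()) {
    if (coding === "gzip" || coding === "x-gzip") {
      decoded = zlib.gunzipSync(decoded);
    } else if (coding === "deflate") {
      // Assumes zlib-wrapped deflate; some servers send raw deflate instead.
      decoded = zlib.inflateSync(decoded);
    } else if (coding === "br") {
      decoded = zlib.brotliDecompressSync(decoded);
    } else {
      throw new Error(`Unsupported Content-Encoding: ${coding}`);
    }
  }
  return decoded;
}

// Usage with a locally gzipped stand-in for a feed body
const xml = '<rss version="2.0"><channel><title>Example</title></channel></rss>';
const compressed = zlib.gzipSync(Buffer.from(xml, "utf8"));
console.log(decodeBody(compressed, "gzip").toString("utf8"));
```

A real implementation would more likely pipe the response stream through zlib's streaming decompressors than buffer the whole body, but the header-driven dispatch would be the same idea.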