rbren/rss-parser

parseURL doesn't send Accept-Encoding and fails to handle alternate Content-Encoding (e.g. gzip)


I know of a feed hosted on a server that always returns gzipped data, even when given no Accept-Encoding header and even when given a header such as Accept-Encoding: identity, which should disable gzip compression:

% curl -s https://server/ | file -
/dev/stdin: gzip compressed data

% curl -s https://server/ -H 'Accept-Encoding: identity' | file -
/dev/stdin: gzip compressed data

% curl -s https://server/ -H 'Accept-Encoding: identity, *;q=0' | file -
/dev/stdin: gzip compressed data

This can be simulated using:

#!/usr/bin/env python3
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from io import BytesIO
import requests
from socketserver import ThreadingMixIn

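# Fetch a real RSS document once at startup to use as the test payload
# (requests is the only third-party dependency: pip install requests)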
response = requests.get("https://news.ycombinator.com/rss")
rss_content = response.content
rss_content_type = response.headers.get('Content-Type', 'text/html; charset=utf-8')

# Define request handler class
class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/normal':
            self.handle_normal()
        elif self.path == '/gzip':
            self.handle_gzip()
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'Not Found')

    def handle_normal(self):
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Length', str(len(rss_content)))
        self.end_headers()
        self.wfile.write(rss_content)

    def handle_gzip(self):
        # Gzip the content
        buffer = BytesIO()
        with gzip.GzipFile(fileobj=buffer, mode='wb') as gz_file:
            gz_file.write(rss_content)
        gzipped_content = buffer.getvalue()

        # Send gzipped response
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Encoding', 'gzip')
        self.send_header('Content-Length', str(len(gzipped_content)))
        self.end_headers()
        self.wfile.write(gzipped_content)

# Define a threading mixin to handle multiple requests simultaneously
class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    pass

# Define the server entry point
def run(server_class=ThreadedHTTPServer, handler_class=SimpleHandler, port=4444):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting server on port {port}...')
    httpd.serve_forever()

if __name__ == '__main__':
    run()

(Hacker News isn't the guilty server; I'm just using its RSS data as a proof of concept.)

As a demo, curl emits the raw gzipped data unless given --compressed (which causes an Accept-Encoding header to be sent and auto-decompresses the response):

% curl http://127.0.0.1:4444/gzip
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.

% curl -s http://127.0.0.1:4444/gzip | file -
/dev/stdin: gzip compressed data, last modified: Thu Oct 10 21:05:21 2024, max compression

% curl http://127.0.0.1:4444/gzip --compressed
<rss version="2.0"> [... SNIP ...]

But fetch() will auto-decompress the data:

> await (await fetch("http://172.18.0.1:4444/gzip")).text()
'<rss version="2.0"> [... SNIP ...]

Browsing to the URL in a browser will also auto-decompress the data.

So it seems user agents are expected to automatically decompress responses when the Content-Encoding response header is present; curl is a bit of an outlier in that it doesn't do so by default.
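To illustrate what that handling looks like in Node (this is just a sketch using the built-in http and zlib modules, not rss-parser's actual request code): check the Content-Encoding response header and run the body through a gunzip stream before treating it as text.

const http = require('http');
const zlib = require('zlib');

// Sketch only: fetch a URL and decode the body based on the
// Content-Encoding response header, the way fetch() does for gzip.
function getDecoded(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      const encoding = (res.headers['content-encoding'] || 'identity').toLowerCase();
      // Pipe through a gunzip stream when the server declares gzip.
      const body = encoding === 'gzip' ? res.pipe(zlib.createGunzip()) : res;
      const chunks = [];
      body.on('data', (chunk) => chunks.push(chunk));
      body.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
      body.on('error', reject);
    }).on('error', reject);
  });
}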

On the server side, it appears to be legal to return content using gzip (or any other Content-Encoding) when the client does not send an Accept-Encoding header. RFC 7231 (HTTP/1.1) §5.3.4 says:

A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.

It may not be legal, however, to return data that contradicts an Accept-Encoding header when one is given. For example, if the client sends Accept-Encoding: identity or Accept-Encoding: identity, *;q=0, then using gzip may not be legal, so the server that's giving me grief might not be compliant. Regardless, rss-parser doesn't send Accept-Encoding at all in my testing.
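For what it's worth, rss-parser's README documents a headers option on the Parser constructor, so it looks like the header can be supplied explicitly today (assuming those headers are passed through to parseURL()'s request), though that only helps with servers that honour it, which mine doesn't:

const Parser = require('rss-parser');

// Assumption: the `headers` option documented in rss-parser's README is
// forwarded to the HTTP request that parseURL() makes.
const parser = new Parser({
  headers: { 'Accept-Encoding': 'identity, *;q=0' },
});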

rss-parser's parseURL() breaks for me when a server sends gzip-compressed data:

> (await parser.parseURL("http://172.18.0.1:4444/normal")).title;
'Hacker News'

> (await parser.parseURL("http://172.18.0.1:4444/gzip")).title;
Uncaught Error: Non-whitespace before first tag.
Line: 0
Column: 1
Char:
    at error (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:658:10)
    at strictFail (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:684:7)
    at beginWhiteSpace (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:958:7)
    at SAXParser.write (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:1013:11)
    at exports.Parser.Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:327:31)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:5:59)
    at /tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:33:22
    at new Promise (<anonymous>)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:32:16)

I think it's trying to parse the gzip-compressed data as rss+xml and is failing.

I think rss-parser should either:

  • Send Accept-Encoding: identity or Accept-Encoding: identity, *;q=0 to discourage the server from sending compressed responses. This won't help me, given that I'm dealing with a server that always sends gzip-compressed data regardless of the header, but it would be the easiest way to be more compliant with what is legal: a server may use any encoding when none was requested.
  • Handle responses using Content-Encoding similarly to how fetch() does. That is, decompress compressed responses before parsing as rss+xml. This would cover the case in which servers do not honour an Accept-Encoding that discourages anything but an uncompressed response.

The latter change would work better for me, and may work better for others depending on the prevalence of servers that always compress responses regardless of a client's Accept-Encoding.
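In the meantime, a workaround along the lines of the second option seems viable: let fetch() do the transfer (it handles Content-Encoding) and hand the decoded text to parseString(). A rough sketch, assuming a Node version with a global fetch; parseFeed is just an illustrative helper, not part of rss-parser:

const Parser = require('rss-parser');
const parser = new Parser();

// Workaround sketch: fetch() decompresses the body based on Content-Encoding,
// so parseString() receives plain XML even from the always-gzip server.
async function parseFeed(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return parser.parseString(await res.text());
}

// e.g. against the simulated server above:
//   (await parseFeed('http://127.0.0.1:4444/gzip')).title  // => 'Hacker News'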