tokers/zstd-nginx-module

Using a Dictionary is not Standard-Compliant

felixhandte opened this issue · 6 comments

Hi! Thanks so much for writing this module! We're very excited to see the community build tooling to help drive adoption of Zstd as an encoding for HTTP.

Are you running this module on a server somewhere that we can point to as a reference server for clients to test against?

I want to point out an issue with the zstd_dict_file directive you've added. Certainly dictionaries are a very exciting technology (that I spend most of my time working on and around). And we're actively working to figure out how to bring dictionary-based compression to the web.

However, while zstd the reference implementation has extensive support for dictionaries, zstd the HTTP content-coding, as specified by RFC8478, does not. Now technically... reading the RFC very closely, it seems like we may not have explicitly disallowed the use of a dictionary under the zstd encoding. But because the content-coding registration only specifies a means to signal the use of the zstd format, and does not additionally specify any mechanism for advertising/negotiating/synchronizing the use of a specific dictionary between client and server (other than the dictionary id that is optionally written into the frame header), the content encoding as specified is not sufficient for two implementations, lacking any external synchronization, to be able to use dictionaries together.
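To make the "dictionary id that is optionally written into the frame header" concrete, here is a minimal sketch that reads that field from the first bytes of a frame. It follows the frame header layout in the Zstandard format specification. The function name `zstd_dictionary_id` is made up for this illustration, and skippable frames are not handled.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, little-endian on the wire


def zstd_dictionary_id(frame: bytes):
    """Return the Dictionary_ID declared in a zstd frame header, or None if absent."""
    if len(frame) < 5 or frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = frame[4]                      # Frame_Header_Descriptor byte
    single_segment = (descriptor >> 5) & 1     # bit 5: Single_Segment_flag
    did_size = (0, 1, 2, 4)[descriptor & 0b11]  # bits 1-0: Dictionary_ID_flag
    if did_size == 0:
        return None
    # Window_Descriptor (1 byte) is present only when Single_Segment_flag is 0.
    offset = 5 + (0 if single_segment else 1)
    field = frame[offset:offset + did_size]
    if len(field) < did_size:
        raise ValueError("truncated frame header")
    return int.from_bytes(field, "little")
```

A receiver can learn from this field that a dictionary is expected, but nothing in the header tells it where to obtain that dictionary, which is the gap described above.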

Which is to say, use of dictionary-based compression with just the Content-Encoding: zstd header is dangerous and almost certainly wrong. So in order to preserve interoperability with other implementations of the zstd content-coding, I would suggest making at least one of these changes:

  • Add warning text to the description of the zstd_dict_file directive.
  • Suggest or require that use of zstd_dict_file also require changing the content-coding identifier this extension accepts/produces (e.g., "x-zstd-dict-<DICTID>").

Thanks!

> Are you running this module on a server somewhere that we can point to as a reference server for clients to test against?

Not yet. I have only run some tests on my own machine. I'm going to add some test cases and enable CI/CD for this module. Then I will deploy an experimental Nginx server (as a proxy server, for example) built with this module and publish a domain/IP :).

> Which is to say, use of dictionary-based compression with just the Content-Encoding: zstd header is dangerous and almost certainly wrong. So in order to preserve interoperability with other implementations of the zstd content-coding, I would suggest making at least one of these changes:
>
>   • Add warning text to the description of the zstd_dict_file directive.

Fair enough. I will make a Pull Request to modify the README.md.

>   • Suggest or require that use of zstd_dict_file also require changing the content-coding identifier this extension accepts/produces (e.g., "x-zstd-dict-<DICTID>").

IMHO, I don't see any document indicating that Content-Encoding can be used this way. Is this also covered by the HTTP/1.1 specification?

@fengidri What do you think of this?

The first thing to solve is how Accept-Encoding is brought to the server. For web services, it always starts with the client. The key is to negotiate this.
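As a rough sketch of what client-driven negotiation could look like, the server below uses a dictionary only when the request's Accept-Encoding explicitly lists a dictionary-specific token, and otherwise falls back to plain zstd. The token format `x-zstd-dict-<DICTID>` is the illustrative one from this thread, not a registered content coding, and the function names are hypothetical. The `*` wildcard is ignored for brevity.

```python
def parse_accept_encoding(header: str) -> dict:
    """Map each coding token in an Accept-Encoding value to its q-value."""
    codings = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        token, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # a coding listed without a q parameter has weight 1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        codings[token.lower()] = q
    return codings


def choose_coding(accept_encoding: str, dict_id=None):
    """Pick a Content-Encoding token, or None to send the body uncompressed."""
    accepted = parse_accept_encoding(accept_encoding)
    if dict_id is not None:
        # Hypothetical, unregistered token that names the dictionary explicitly.
        dict_token = f"x-zstd-dict-{dict_id}"
        if accepted.get(dict_token, 0.0) > 0:
            return dict_token
    if accepted.get("zstd", 0.0) > 0:
        return "zstd"  # plain zstd: the response must not depend on a dictionary
    return None
```

With this shape, a client that only sends `Accept-Encoding: gzip, zstd` never receives dictionary-compressed data, even if the server has a dictionary configured.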

@felixhandte
I have pushed a commit 881dc27, which adds a warning about the risk of using the zstd_dict_file directive.

BTW, what I want to ask is: will Zstandard publish a "common dictionary" that all servers and clients could be asked to use, as part of a specification for HTTP encoding/decoding, just like the HPACK specification in HTTP/2? Of course there are some significant differences: HPACK is used for compressing HTTP headers, while zstd is used for compressing the resource itself.

Thanks!

@tokers, in short, yes. Though we anticipate standardizing a set of dictionaries, rather than a single one (e.g., one for CSS, one for JS, one for HTML, etc.). Note that when we do so, we will specify a new content coding identifier for zstd + static-dict traffic, rather than try to backport it into the zstd content-coding.

@fengidri, yes. If the client and server have pre-negotiated the use of a specific dictionary, I am suggesting that they also agree to use some other content-coding identifier, since their traffic will not be interoperable with implementations that conform to zstd as standardized. For example, a middlebox that understands Content-Encoding: zstd may fail to relay traffic that identifies itself as zstd-compressed but that it cannot decompress.
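The middlebox failure mode can be reproduced in a few lines. To stay dependency-free, this sketch uses zlib's preset-dictionary feature from the Python standard library as a stand-in for zstd. It is not zstd code, but the situation is analogous: a receiver that recognizes the format and lacks the dictionary cannot decode the payload. The sample dictionary, body, and the helper `try_decode` are invented for the illustration.

```python
import zlib

# Dictionary shared out of band between one particular client and server.
shared_dict = b'{"status": "ok", "items": []}'
body = b'{"status": "ok", "items": [1, 2, 3]}'

# Sender compresses against the shared dictionary.
encoder = zlib.compressobj(zdict=shared_dict)
payload = encoder.compress(body) + encoder.flush()


def try_decode(data: bytes, zdict=None):
    """Return the decoded bytes, or None if the stream cannot be decoded."""
    if zdict is not None:
        decoder = zlib.decompressobj(zdict=zdict)
    else:
        decoder = zlib.decompressobj()
    try:
        return decoder.decompress(data) + decoder.flush()
    except zlib.error:
        return None  # the stream requires a dictionary this receiver lacks


with_dict = try_decode(payload, shared_dict)  # round-trips to the original body
without_dict = try_decode(payload)            # fails: no dictionary available
```

An intermediary that trusts the standard identifier is in the position of the second call: the label says it can decode the payload, and the payload says otherwise.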

@tokers, worrying about whether it's legal to use a non-standard identifier to describe non-standard behavior seems to me like it misses the point. In fact, using a standardized identifier to advertise non-standard behavior seems to me like the worst possible option. It would be much better to advertise non-standard behavior with a non-standard id.

And yes, it is legal to do so. Content-coding identifiers should be registered with IANA, but are not required to be (source). Though it looks like my suggestion to add an x- prefix has since been deprecated (source).

@felixhandte

> And yes, it is legal to do so. Content-coding identifiers should be registered with IANA, but are not required to be (source). Though it looks like my suggestion to add an x- prefix has since been deprecated (source).

OK, got it.

I think the negotiation mechanism could vary with the transport. For example, if the resource is requested over TLS/SSL, the dictionary ID could be advertised through a TLS extension, as with ALPN/NPN. Otherwise it could be negotiated through HTTP headers, as with the cleartext HTTP/2 upgrade and the WebSocket handshake.

Looking forward to your efforts!