pallets/werkzeug

drop support for bytes


Most of the library should only work with strings. Being able to pass bytes to most functions is an artifact of only supporting Python 2, then also supporting 3, then dropping support for 2. WSGI on Python 3 only deals with ISO-8859-1 characters. The modern WHATWG HTML and URL standards require UTF-8. HTTP headers must be ISO-8859-1 (new headers should be ASCII), with quoting or other encoding schemes used to first convert UTF-8 to valid characters before encoding to bytes.

As #2406 points out, we spend too much time doing instance checks and encoding. This is often redundant, because we currently support bytes or strings being passed to any function, and one function calls others that do the same. Very few places should be accepting both bytes and strings; we should be dealing with strings for most data, and bytes only where binary data makes sense.
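To make the pattern concrete, here is a simplified sketch (hypothetical helpers loosely echoing Werkzeug's header utilities, not the real implementations): every layer defensively normalizes its input, so the same isinstance check and decode run multiple times per call.

```python
def _to_str(value):
    # Hypothetical helper: every public function calls this defensively.
    if isinstance(value, bytes):
        return value.decode("latin1")
    return value


def quote_value(value):
    value = _to_str(value)  # check and decode, round one
    return f'"{value}"'


def dump_values(values):
    # Each item is normalized here *and* again inside quote_value,
    # so for str-only callers the checks are pure overhead.
    return ", ".join(quote_value(_to_str(v)) for v in values)
```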

I've been working on this a bit, and after removing enough of these checks there is a significant speedup. It's not trivial, though.

So many different places support str and bytes in so many different ways, with indirection through helper functions, that it's hard to know if I've caught all of them. It would also require adding deprecation warnings in a huge number of places. Deprecating everything is a huge amount of effort: I'm not sure I'll catch it all, and catching it all might actually require adding more checks in the short term, making things even slower.

I'm wondering if it might be better to announce a blanket policy, remove bytes support as we find more of it, and expect users to respond to errors instead of warnings. I have a feeling that using bytes anywhere is pretty rare at this point for any app that is staying reasonably up to date.

The one issue I'm not sure how to handle is that being able to pass bytes to some URL-related functions means you can technically percent-encode \xHH-style bytes even if they don't correspond to valid UTF-8. This "works", but I'm not clear that it's correct or desirable. It's probably better to just stop supporting this and require people to get on UTF-8; that's what the WHATWG HTML and URL standards do.
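To illustrate with the standard library (urllib.parse here, just as a stand-in for Werkzeug's own quoting helpers): bytes input lets callers percent-encode values that can never decode as UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes

raw = b"caf\xe9"  # Latin-1 bytes for "café"; not valid UTF-8
print(quote(raw))  # 'caf%E9' -- arbitrary bytes get percent-encoded

# The round trip works at the byte level...
print(unquote_to_bytes("caf%E9"))  # b'caf\xe9'
# ...but b"caf\xe9".decode("utf-8") raises UnicodeDecodeError.

# With str input, quoting always goes through UTF-8 first:
print(quote("café"))  # 'caf%C3%A9'
```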

My strategy right now is to add deprecation warnings if the type annotations or docs stated that bytes were supported, or if tests failed when support was removed. For places that weren't documented or tested, I'm just removing it. This seems to be a good heuristic for cutting down on the number of extra checks added to the code, although there are still quite a few.

I've also been looking at the charset attribute of requests and responses. I'm strongly considering deprecating it and always using UTF-8.
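For context, charset is currently a class-level attribute, so customizing it means subclassing the wrappers. A minimal sketch of the rarely-used feature this would deprecate:

```python
from werkzeug.wrappers import Request, Response


# Overriding the default charset via subclassing; this is the
# customization that deprecation would remove in favor of UTF-8.
class Latin1Request(Request):
    charset = "iso-8859-1"


class Latin1Response(Response):
    charset = "iso-8859-1"
```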

They default to UTF-8, and while it's lightly documented that it's possible to customize that, I can't find examples on GitHub of anyone doing so (which is admittedly difficult to search for), or 3rd-party tutorials mentioning it. We explicitly do not trust the encoding in the request's Content-Type header, and I'm guessing that makes it much more likely that devs fix the encoding of their HTML document so that it's UTF-8.

The WHATWG HTML and URL standards (and others, like Fetch) all use UTF-8. The only time a request from a browser would contain non-UTF-8 data is if the encoding of the response serving the page was not UTF-8. That's under our control: if we send UTF-8, we will receive it. We already send UTF-8 by default, which means anything not deliberately changing the response charset is already receiving UTF-8.

A quick look at some WSGI and ASGI servers shows that they all use value.decode("latin1") to convert header bytes to text (that's part of WSGI, but not ASGI). Assumptions about the "real" encoding are inconsistent, and according to the HTTP spec headers should consist of ASCII only, although they may include all 1-byte values (ISO-8859-1 / Latin-1). But it's not really possible to know what any given "legacy" header is encoded as.
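Because Latin-1 maps every byte value to exactly one code point, the WSGI decode is lossless, and an application that knows the real encoding can always round-trip back. A quick sketch:

```python
# A WSGI server decodes raw header bytes as Latin-1, so every byte
# survives as exactly one code point.
raw = "naïve".encode("utf-8")       # b'na\xc3\xafve' on the wire
wsgi_value = raw.decode("latin1")   # 'naÃ¯ve' as the app sees it

# An app that knows the bytes were really UTF-8 can recover them:
print(wsgi_value.encode("latin1").decode("utf-8"))  # 'naïve'
```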

When decoding percent-encoded URL paths and query strings, we can leave invalid bytes percent-encoded instead of replacing them, as we do now by default. We already have a special internal errors="werkzeug.url_quote" handler for IRI conversion.
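A sketch of what such a codec error handler can look like (illustrative only, registered under a demo name; the internal "werkzeug.url_quote" handler is the real version of this idea):

```python
import codecs
from urllib.parse import quote


def _quote_invalid(error):
    # Re-percent-encode the bytes that failed to decode, instead of
    # replacing them with U+FFFD.
    bad = error.object[error.start:error.end]
    return quote(bad, safe=""), error.end


codecs.register_error("demo.url_quote", _quote_invalid)

# %E9 isn't valid UTF-8, so it stays percent-encoded in the result:
print(b"/caf\xe9".decode("utf-8", errors="demo.url_quote"))  # '/caf%E9'
```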

The cookie spec is really bad about encoding, but Python's http.cookies used a weird \123 slash-octal encoding, and we copied that. As a result, our cookies should already be ASCII-only, with non-ASCII data most likely encoded to UTF-8 bytes before being slash-escaped.
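To illustrate the \123-style escaping described above (a simplified sketch, not Werkzeug's or http.cookies' actual quoting, which also escapes quotes and backslashes): non-ASCII text stays ASCII-safe because its UTF-8 bytes are octal-escaped.

```python
def octal_escape(value: str) -> str:
    # Escape each non-ASCII character's UTF-8 bytes as \OOO octal.
    return "".join(
        ch if ch.isascii()
        else "".join(f"\\{byte:03o}" for byte in ch.encode("utf-8"))
        for ch in value
    )


print(octal_escape("café"))  # caf\303\251 -- ASCII-only output
```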

application/x-www-form-urlencoded should be ASCII token chars and percent encoding (UTF-8), both for the URL and for legacy form data. multipart/form-data would be UTF-8 when the originating document was UTF-8, and the charset multipart header should be ignored for the same reason the request header is. JSON is always UTF-something.
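For the urlencoded case, the standard library already assumes percent-encoded UTF-8, so no charset guessing is involved:

```python
from urllib.parse import parse_qsl

# Percent-encoded UTF-8 in a query string or legacy form body
# decodes cleanly to str.
body = "name=caf%C3%A9&tags=a&tags=b"
print(parse_qsl(body))  # [('name', 'café'), ('tags', 'a'), ('tags', 'b')]
```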

Wikipedia on UTF-8 cites https://w3techs.com/technologies/overview/character_encoding:

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 97.9% of all web pages, over 99.0% of the top 10,000 pages, and up to 100.0% for many languages, as of 2023.

Just like other deprecations, this deprecation would give the small percentage of sites an opportunity to see the warning and update, or pin to the current version. The fix would be either sending their HTML as UTF-8, or otherwise updating their client code to send UTF-8 instead of something else.

In the rare case that UTF-8 is still not correct, request.data is bytes and can be decoded appropriately for the specific case. If invalid bytes in URLs remain percent-encoded, the string can be re-encoded and unquoted appropriately. request.headers contains the Latin-1 decoded strings, which can be re-encoded appropriately. It doesn't matter what the internal encoding for escaping is, since we control both sides of that. Users could also pre-encode non-ASCII data; the cookie spec, for example, suggests base64, which is already what itsdangerous does.
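For example, a user who knows a client really sent Latin-1 can still recover the data themselves under this scheme:

```python
from urllib.parse import unquote

# A path left percent-encoded because %E9 isn't valid UTF-8 can be
# unquoted with the encoding the client actually used:
print(unquote("/caf%E9", encoding="latin1"))  # '/café'

# Likewise, raw body bytes can be decoded per case:
data = b"caf\xe9"  # what request.data would hold
print(data.decode("latin1"))  # 'café'
```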

Using the 'replace' decoding error handler means that the app will end up processing junk data from clients sending incorrectly encoded bytes. It seems like raising a 400 error would be better, but I'm not doing that for now.
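For comparison, this is what 'replace' produces today: the junk reaches the app as U+FFFD replacement characters rather than as an error.

```python
# Mis-encoded bytes silently become replacement characters:
print(b"caf\xe9".decode("utf-8", errors="replace"))  # 'caf�'
```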