python-hyper/hyperlink

Hypothesis: builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

wsanchez opened this issue · 7 comments

The Hypothesis strategies now shipping with Hyperlink are producing this error occasionally in Klein:

Traceback (most recent call last):
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/klein/test/test_request_compat.py", line 74, in test_uri
    def test_uri(self, url: DecodedURL) -> None:
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/hypothesis.py", line 321, in decoded_urls
    return DecodedURL(draw(encoded_urls()))
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in path
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in <listcomp>
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

klein.test.test_request_compat.HTTPRequestWrappingIRequestTests.test_uri

It would be helpful to catch this error and print the URL that produced it, so one might see what data is tripping us up.
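For illustration, here is a rough sketch of what "catch the error and name the URL" could look like. It uses only the standard library as a stand-in for Hyperlink's decoding, and `decode_path_or_report` is a hypothetical helper, not part of Hyperlink or Klein:

```python
from urllib.parse import unquote_to_bytes, urlsplit


def decode_path_or_report(url_text):
    """Percent-decode each path segment; name the URL if decoding fails.

    Hypothetical helper: a stdlib approximation, not Hyperlink's code.
    """
    decoded = []
    for segment in urlsplit(url_text).path.split("/"):
        raw = unquote_to_bytes(segment)
        try:
            decoded.append(raw.decode("utf-8"))
        except UnicodeDecodeError as e:
            # Chain the original error so its byte offset stays visible.
            raise ValueError(
                f"cannot decode {raw!r} from URL {url_text!r}"
            ) from e
    return decoded
```

Calling it with 'http://0.0/%80' raises a ValueError whose message includes both the offending bytes and the URL text, with the original UnicodeDecodeError attached as its cause.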

Here are some failing examples:

error-causing bytes: b'\x80'
URL: URL.from_text('http://0.0/%80')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b'
URL: URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹pɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13𐬃\x94\x8e')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹p1ɜ10貭.в.𢙑dɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13\U0004216a\x9d𠤈\x94\x8e')

…which one can reproduce in the REPL:

>>> from hyperlink import EncodedURL, DecodedURL
>>> encodedURL = EncodedURL.from_text('http://0.0/%80')
>>> encodedURL
URL.from_text('http://0.0/%80')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> encodedURL = EncodedURL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> encodedURL
URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 9: invalid start byte
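
The "position 9" in the second traceback can be checked with the standard library alone. This sketch assumes `urllib.parse.unquote_to_bytes` is a fair stand-in for the byte-level step in Hyperlink's `_percent_decode`; the three literal characters encode to 3 + 2 + 4 = 9 valid UTF-8 bytes, so the stray `%9b` lands at offset 9:

```python
from urllib.parse import unquote_to_bytes

# Last path segment of the second failing URL, written with escapes:
# U+1304, U+00E9, U+71A5D, followed by the percent-encoded byte 0x9b.
segment = "\u1304\u00e9\U00071a5d%9b"

# Non-ASCII characters are encoded as UTF-8; %9b becomes the raw byte 0x9b.
raw = unquote_to_bytes(segment)

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    bad_offset = e.start                          # offset of the bad byte
    bad_byte = raw[e.start:e.end]                 # the byte that failed
    valid_prefix = raw[:e.start].decode("utf-8")  # everything before it
```

Here `raw` matches the "error-causing bytes" listed above for that URL, and everything before the bad byte decodes cleanly.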

@glyph @mahmoud I'm curious if you think this may suggest a bug in Hyperlink… that we have allowed the creation of an EncodedURL which cannot be decoded…?

glyph commented

I think DecodedURL maybe has a bit of leeway with a URL like this to mangle it, or to make it not completely round-trip-able through every API. Browsers have to cope with this kind of mess, and they definitely do some mangling. For example, if you paste https://example.com/%80é into Safari or Chrome, you get https://example.com/%80%C3%A9. Now, granted, that's a bit more like an EncodedURL, but in that case you can deliver the percent-encoded text directly to the application: if you manually delete the %80, you'll notice that you get https://example.com/é back again, visually.
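
The browser behavior described above can be approximated with the standard library. This is only a sketch of the same transformation, not a claim about how Safari or Chrome implement it; treating `%` as a safe character is an assumption made so that the existing `%80` is left alone:

```python
from urllib.parse import quote, unquote

pasted = "/%80\u00e9"  # the path "/%80é" as typed into the address bar

# Encode non-ASCII characters as UTF-8 percent-escapes, but leave
# existing percent-escapes (and slashes) untouched.
normalized = quote(pasted, safe="/%")

# Deleting the undecodable escape leaves a path that decodes cleanly.
without_bad = normalized.replace("%80", "")
shown = unquote(without_bad)
```

`normalized` is '/%80%C3%A9', and once the `%80` is removed the remainder decodes back to '/é'.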

glyph commented

If you were to manipulate a busted URL like this, or manually create a copy via moving strings with DecodedURL, you'd get %2580%25C3%25A9 - but I think that's fine. Maybe there should be a switch about whether to raise or mangle on encoding errors when you create the object?
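
A minimal sketch of such a raise-or-mangle switch, again using the standard library rather than Hyperlink itself. The function name and the `strict` parameter are hypothetical; in lenient mode, text that fails to decode is handed back still percent-encoded, and re-encoding it then produces the doubled escapes mentioned above:

```python
from urllib.parse import quote, unquote_to_bytes


def percent_decode(text, strict=True):
    """Percent-decode text as UTF-8.

    Hypothetical switch: strict=True re-raises the UnicodeDecodeError,
    strict=False returns the text unchanged (still percent-encoded).
    """
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if strict:
            raise
        return text


kept = percent_decode("%80%C3%A9", strict=False)  # cannot decode, kept as-is
reencoded = quote(kept, safe="")                  # every '%' becomes '%25'
```

With `strict=False` the application sees the literal text '%80%C3%A9', so encoding it again gives '%2580%25C3%25A9'; with `strict=True` the same input raises, as in the tracebacks above.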