Hypothesis: builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
wsanchez opened this issue · 7 comments
The Hypothesis strategies now shipping with Hyperlink are producing this error occasionally in Klein:
Traceback (most recent call last):
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/klein/test/test_request_compat.py", line 74, in test_uri
def test_uri(self, url: DecodedURL) -> None:
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hypothesis/core.py", line 1163, in wrapped_test
raise the_error_hypothesis_found
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/hypothesis.py", line 321, in decoded_urls
return DecodedURL(draw(encoded_urls()))
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2046, in __init__
self.host, self.userinfo, self.path, self.query, self.fragment
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in path
for p in self._url.path
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in <listcomp>
for p in self._url.path
File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 766, in _percent_decode
return unquoted_bytes.decode(subencoding)
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

klein.test.test_request_compat.HTTPRequestWrappingIRequestTests.test_uri
It would be helpful to catch this error and print the URL that produced it, so one might see what data is tripping us up.
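As a rough sketch of that idea, the snippet below wraps the decoding step and re-raises with the offending text and bytes attached. It uses only the standard library's urllib.parse.unquote_to_bytes as a stand-in for Hyperlink's internal _percent_decode; the decode_or_report helper and its message format are hypothetical and not part of Hyperlink.

```python
from urllib.parse import unquote_to_bytes


def decode_or_report(text: str, encoding: str = "utf-8") -> str:
    """Percent-decode text, naming the offending input if decoding fails.

    Hypothetical helper: a stdlib stand-in for Hyperlink's decoding step.
    """
    # For str input, raw non-ASCII characters are UTF-8 encoded first,
    # then each %XX escape becomes a single byte.
    raw = unquote_to_bytes(text)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        # Re-raise with the original text and bytes attached for debugging.
        raise ValueError(
            f"cannot decode {text!r} (bytes: {raw!r}): {e.reason}"
        ) from e


try:
    decode_or_report("%80")
except ValueError as e:
    message = str(e)
    print(message)
```

Chaining with "from e" keeps the original UnicodeDecodeError available as __cause__, so nothing from the existing traceback is lost.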
Here are some failing examples:
error-causing bytes: b'\x80'
URL: URL.from_text('http://0.0/%80')

error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b'
URL: URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')

error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹pɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13𐬃\x94\x8e')

error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹p1ɜ10貭.в.𢙑dɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13\U0004216a\x9d𠤈\x94\x8e')

…which one can reproduce in the REPL:
>>> from hyperlink import EncodedURL, DecodedURL
>>> encodedURL = EncodedURL.from_text('http://0.0/%80')
>>> encodedURL
URL.from_text('http://0.0/%80')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
self.host, self.userinfo, self.path, self.query, self.fragment
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
[
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
_percent_decode(p, raise_subencoding_exc=True)
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> encodedURL = EncodedURL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> encodedURL
URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
self.host, self.userinfo, self.path, self.query, self.fragment
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
[
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
_percent_decode(p, raise_subencoding_exc=True)
File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 9: invalid start byte

I think DecodedURL has a bit of leeway with a URL like this, either to mangle it or to make it not completely round-trippable through every API. Browsers have to cope with this kind of mess, and they definitely do some mangling. For example, if you paste https://example.com/%80é into Safari or Chrome, you get https://example.com/%80%C3%A9. Granted, that's a bit more like an EncodedURL, but in that case you can deliver the percent-encoded text directly to the application. And if you manually delete the %80, you'll notice that you get https://example.com/é back again, visually.
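To make the browser example concrete, here is a small standard-library sketch that reproduces the same strings. It only approximates what the browsers appear to do (their real normalization rules are more involved), and treating '%' as a safe character is my assumption about how to mimic them.

```python
from urllib.parse import quote, unquote_to_bytes

pasted = "/%80é"

# Leave existing percent-escapes alone ('%' is marked safe) and
# percent-encode only the raw non-ASCII characters as UTF-8.
normalized = quote(pasted, safe="/%")
print(normalized)  # /%80%C3%A9

# Without the stray %80, the remaining escapes are valid UTF-8 again.
without_stray = normalized.replace("%80", "")
decoded = unquote_to_bytes(without_stray).decode("utf-8")
print(decoded)  # /é
```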
If you were to manipulate a busted URL like this, or manually create a copy by moving strings through DecodedURL, you'd get %2580%25C3%25A9, but I think that's fine. Maybe there should be a switch for whether to raise or mangle on encoding errors when you create the object?
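One way such a switch might look, sketched with the standard library only: the percent_decode function and its on_error flag are hypothetical names for illustration, not an existing Hyperlink API. The "keep" mode leaves undecodable text percent-encoded, and the last line shows how re-encoding that text produces the doubled escapes mentioned above.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_decode(text: str, on_error: str = "raise") -> str:
    """Decode percent-escapes as UTF-8 (hypothetical helper).

    on_error="raise": propagate UnicodeDecodeError, as DecodedURL does
    in the tracebacks above.
    on_error="keep": return the text unchanged, escapes and all.
    """
    if on_error not in ("raise", "keep"):
        raise ValueError(f"unknown on_error: {on_error!r}")
    try:
        return unquote_to_bytes(text).decode("utf-8")
    except UnicodeDecodeError:
        if on_error == "raise":
            raise
        return text


kept = percent_decode("%80%C3%A9", on_error="keep")
print(kept)  # %80%C3%A9

# Re-encoding the kept text as if it were decoded doubles the escapes.
doubled = quote(kept, safe="")
print(doubled)  # %2580%25C3%25A9
```

Keeping the whole segment encoded is the bluntest possible form of mangling; decoding the valid runs and leaving only the bad bytes escaped would be closer to what the browsers show, at the cost of more code.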