Replace buggy urllib.parse
JustAnotherArchivist opened this issue · 0 comments
Python's URL parsing with urllib.parse.urlparse
works well for the most common formats, but it quickly breaks down in edge or corner cases. This caused ArchiveBot job 33k8egvaa5dsfxva1s0lsnmv4 to crash with an Invalid IPv6 address
error on the URL http://[email=%22info@epic4health.com/
, which is odd but perfectly parseable per the URL Standard (though it would produce a validation error since credentials are not allowed in valid URLs). There is a list of various similar issues at https://bugs.python.org/issue36338#msg355322.
Because urllib works fine in most cases, there aren't many alternative URL parser packages. A promising candidate is whatwg-url (repo), which is an implementation of the URL Standard.