ArchiveTeam/wpull

Replace buggy urllib.parse

JustAnotherArchivist opened this issue · 0 comments

Python's URL parsing with urllib.parse.urlparse works well for the most common formats, but it breaks down quickly in edge cases. This caused ArchiveBot job 33k8egvaa5dsfxva1s0lsnmv4 to crash with an Invalid IPv6 address error on the URL http://[email=%22info@epic4health.com/. That URL is odd but perfectly parseable per the URL Standard (though it would produce a validation error, since credentials are not allowed in valid URLs). There is a list of similar issues at https://bugs.python.org/issue36338#msg355322.
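For reference, here is a minimal sketch that reproduces the failure with the standard library alone. The helper name try_parse is made up for illustration and is not part of wpull. The exact exception text may vary between Python versions and may not match the message wpull reported, so the sketch only relies on a ValueError being raised.

```python
from urllib.parse import urlsplit


def try_parse(url):
    """Hypothetical helper: report whether urllib.parse accepts a URL.

    Returns (True, hostname) on success, or (False, error message) if
    urlsplit raises ValueError.
    """
    try:
        parts = urlsplit(url)
        return True, parts.hostname
    except ValueError as exc:
        # urlsplit rejects a netloc containing '[' without a matching ']',
        # because it assumes the bracket starts an IPv6 literal.
        return False, str(exc)


if __name__ == "__main__":
    for url in (
        "http://example.com/",
        "http://[email=%22info@epic4health.com/",  # URL from the crashed job
    ):
        print(url, try_parse(url))
```

A parser following the URL Standard would instead treat everything before the @ as credentials and parse the host as epic4health.com.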

Because urllib works fine in most cases, there aren't many alternative URL parser packages. A promising candidate is whatwg-url, an implementation of the URL Standard.