WaybackURLKeyMaker to keep non-utf8 percent encodings
sebastian-nagel commented
WaybackURLKeyMaker.makeKey(url) replaces percent signs with %25 in percent-encoded URLs whose escaped bytes do not represent valid UTF-8 encoded characters (as was common before RFC 3986):
http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5
Python's surt module behaves differently, which breaks lookups in CDX files for such URLs.
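To make the reported behavior easier to reproduce, here is a minimal Python sketch. It is not the code of WaybackURLKeyMaker or of surt: `has_non_utf8_escapes` and `double_escape` are hypothetical helpers that only use the standard library. The first checks whether the percent-decoded path or query is valid UTF-8; the second mimics the output shown above by turning every `%` into `%25` and lowercasing the result.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def has_non_utf8_escapes(url: str) -> bool:
    """Return True if the percent-decoded path or query is not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    for component in (parts.path, parts.query):
        try:
            unquote_to_bytes(component).decode("utf-8")
        except UnicodeDecodeError:
            return True
    return False


def double_escape(component: str) -> str:
    """Mimic the reported key: each '%' becomes '%25', then lowercase.

    This reproduces the observed output, not the library's actual algorithm.
    """
    return component.replace("%", "%25").lower()


url = "http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm"
print(has_non_utf8_escapes(url))
print(double_escape(urlsplit(url).path))
```

Both example URLs from this issue are flagged by the check, while a URL with UTF-8 escapes such as `caf%C3%A9` is not.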
sebastian-nagel commented
Difficult to solve: Python (2.7) and Java have different string types, based on bytes and Unicode characters, respectively. The "surt" module used with Python 3 causes a similar problem (internetarchive/surt#19).
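The bytes-versus-Unicode difference can be illustrated with Python 3's standard library alone. This is only a sketch of the general effect, not a description of how surt or WaybackURLKeyMaker is implemented: decoding non-UTF-8 escapes into a Unicode string loses the original bytes, while decoding into bytes keeps them and allows a lossless round trip.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# First three escapes of the query value from the second example URL.
escaped = "%E4%EE%F8"

# Unicode-based decoding: invalid UTF-8 bytes are replaced by U+FFFD
# (urllib.parse.unquote defaults to errors="replace").
as_text = unquote(escaped)

# Byte-based decoding: the raw bytes are preserved.
as_bytes = unquote_to_bytes(escaped)

print(repr(as_text))
print(as_bytes)
print(quote(as_bytes) == escaped)  # the byte round trip restores the input
print(quote(as_text) == escaped)   # the Unicode round trip does not
```

Once the escapes have passed through a Unicode string, the original byte values can no longer be recovered, so a key maker working on Unicode strings has to leave such escapes undecoded if it wants to keep them.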