Tildes in links in Shift-JIS pages are interpreted as %E2%80%BE when html-parser is set to libxml2-lxml
DoomTay opened this issue · 0 comments
For example, given an HTML file with a link set up like this:
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="~tildepath/">Bar</a>
the URL is interpreted as http://localhost:8000/%E2%80%BEtildepath/, which then results in a 404.
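For context, %E2%80%BE is the percent-encoded UTF-8 form of U+203E OVERLINE, the character that JIS X 0201 (the single-byte half of Shift_JIS) assigns to byte 0x7E. A likely explanation, though not confirmed here, is that the parser's decoder maps 0x7E to OVERLINE rather than to an ASCII tilde. The sketch below reproduces the resulting URL using only the Python standard library; `resolve` and `normalize_href` are hypothetical helpers written for illustration and are not part of wpull.

```python
import unicodedata
from urllib.parse import quote, urljoin

OVERLINE = "\u203e"  # JIS X 0201 places this character at byte 0x7E


def resolve(base, href):
    """Join href onto base, then percent-encode non-ASCII characters as UTF-8."""
    return quote(urljoin(base, href), safe=":/~")


def normalize_href(href):
    """Hypothetical workaround: map OVERLINE back to an ASCII tilde."""
    return href.replace(OVERLINE, "~")


base = "http://localhost:8000/"
# Assumption: this is what the parser hands back for href="~tildepath/".
decoded_href = OVERLINE + "tildepath/"

print(unicodedata.name(OVERLINE))                   # OVERLINE
print(resolve(base, decoded_href))                  # http://localhost:8000/%E2%80%BEtildepath/
print(resolve(base, normalize_href(decoded_href)))  # http://localhost:8000/~tildepath/
```

Replacing every OVERLINE with a tilde would also rewrite URLs that genuinely contain U+203E, so a real fix would more plausibly belong in the decoding step than in a blanket substitution like this one.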
Curiously, if I set wpull to start at http://localhost:8000/~tildepath/, that page and subsequent URLs are found properly, so long as those URLs do not themselves contain tildes.
This turned up in ArchiveBot job 62thvsbqv0tn0af8fhhjklya3 as well as, judging from past logs, jobs 5bi1u8ffbrtwnf2jb6d3prqwj, 6gjq81kbvhhcjvf6v5z4ysv4i and 2bkvkya714zxqkity2cmw1w10.
This definitely happens when the html-parser option is set to libxml2-lxml in 2.0.1, but not in 1.2.3. In addition, I found a similar issue with wget discussed here and here.