ArchiveTeam/wpull

Tildes in links in Shift-JIS pages are interpreted as %E2%80%BE when html-parser is set to libxml2-lxml

DoomTay opened this issue · 0 comments

For example, given an HTML file with a link set up like this:

<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="~tildepath/">Bar</a>

wpull interprets the URL as http://localhost:8000/%E2%80%BEtildepath/, which then results in a 404.
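For context, %E2%80%BE is the percent-encoded UTF-8 form of U+203E (OVERLINE), the character that JIS X 0201 assigns to byte 0x7E where ASCII has the tilde. A plausible explanation, which I have not confirmed in the wpull or lxml code, is that the parser decodes the `~` byte as U+203E before the link is resolved. The sketch below only reproduces the resulting URL with the standard library; `percent_encode_path` and `restore_tilde` are hypothetical helpers for illustration, not part of wpull.

```python
from urllib.parse import quote, unquote, urljoin

# U+203E OVERLINE: the character JIS X 0201 places at byte 0x7E.
OVERLINE = "\u203e"


def percent_encode_path(path: str) -> str:
    """Percent-encode a relative path as UTF-8, leaving '/' and '~' alone."""
    return quote(path, safe="/~")


def restore_tilde(href: str) -> str:
    """Hypothetical workaround: map a mis-decoded overline back to '~'."""
    return href.replace(OVERLINE, "~")


base = "http://localhost:8000/"

# Assumption: this is what the parser hands back for href="~tildepath/".
misdecoded = OVERLINE + "tildepath/"

broken = urljoin(base, percent_encode_path(misdecoded))
fixed = urljoin(base, percent_encode_path(restore_tilde(misdecoded)))

print(broken)  # the URL that ends in a 404
print(fixed)   # the URL the page author intended
```

Blindly replacing U+203E would also rewrite a link that genuinely contains an overline, so a real fix would more likely belong in how the document is decoded than in post-processing the links.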

Curiously, if I set wpull to start at http://localhost:8000/~tildepath/, that page and subsequent URLs are found properly, so long as those URLs do not themselves contain tildes.

This turned up in ArchiveBot job 62thvsbqv0tn0af8fhhjklya3 as well as, judging from past logs, jobs 5bi1u8ffbrtwnf2jb6d3prqwj, 6gjq81kbvhhcjvf6v5z4ysv4i and 2bkvkya714zxqkity2cmw1w10.

This definitely happens when html-parser is set to libxml2-lxml in 2.0.1, but not in 1.2.3. In addition, I found a similar issue with wget discussed here and here.