HTML in JavaScript leads to undecoded character references in URLs

Question

JustAnotherArchivist opened this issue 4 years ago · 0 comments

JustAnotherArchivist commented 4 years ago

When wpull encounters HTML inside JavaScript strings (or a JSON API response), it does not decode character references in the extracted URLs, because it does not treat HTML in JS strings specially at all. This causes frequent &amp; appearances in URLs. Further, if a numeric character reference (&#nnn;) is involved, part of the URL is dropped entirely on parsing, as everything after the hash is treated as the fragment (seen in ArchiveBot job 51nt0cax16fen2l8kv14kraon).
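To illustrate the symptom outside of wpull, here is a minimal sketch using only the standard library's `urllib.parse.urlsplit`. The URL is a made-up example, not one taken from the job above, and this is not wpull's own extraction code.

```python
from urllib.parse import urlsplit

# Hypothetical URL as it might be extracted from HTML embedded in a JS string,
# with the character references still undecoded.
raw = "http://example.com/page?a=1&amp;b=2&#38;c=3"

parts = urlsplit(raw)

# The named reference survives literally in the query string.
print(parts.query)     # a=1&amp;b=2&

# The '#' of the numeric reference starts the fragment, so the rest of the
# query is lost once the fragment is stripped for the request.
print(parts.fragment)  # 38;c=3
```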

I'm not sure what the best strategy here is. Trying to detect whether a JS string contains HTML is probably expensive and may not be worth it. Attempting to decode char refs in JS-extracted URLs may be worth exploring though.
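A rough sketch of what decoding char refs on JS-extracted URLs could look like, using `html.unescape` from the standard library. The function name `decode_char_refs` and the sample URL are hypothetical, and none of this is existing wpull code. It assumes that only semicolon-terminated references should be decoded, since plain `html.unescape` also accepts some legacy names without a semicolon and could corrupt a legitimate parameter such as `&copy=2`.

```python
import re
from html import unescape

# Match only references that end in ';' (named, decimal, or hexadecimal).
# Assumption: requiring the semicolon avoids touching ordinary query
# parameters whose names happen to look like legacy entity names.
_CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_char_refs(url: str) -> str:
    """Decode semicolon-terminated HTML character references in a URL.

    Unknown names (e.g. '&foo;') are left unchanged by html.unescape.
    """
    return _CHAR_REF.sub(lambda m: unescape(m.group(0)), url)


# Hypothetical URL extracted from HTML inside a JS string.
raw = "http://example.com/?x=1&copy=2&amp;y=3&#38;z=4"
print(decode_char_refs(raw))
```

Whether this should run on every JS-extracted URL or only on ones that visibly contain a reference is an open design question; a URL that legitimately contains a literal `&amp;` would be altered by this approach.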