Encourage denoting character-attributable errors by the REPLACEMENT CHARACTER
hsivonen opened this issue · 3 comments
What is the issue with the URL Standard?
The URL Standard gives advice about URL rendering:
https://url.spec.whatwg.org/#ref-for-concept-domain-to-unicode%E2%91%A0
The https://url.spec.whatwg.org/#concept-host-parser section also says: "Alternatively UTF-8 decode without BOM or fail can be used, coupled with an early return for failure, as domain to ASCII fails on U+FFFD (�).", which is the opposite of what I'm asking for here.
UTS 46 says: "Implementations may make further modifications to the resulting Unicode string when showing it to the user. For example, it is recommended that disallowed characters be replaced by a U+FFFD to make them visible to the user."
It would be useful for the URL Standard to highlight this technique and to include a Note that encourages two things: letting U+FFFD from UTF-8 decode flow through the processing, and replacing code points found erroneous during UTS 46 processing or forbidden domain code point processing with U+FFFD. That way, errors that are attributable to specific code points in the domain are visualized to the user. Since U+FFFD is itself a disallowed character, this technique preserves the overall failure status of the domain.
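
To illustrate, a rough Python sketch of the idea, not proposed spec text. The names `is_forbidden_domain_code_point` and `visualize_domain_errors` are made up for this example, the forbidden set is my transcription of the URL Standard's forbidden domain code points, and the UTS 46 step is left out entirely (a real implementation would replace code points that UTS 46 disallows in the same way).

```python
from typing import Tuple

REPLACEMENT = "\uFFFD"

# Forbidden host code points, transcribed from the URL Standard (assumption:
# check against the current spec text before relying on this list).
FORBIDDEN_HOST = set("\x00\t\n\r #/:<>?@[\\]^|")


def is_forbidden_domain_code_point(ch: str) -> bool:
    # Forbidden domain code point: a forbidden host code point, a C0 control,
    # U+0025 (%), or U+007F DELETE.
    return ch in FORBIDDEN_HOST or ord(ch) <= 0x1F or ch in "%\x7f"


def visualize_domain_errors(raw: bytes) -> Tuple[str, bool]:
    """Return (display string, failed) for a percent-decoded host.

    Hypothetical helper: decode errors and forbidden code points both
    become U+FFFD instead of triggering an early return.
    """
    # Lossy decode: malformed byte sequences turn into U+FFFD and flow through.
    text = raw.decode("utf-8", errors="replace")
    shown = "".join(
        REPLACEMENT if is_forbidden_domain_code_point(c) else c for c in text
    )
    # U+FFFD is itself disallowed, so its presence keeps the failure status.
    return shown, REPLACEMENT in shown


if __name__ == "__main__":
    for host in (b"example.com", b"ex\xffample.com", b"exa mple.com"):
        print(visualize_domain_errors(host))
```

The point of returning both values is that the caller still sees the domain as failed, but has a string that shows where the problem is.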