`string.new_wtf8` versus invalid byte sequences
wingo opened this issue · 1 comments
Consider what happens when you invoke `string.new_wtf8` with invalid data. There are different users:
- Some users would want to trap if the bytes are not valid UTF-8. These users do not expect to ever receive invalid UTF-8 and would rather trap than handle invalid UTF-8. The current behavior that accepts WTF-8 isolated surrogates is inappropriate for these users.
- Some users would want to trap if the bytes are not valid WTF-8. These users are custom-built for interoperation with Java/JavaScript strings. As WTF-8 isn't a Unicode encoding scheme and should only originate in program-controlled values, we can assume these strings are well-encoded and therefore do not expect to see invalid WTF-8; if we do see invalid WTF-8, trapping is the right answer. `string.new_wtf8` is great for these users.
- Some users would want to detect invalid UTF-8 and handle this situation with their own logic. These users may receive UTF-8 over the internet. The current behavior that traps on invalid UTF-8 and that accepts isolated surrogates is inappropriate for these users.
- Some users would just want to keep on trucking, replacing bad sequences with U+FFFD. The Unicode standard sets out guidelines for replacing invalid subsequences in a section entitled "U+FFFD Substitution of Maximal Subparts" (page 126 of Unicode 14.0.0). The current behavior that never inserts U+FFFD is inappropriate for these users.
I know that serving all users isn't quite in the requirements, but I think it would be an error to add something to the internet that by default accepts WTF-8 in situations where only UTF-8 is valid.
I think we could fix this by adding an immediate `$wtf8_policy` operand to `string.new_wtf8`. The `replace` policy would use the sloppy UTF-8 decoder. Web browsers happen to implement this already, so it's not much implementation burden there.
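To make the policy idea concrete, here is a rough Python sketch of how such an operand could select between behaviors. It is only an illustration under assumptions: the names `Policy`, `Trap`, and `new_string` are hypothetical, the `UTF8` and `WTF8` policy names are my guesses (only `replace` is named above), and Python's built-in codec error handlers stand in for whatever decoder an engine would actually use.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical policy names; only "replace" appears in the proposal text.
    UTF8 = "utf8"        # trap unless the bytes are valid UTF-8
    WTF8 = "wtf8"        # trap unless the bytes are valid WTF-8
    REPLACE = "replace"  # never trap; substitute U+FFFD for bad sequences


class Trap(Exception):
    """Stands in for a WebAssembly trap."""


def _has_encoded_surrogate_pair(text: str) -> bool:
    # WTF-8 forbids a high surrogate directly followed by a low surrogate
    # when each was encoded as its own 3-byte sequence; such a pair must be
    # encoded as a single 4-byte sequence instead.
    return any(
        "\ud800" <= a <= "\udbff" and "\udc00" <= b <= "\udfff"
        for a, b in zip(text, text[1:])
    )


def new_string(data: bytes, policy: Policy) -> str:
    """Decode `data` according to `policy` (illustrative only)."""
    if policy is Policy.REPLACE:
        # Undecodable bytes become U+FFFD; this policy never traps.
        return data.decode("utf-8", errors="replace")
    try:
        if policy is Policy.UTF8:
            # Strict UTF-8: isolated surrogates are rejected.
            return data.decode("utf-8", errors="strict")
        # WTF-8: like UTF-8, but 3-byte encoded surrogates are let through.
        text = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError as exc:
        raise Trap(f"invalid byte sequence at offset {exc.start}") from exc
    if _has_encoded_surrogate_pair(text):
        raise Trap("surrogate pair encoded as two 3-byte sequences")
    return text
```

With this sketch, the isolated surrogate `b"\xed\xa0\x80"` decodes under `Policy.WTF8`, traps under `Policy.UTF8`, and is replaced by U+FFFD characters under `Policy.REPLACE`.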
We could add a `string.validate_utf8` instruction that would validate bytes in memory (or, in a future variant, in a GC array) and return 1 or 0. This would only be useful for the third use case above (detecting invalid UTF-8 and handling it with custom logic), but then which offset was invalid and what to do with it would be up to the user. I think that perhaps there are few such users, and as they could provide this functionality themselves, we can avoid adding it.
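For a sense of how little such an instruction would give the user, here is a minimal Python sketch. `validate_utf8` mirrors the proposed 1-or-0 result, and `first_invalid_offset` is a hypothetical user-side helper showing the extra work of locating the bad bytes; both names and the use of Python's strict decoder are assumptions made for illustration.

```python
from typing import Optional


def validate_utf8(data: bytes) -> int:
    """Return 1 if `data` is valid UTF-8, else 0."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return 0
    return 1


def first_invalid_offset(data: bytes) -> Optional[int]:
    """User-side logic: byte offset of the first invalid sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # `start` is the index of the first byte the decoder rejected.
        return exc.start
    return None
```

A caller that receives 0 from `validate_utf8` still has to rescan the bytes, as `first_invalid_offset` does, before deciding how to recover.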