`string.new_wtf8` versus invalid byte sequences

Question

wingo opened this issue 3 years ago · 1 comments

Consider what happens when you invoke string.new_wtf8 with invalid data. There are different users:

1. Some users want to trap if the bytes are not valid UTF-8. These users never expect to receive invalid UTF-8 and would rather trap than handle it. The current behavior, which accepts WTF-8 isolated surrogates, is inappropriate for these users.
2. Some users want to trap if the bytes are not valid WTF-8. These users are custom-built for interoperation with Java/JavaScript strings. As WTF-8 isn't a Unicode encoding scheme and should only originate in program-controlled values, we can assume these strings are well-encoded and do not expect to see invalid WTF-8; if we do see it, trapping is the right answer. string.new_wtf8 is great for these users.
3. Some users want to detect invalid UTF-8 and handle the situation with their own logic. These users may receive UTF-8 over the internet. The current behavior, which traps on invalid UTF-8 and accepts isolated surrogates, is inappropriate for these users.
4. Some users just want to keep on trucking, replacing bad sequences with U+FFFD. The Unicode standard sets out guidelines for replacing invalid subsequences in a section entitled "U+FFFD Substitution of Maximal Subparts" (page 126 of Unicode 14.0.0). The current behavior, which never inserts U+FFFD, is inappropriate for these users.
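
To make the difference between these expectations concrete, here is a rough illustration using Python's built-in UTF-8 codec rather than any WebAssembly implementation. The `try_decode` helper and the sample names are made up for this sketch, and Python's `surrogatepass` handler is only an approximation of WTF-8 (it is more permissive about surrogate pairs).

```python
def try_decode(data: bytes, errors: str):
    """Return the decoded string, or None where a trap would occur."""
    try:
        return data.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None


samples = {
    "valid": "h\u00e9llo".encode("utf-8"),  # well-formed UTF-8
    "lone_surrogate": b"\xed\xa0\x80",      # U+D800 in generalized UTF-8
    "garbage": b"a\xffb",                   # 0xFF never occurs in UTF-8
}

# strict        ~ trap unless the bytes are valid UTF-8
# surrogatepass ~ also accept isolated surrogates (approximates WTF-8)
# replace       ~ never fail, substitute U+FFFD for bad sequences
results = {
    name: {mode: try_decode(data, mode)
           for mode in ("strict", "surrogatepass", "replace")}
    for name, data in samples.items()
}
```

Only the `lone_surrogate` sample separates the first two handlers: it fails under `strict` but decodes under `surrogatepass`, which is the gap between a UTF-8 user and a WTF-8 user.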

I know that serving all users isn't quite in the requirements, but I think it would be an error to add something to the internet that by default accepts WTF-8 in situations where only UTF-8 is valid.

I think we could fix this by adding an immediate $wtf8_policy operand to string.new_wtf8. Under a replace policy, the instruction would use the sloppy UTF-8 decoder, substituting U+FFFD for invalid sequences. Web browsers happen to implement this decoder already, so it's not much implementation burden there.
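
As a sketch of what a policy operand might look like, the Python model below dispatches on a policy value. Everything here is an assumption for illustration: the `Wtf8Policy` member names, the `Trap` exception standing in for a WebAssembly trap, and the `string_new_wtf8` signature are not taken from the proposal text.

```python
import re
from enum import Enum


class Wtf8Policy(Enum):
    """Hypothetical values for the $wtf8_policy immediate."""
    UTF8 = "utf8"        # trap unless the bytes are valid UTF-8
    WTF8 = "wtf8"        # trap unless the bytes are valid WTF-8
    REPLACE = "replace"  # never trap, substitute U+FFFD


class Trap(Exception):
    """Stands in for a WebAssembly trap in this model."""


# A high surrogate directly followed by a low surrogate. WTF-8 forbids
# encoding such a pair as two 3-byte sequences.
_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def string_new_wtf8(memory: bytes, ptr: int, length: int,
                    policy: Wtf8Policy) -> str:
    if ptr < 0 or length < 0 or ptr + length > len(memory):
        raise Trap("out-of-bounds memory access")
    data = bytes(memory[ptr:ptr + length])

    if policy is Wtf8Policy.REPLACE:
        return data.decode("utf-8", errors="replace")

    errors = "strict" if policy is Wtf8Policy.UTF8 else "surrogatepass"
    try:
        text = data.decode("utf-8", errors=errors)
    except UnicodeDecodeError as err:
        raise Trap(f"invalid byte sequence at offset {err.start}") from err

    # surrogatepass also accepts 3-byte-encoded surrogate pairs; reject them.
    if policy is Wtf8Policy.WTF8 and _SURROGATE_PAIR.search(text):
        raise Trap("surrogate pair encoded as two 3-byte sequences")
    return text
```

In this model, the same bytes `b"a\xed\xa0\x80b"` produce a string with an isolated surrogate under `WTF8`, a trap under `UTF8`, and a string containing U+FFFD under `REPLACE`.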

We could add a string.validate_utf8 instruction that would validate bytes in memory (or, in a future variant, in a GC array) and return 1 or 0. This would only be useful for use case (3), and even then, finding which offset was invalid and deciding what to do about it would be up to the user. I think there are probably few such users, and since they could implement this functionality themselves, we can avoid adding it.
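
For illustration only, a validator of this shape could be modelled in Python as below. The function names and signatures are assumptions, the range is assumed to be in bounds, and `first_invalid_offset` represents logic that would live in user code rather than in the instruction.

```python
def string_validate_utf8(memory: bytes, ptr: int, length: int) -> int:
    """Model of the instruction: 1 if the range is valid UTF-8, else 0."""
    try:
        memory[ptr:ptr + length].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return 1


def first_invalid_offset(memory: bytes, ptr: int, length: int):
    """User-side helper: memory offset of the first bad byte, or None."""
    try:
        memory[ptr:ptr + length].decode("utf-8")
    except UnicodeDecodeError as err:
        return ptr + err.start
    return None


memory = b"ok\xffno"
is_valid = string_validate_utf8(memory, 0, len(memory))
bad_at = first_invalid_offset(memory, 0, len(memory))
```

The 1-or-0 result says nothing about where the problem is, so a user who wants to recover has to rescan the bytes anyway, which is part of why the instruction may not be worth adding.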

Answer 1 · 2022-06-23T08:59:11.000Z

I believe that this one is fixed via #22. However, I will leave it open for a little while in case there are comments.