more precise description for UTF-8 behavior

Question

matu3ba opened this issue 3 years ago · 4 comments

The README says "Assumes valid UTF8, but does not misbehave if input contains bad UTF8", but this does not specify (1) what this assumption is made for, or (2) which parts of UTF8 are checked (codepoints, grapheme clusters, or something else).

Typically, one means codepoints, but this is not explicit in the text.

Answer 1 · 2022-04-04T06:48:39.000Z

Thank you for the comment. Could you clarify what you are suggesting should be changed?

The only time that anything related to this topic matters is when the input is bad UTF8. The documentation simply says "zsv won't misbehave when you give it bad UTF8". It doesn't really matter if you have a different definition of "bad UTF8" so long as whatever that is, zsv does not "misbehave".

Given that, is there some "bad UTF8" input that results in some sort of "misbehavior" that should be addressed? If so, could you please clarify what that input is, and what the resulting misbehavior is?

Answer 2 · 2022-04-04T07:48:15.000Z

"Could you clarify what you are suggesting should be changed?"

I would expect something along the lines of:

"Assumes that the necessary delimiters, newline symbols, etc. in the input data are identifiable and that the input is encoded as UTF-8",
"Checks validity of UTF-8 codepoints" (with user feedback? Is this optional, or SIMD with negligible cost?)

Or do you go the longer way and do 2 parses: 1. read potential special symbols as ASCII, 2. read input as UTF-8 and check consistency with 1?
If this is the case: Is this optional?

Or what is the behavior? It is not evident to me what happens in the bad cases (I guess it's just returning with an exit code?).

Answer 3 · 2022-04-05T05:41:22.000Z

I'm not really sure what your suggestion would accomplish. It does not seem that the description today is inaccurate, and it sounds like your suggested description likely would be.

Just to clarify, zsv does not check utf8 validity. The README never claims to do that, because it doesn't. There is a difference between checking something and assuming something to be true. The latter specifically does not check at all, because it assumes that checking is not required. Rather, the README simply says that if this assumption is wrong, it won't "misbehave" i.e. crash.
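
To make the distinction between checking and assuming concrete, here is a minimal Python sketch. It is an illustration only, not zsv's code, and both function names are hypothetical. The first function checks validity by attempting a strict decode; the second assumes validity and never inspects the bytes at all.

```python
def check_utf8(data: bytes) -> bool:
    """Checking: decode strictly and report whether the input is valid UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def assume_utf8(data: bytes) -> bytes:
    """Assuming: perform no validation; the bytes pass through unchanged."""
    return data


good = "naïve,café".encode("utf-8")
bad = b"abc,\xff\xfe,def"  # the byte 0xFF never occurs in valid UTF-8

print(check_utf8(good), check_utf8(bad))  # True False
print(assume_utf8(bad) == bad)            # True
```

The checking variant has to read the whole input before it can answer; the assuming variant does no such work, which is the sense of "assumes" used here.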

In other words, the README states (accurately, afaik) that zsv simply:

"Assumes UTF8 input". Translation: if you give it UTF8 input, it should give you correct output.
"does not misbehave if input contains bad UTF8". Translation: if you give it non-UTF8 input, it will not crash. (Furthermore, it should continue to parse rows and cells if any valid UTF8 delimiters follow the invalid bytes, but no promise is being made here. Perhaps there should be, but if that's your comment, we need more info to do anything about it.)

If you can provide an example where either of the above statements is false, then by all means we should fix it. In that case, please provide the input that caused either incorrect output (for valid UTF8) or a crash (for invalid UTF8). Otherwise, the README seems accurate and complete as is.
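
As a rough illustration of how a parser can keep producing rows and cells after invalid bytes, here is a Python sketch. It assumes the parser scans raw bytes for ASCII delimiters, which is an assumption made for this example and not a description of zsv's implementation; `split_rows` is a hypothetical name, and quoting is ignored. In valid UTF-8 every byte of a multi-byte sequence is 0x80 or above, so a comma or newline byte is never part of another character, and any invalid bytes are simply carried along inside their cell.

```python
from typing import List


def split_rows(data: bytes) -> List[List[bytes]]:
    """Split on ASCII newline and comma bytes without decoding or validating."""
    rows = []
    for line in data.split(b"\n"):
        if line:  # skip the empty chunk after a trailing newline
            rows.append(line.split(b","))
    return rows


# First row contains invalid UTF-8 (0xFF 0xFE); second row is valid UTF-8.
data = b"a,\xff\xfe,c\n" + "é,ü,z\n".encode("utf-8")

for row in split_rows(data):
    print(row)
```

Both rows come back with three cells each: the invalid bytes stay inside the middle cell of the first row, and the delimiters that follow them are still recognized.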

Answer 4 · 2022-04-05T07:09:00.000Z

Ok, that makes sense to me.
My main trouble was understanding what "misbehave" would mean.