UTF8 validity check gives incorrect results
zoltantirinda opened this issue · 1 comments
zoltantirinda commented
Hi Daniel,
I was playing with your functions and found that the validate_utf8_double(...) and validate_utf8_branchless(...) functions in
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2018/05/08/checkutf8.c
give incorrect results.
- In the while loop, instead of `c[half] > 0x80` you should have `c[half] >= 0x80`.
- `for (int j = half * 2; j < len; j++)` gives a warning. Instead of `int`, you should use `size_t`.
- `const char* invalid1 = "\xC3\x28";` and `const char* invalid12 = "\xC2\x7F";` give an incorrect result: `s1 == 2` (not 1), so the condition at the end, `(s1 != UTF8_REJECT) && (s2 != UTF8_REJECT)`, is true. Shouldn't we check against UTF8_ACCEPT here, using `(state1 == UTF8_ACCEPT) && (state2 == UTF8_ACCEPT)`?
- The same problem is in `validate_utf8_branchless(...)` as well. For `const char* invalid1 = "\xC3";` the `state == 2` (not 1) and the last comparison, `state != UTF8_REJECT`, is true. I think this should be changed to `state == UTF8_ACCEPT`.
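To make the points above concrete, here is a minimal sketch of a two-halves validator. It is not the code from checkutf8.c: the transition function is a simplified stand-in that only counts pending continuation bytes (it does not reject overlong encodings or surrogates), and the names `step`, `is_continuation`, `validate_two_halves` and the `ST_*` states are hypothetical. Only ACCEPT = 0 and REJECT = 1 are taken from the values mentioned in this report. The `strict` flag switches between the `== ACCEPT` check proposed here and the `!= REJECT` check being reported.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state numbering; only ACCEPT = 0 and REJECT = 1 mirror the report. */
enum { ST_ACCEPT = 0, ST_REJECT = 1, ST_NEED1 = 2, ST_NEED2 = 3, ST_NEED3 = 4 };

/* Continuation bytes are 10xxxxxx, i.e. 0x80..0xBF inclusive (note >=, not >). */
static bool is_continuation(unsigned char b) {
    return b >= 0x80 && b <= 0xBF;
}

/* Simplified transition (assumption): tracks only how many continuation
   bytes are still expected; no overlong or surrogate checks. */
static unsigned step(unsigned state, unsigned char b) {
    if (state == ST_REJECT) return ST_REJECT;
    if (state == ST_ACCEPT) {
        if (b < 0x80) return ST_ACCEPT;
        if (b >= 0xC2 && b <= 0xDF) return ST_NEED1;
        if (b >= 0xE0 && b <= 0xEF) return ST_NEED2;
        if (b >= 0xF0 && b <= 0xF4) return ST_NEED3;
        return ST_REJECT; /* stray continuation byte or invalid lead byte */
    }
    if (!is_continuation(b)) return ST_REJECT;
    return state == ST_NEED1 ? ST_ACCEPT : state - 1;
}

static bool validate_two_halves(const char *s, size_t len, bool strict) {
    const unsigned char *c = (const unsigned char *)s;

    /* Move the split point back until it is not on a continuation byte. */
    size_t half = len / 2;
    while (half > 0 && is_continuation(c[half])) half--;

    unsigned s1 = ST_ACCEPT, s2 = ST_ACCEPT;
    for (size_t i = 0; i < half; i++) {
        s1 = step(s1, c[i]);        /* first half  */
        s2 = step(s2, c[half + i]); /* second half */
    }
    /* Leftover bytes of the second half; size_t matches the type of len. */
    for (size_t j = half * 2; j < len; j++) s2 = step(s2, c[j]);

    if (strict) return s1 == ST_ACCEPT && s2 == ST_ACCEPT; /* proposed check */
    return s1 != ST_REJECT && s2 != ST_REJECT;             /* reported check */
}
```

In this sketch, `validate_two_halves("\xC3\x28", 2, false)` returns true because the first half ends in state 2 (one continuation byte still expected), which is neither ACCEPT nor REJECT. With `strict` set to true the same input is rejected, as is the truncated `"\xC3"`. The input `"a\xC2\x80" "b"` shows why the lower bound must include 0x80: a `>` comparison would not treat 0x80 as a continuation byte, so the split would land in the middle of a character.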
lemire commented
I recommend against using this code. I have added a link in the header to the library that we do support:
https://github.com/lemire/fastvalidate-utf-8
Thanks for your report. I am not going to fix this file, since we have a bona fide library with tests... but if you want to issue a PR that fixes the issue you raise, I will take it.