lemire/Code-used-on-Daniel-Lemire-s-blog

UTF8 validity check gives incorrect results

zoltantirinda opened this issue · 1 comment

Hi Daniel,

I was playing with your functions and found that the validate_utf8_double(...) and validate_utf8_branchless(...) functions in

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2018/05/08/checkutf8.c

give incorrect results.

  1. In the while loop, c[half] > 0x80 should be c[half] >= 0x80.
  2. for (int j = half * 2; j < len; j++) gives a warning. Use size_t instead of int.
  3. const char* invalid1 = "\xC3\x28"; and const char* invalid12 = "\xC2\x7F"; give an incorrect result: s1 == 2 (not 1), so the condition at the end, (s1 != UTF8_REJECT) && (s2 != UTF8_REJECT), is true. Shouldn't we check against UTF8_ACCEPT here, using (state1 == UTF8_ACCEPT) && (state2 == UTF8_ACCEPT)?
  4. The same problem exists in validate_utf8_branchless(...). For const char* invalid1 = "\xC3"; the state is 2 (not 1), so the last comparison, state != UTF8_REJECT, is true. I think this should be changed to state == UTF8_ACCEPT.
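
To make points 3 and 4 concrete, here is a minimal sketch of why the final comparison matters. It is not the code from checkutf8.c: the `toy_*` names are hypothetical, and the state machine is deliberately simplified to handle only ASCII and two-byte sequences (it rejects every other lead byte, including valid three- and four-byte ones). The state values 0, 1 and 2 are chosen only to match the numbers quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state numbering, chosen to match the values in the report. */
enum { TOY_ACCEPT = 0, TOY_REJECT = 1, TOY_NEED_ONE = 2 };

static bool toy_is_continuation(uint8_t b) { return b >= 0x80 && b <= 0xBF; }

/* Simplified transition: ASCII and two-byte sequences only (an assumption
   made to keep the sketch short; a real validator covers all lengths). */
static uint32_t toy_update(uint32_t state, uint8_t byte) {
  if (state == TOY_REJECT) return TOY_REJECT;
  if (state == TOY_NEED_ONE) {
    return toy_is_continuation(byte) ? TOY_ACCEPT : TOY_REJECT;
  }
  if (byte < 0x80) return TOY_ACCEPT;                    /* ASCII */
  if (byte >= 0xC2 && byte <= 0xDF) return TOY_NEED_ONE; /* two-byte lead */
  return TOY_REJECT; /* out of scope for this toy */
}

/* Feed len bytes through the state machine and return the final state. */
static uint32_t toy_run(const char *s, size_t len) {
  uint32_t state = TOY_ACCEPT;
  for (size_t i = 0; i < len; i++) {
    state = toy_update(state, (uint8_t)s[i]);
  }
  return state;
}

/* Single-stream check. strict == false mimics "state != REJECT";
   strict == true uses "state == ACCEPT". */
static bool toy_valid(const char *s, size_t len, bool strict) {
  uint32_t state = toy_run(s, len);
  return strict ? (state == TOY_ACCEPT) : (state != TOY_REJECT);
}

/* Two-halves check: validate each half independently, then combine. */
static bool toy_valid_split(const char *s, size_t len, bool strict) {
  size_t half = len / 2;
  /* Move the split point back so it never lands on a continuation byte. */
  while (half > 0 && toy_is_continuation((uint8_t)s[half])) {
    half--;
  }
  uint32_t s1 = toy_run(s, half);
  uint32_t s2 = toy_run(s + half, len - half);
  if (strict) {
    return (s1 == TOY_ACCEPT) && (s2 == TOY_ACCEPT);
  }
  return (s1 != TOY_REJECT) && (s2 != TOY_REJECT);
}
```

In this sketch, the truncated input "\xC3" ends in the intermediate state 2, so the loose single-stream check accepts it and the strict one does not. For "\xC3\x28" the split lands between the two bytes: the first half ends in state 2 and the second half ends in state 0, so only the comparison against the accept state catches the error.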

I recommend against using this code. I have added a link in the header to the library that we do support...

https://github.com/lemire/fastvalidate-utf-8

Thanks for your report. I am not going to fix this file, since we have a bona fide library with tests... but if you want to issue a PR that fixes the issue you raise, I will take it.