FStar_String_strlen() is unsound given a string containing UTF-8 multibyte-encoded characters
landonf opened this issue ยท 3 comments
FStar_String_strlen() returns the number of bytes, whereast FStar.String.strlen() returns the number of code points.
Test case:
module Strlen
module ST = FStar.HyperStack.ST
inline_for_extraction noextract
let u8mbstr = "๐" // U+1F4DA (Books Emoji); encoded in utf8 as '\xf0\x9f\x93\x9a'
let _ = assert_norm(String.strlen u8mbstr == 1) // String.strlen returns the number of codepoints
let main (argc:Int32.t) (argv: FStar.Buffer.buffer (FStar.Buffer.buffer C.char))
: ST.Stack C.exit_code
(requires (fun _ -> True))
(ensures (fun _ _ _ -> True))
= let ex1 = String.strlen u8mbstr in
let ex2 = normalize_term (String.strlen u8mbstr) in
if ex1 = ex2 then
(printf "ex1 = ex2 (correct!)\n" done; C.EXIT_SUCCESS)
else
(printf "ex1 <> ex2 (incorrect!)\n" done; C.EXIT_FAILURE)
Generated C:
exit_code main(int32_t argc, char **argv)
{
Prims_int ex1 = FStar_String_strlen("\xf0\x9f\x93\x9a");
Prims_int ex2 = (krml_checked_int_t)1;
if (ex1 == ex2)
{
LowStar_Printf_print_string("ex1 = ex2 (correct!)\n");
return EXIT_SUCCESS;
}
else
{
LowStar_Printf_print_string("ex1 <> ex2 (incorrect!)\n");
return EXIT_FAILURE;
}
}
Result:
krml -verbose -no-prefix Strlen -bundle 'Strlen=*' Strlen.fst && ./a.out
ex1 <> ex2 (incorrect!)
I'm required to state: Not a Contribution.
Excellent bug report, thanks. This got cleaned up a while ago on the F* side and I never updated the corresponding C implementations. (I'd also have to check that the F* lexer rejects string literals that contain \0).
I'm also realizing that FStar.String.strlen, contrary to what I wrote in the tutorial, is not exactly valid Low* because it returns a mathematical integer, which probably triggered Warning 15 ("this code is not low*: it uses mathematical integers"). I guess I could add a strlen64
version. Are you using strings for anything beyond printing and debugging?
I'm also realizing that index_of
is also probably quite incorrect...
Are you using strings for anything beyond printing and debugging?
Using string literals a bit (via LowStar.Literal), operating solely on their immutable buffer representations.