FStarLang/karamel

FStar_String_strlen() is unsound given a string containing UTF-8 multibyte-encoded characters

landonf opened this issue ยท 3 comments

FStar_String_strlen() returns the number of bytes, whereast FStar.String.strlen() returns the number of code points.

Test case:

module Strlen

module ST = FStar.HyperStack.ST

inline_for_extraction noextract
let u8mbstr = "๐Ÿ“š" // U+1F4DA (Books Emoji); encoded in utf8 as '\xf0\x9f\x93\x9a'
let _ = assert_norm(String.strlen u8mbstr == 1) // String.strlen returns the number of codepoints

let main (argc:Int32.t) (argv: FStar.Buffer.buffer (FStar.Buffer.buffer C.char)) 
: ST.Stack C.exit_code
  (requires (fun _ -> True))
  (ensures (fun _ _ _ -> True))
= let ex1 = String.strlen u8mbstr in
  let ex2 = normalize_term (String.strlen u8mbstr) in

  if ex1 = ex2 then
    (printf "ex1 = ex2 (correct!)\n" done; C.EXIT_SUCCESS)
  else
    (printf "ex1 <> ex2 (incorrect!)\n" done; C.EXIT_FAILURE)

Generated C:

exit_code main(int32_t argc, char **argv)
{
  Prims_int ex1 = FStar_String_strlen("\xf0\x9f\x93\x9a");
  Prims_int ex2 = (krml_checked_int_t)1;
  if (ex1 == ex2)
  {
    LowStar_Printf_print_string("ex1 = ex2 (correct!)\n");
    return EXIT_SUCCESS;
  }
  else
  {
    LowStar_Printf_print_string("ex1 <> ex2 (incorrect!)\n");
    return EXIT_FAILURE;
  }
}

Result:

krml -verbose -no-prefix Strlen -bundle 'Strlen=*' Strlen.fst && ./a.out
ex1 <> ex2 (incorrect!)

I'm required to state: Not a Contribution.

Excellent bug report, thanks. This got cleaned up a while ago on the F* side and I never updated the corresponding C implementations. (I'd also have to check that the F* lexer rejects string literals that contain \0).

I'm also realizing that FStar.String.strlen, contrary to what I wrote in the tutorial, is not exactly valid Low* because it returns a mathematical integer, which probably triggered Warning 15 ("this code is not low*: it uses mathematical integers"). I guess I could add a strlen64 version. Are you using strings for anything beyond printing and debugging?

I'm also realizing that index_of is also probably quite incorrect...

Are you using strings for anything beyond printing and debugging?

Using string literals a bit (via LowStar.Literal), operating solely on their immutable buffer representations.