janestreet/sexplib

UTF-8 Safe Mode

dsheets opened this issue · 2 comments

Right now, sexplib uses String.escaped for serializing strings. If those strings contain high bytes that are not part of UTF-8 encoded sequences, they will be output as-is. This results in behavior like:

# Format.printf "%a@." Sexp.pp_mach (Sexp.of_string "(String\"\247\")");;
(String �)

When sexplib is used for logging and debugging, this can cause issues when valid UTF-8 text is expected. Perhaps the function used to escape strings could be parameterized? It would be really nice to output UTF-8-safe strings efficiently, that is, in a single pass rather than generating the buffer, scanning it for non-UTF-8 bytes, and copying into a second buffer.
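To make the request concrete, here is a minimal sketch of what a single-pass, UTF-8-aware escaper might look like. The names `add_ascii_escaped` and `escape_utf8_safe` are hypothetical and not part of sexplib, and the sketch assumes OCaml 4.14 or later for `String.get_utf_8_uchar`. Well-formed multi-byte sequences are copied through; stray high bytes become decimal escapes.

```ocaml
(* Hypothetical helper, not sexplib API: escape one ASCII byte roughly the
   way String.escaped does for the ASCII range. *)
let add_ascii_escaped buf c =
  match c with
  | '"' -> Buffer.add_string buf "\\\""
  | '\\' -> Buffer.add_string buf "\\\\"
  | '\n' -> Buffer.add_string buf "\\n"
  | '\t' -> Buffer.add_string buf "\\t"
  | '\r' -> Buffer.add_string buf "\\r"
  | '\b' -> Buffer.add_string buf "\\b"
  | ' ' .. '~' -> Buffer.add_char buf c
  | _ -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c))

(* Hypothetical escaper: one pass over the input, one output buffer.
   Assumes OCaml >= 4.14 (String.get_utf_8_uchar, Uchar.utf_decode_*). *)
let escape_utf8_safe (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then begin
      let c = s.[i] in
      if Char.code c < 0x80 then begin
        add_ascii_escaped buf c;
        go (i + 1)
      end else begin
        let d = String.get_utf_8_uchar s i in
        if Uchar.utf_decode_is_valid d then begin
          (* Well-formed multi-byte sequence: copy it through unchanged. *)
          let len = Uchar.utf_decode_length d in
          Buffer.add_substring buf s i len;
          go (i + len)
        end else begin
          (* Stray high byte: emit a decimal escape such as \247, then
             re-examine the next byte on its own. *)
          Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c));
          go (i + 1)
        end
      end
    end
  in
  go 0;
  Buffer.contents buf
```

With this sketch, `escape_utf8_safe "\247"` yields the four characters `\247`, while a valid sequence such as the two bytes encoding é passes through untouched. A function of this shape (string to string) is the kind of thing that could be passed in if the escaping step were parameterized.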

Escaping all non-ASCII characters seems like a good default. I submitted a change internally; it should be ready for the next release.
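For illustration only, the "escape everything outside ASCII" default could look roughly like the sketch below. This is not the internal change mentioned above, which is not shown in this thread; `escape_non_ascii` is a hypothetical name and the exact escape format is an assumption.

```ocaml
(* Hypothetical sketch, not sexplib's implementation: every byte outside
   printable ASCII becomes a three-digit decimal escape. *)
let escape_non_ascii (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | ' ' .. '~' -> Buffer.add_char buf c
      | _ -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents buf
```

The output of such a function is pure ASCII and therefore always valid UTF-8. The trade-off is that legitimate multi-byte text is escaped too: the two bytes encoding é come out as `\195\169`.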

Great! Thanks!