atdgen-ocaml: UTF-8 vs. byte-array strings
smondet commented
(I've the impression this should be an FAQ but could not find any discussion on this:)
Atdgen maps ATD “strings” to JSON strings, which are supposed to be valid Unicode (UTF-8 in practice), and also directly to OCaml string values, which can be arbitrary byte arrays.
- This makes it very easy to generate invalid JSON which then fails with other parsers: e.g., this Gist shows Jsonm failing with "illegal bytes in character stream" while J.string_of_t0 |> J.t0_of_string succeeds.
- The “data-encoding” world often uses this as its default solution for byte arrays: https://gitlab.com/nomadic-labs/data-encoding/-/blob/master/src/json.ml#L125-L145 → if a string is not valid UTF-8, it becomes an array of ints.
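To make the second point concrete, here is a minimal sketch of that kind of fallback. It is not the data-encoding code itself: the `json` type and the names `json_of_bytes` and `bytes_of_json` are made up for illustration, and it assumes OCaml 4.14 or later for `String.is_valid_utf_8`.

```ocaml
(* Requires OCaml >= 4.14 for String.is_valid_utf_8. *)

(* Hypothetical stand-in for a JSON value; not a type from atdgen,
   Yojson, or data-encoding. *)
type json =
  | Utf8_string of string
  | Byte_array of int list

(* Valid UTF-8 is kept as a JSON string; anything else is turned into
   a list of byte values in the range 0..255. *)
let json_of_bytes (s : string) : json =
  if String.is_valid_utf_8 s then Utf8_string s
  else Byte_array (List.map Char.code (List.of_seq (String.to_seq s)))

(* Inverse direction. Char.chr raises Invalid_argument if an element
   is outside 0..255. *)
let bytes_of_json : json -> string = function
  | Utf8_string s -> s
  | Byte_array codes -> String.of_seq (List.to_seq (List.map Char.chr codes))
```

The price of this approach is that the JSON shape depends on the data: readers have to accept both a string and an array for the same field.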
Should Mod_j functions have the option of failing earlier if an input string is not valid UTF-8? (I guess that would mean having default or first-class-citizen validator entries? -j-pp seems to only work in one direction.)
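As a rough idea of what "failing earlier" could look like on the OCaml side, here is a sketch of a check that runs before serialization. The names `check_utf8` and `string_of_checked` are hypothetical and nothing here is generated by atdgen; `to_json` merely stands in for a serializer. It assumes OCaml 4.14 or later for `String.get_utf_8_uchar` and the `Uchar.utf_decode_*` functions.

```ocaml
(* Requires OCaml >= 4.14. *)

(* Returns [Ok s] if [s] is valid UTF-8, otherwise [Error i] where [i]
   is the byte offset of the first invalid sequence. *)
let check_utf8 (s : string) : (string, int) result =
  let len = String.length s in
  let rec loop i =
    if i >= len then Ok s
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then loop (i + Uchar.utf_decode_length d)
      else Error i
  in
  loop 0

(* Hypothetical wrapper: [to_json] stands in for a serializer applied
   to a single string; the check runs first, so invalid input never
   reaches the output. *)
let string_of_checked (to_json : string -> string) (s : string) : string =
  match check_utf8 s with
  | Ok s -> to_json s
  | Error i -> invalid_arg (Printf.sprintf "invalid UTF-8 at byte %d" i)
```

Reporting the byte offset rather than a plain boolean makes the error easier to act on when the offending string is buried in a large record.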
Does it make sense to add a byte-array core type to ATD?
Many tools already just don't care; should this simply be documented properly somewhere?
Right now the ATD definition doc just says “Sequence of bytes or characters” …