atdgen-ocaml: utf-8 Vs byte-array strings

Question


Opened this issue 24 days ago · 0 comments

(I have the impression this should be an FAQ, but I could not find any discussion of it.)

Atdgen maps ATD “strings” to JSON strings which are supposed to be valid Unicode (UTF-8 in practice), and also directly to OCaml string values which can be arbitrary byte-arrays.

This makes it very easy to generate invalid JSON, which then fails with other parsers. For example, this Gist shows Jsonm failing with "illegal bytes in character stream" while J.string_of_t0 |> J.t0_of_string succeeds.
The “data-encoding” world often uses the following as its default solution for byte-arrays: https://gitlab.com/nomadic-labs/data-encoding/-/blob/master/src/json.ml#L125-L145 → if a string is not valid UTF-8, it becomes an array of ints.
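
To make the fallback idea concrete, here is a minimal sketch of what I mean. It is not the actual data-encoding code: the `json_bytes` type and the `encode_bytes` / `decode_bytes` names are made up for illustration, and it assumes OCaml >= 4.14 for `String.is_valid_utf_8`.

```ocaml
(* Hypothetical sketch of the "string, or array of ints" fallback.
   Assumes OCaml >= 4.14 (String.is_valid_utf_8). *)

type json_bytes =
  | Text of string     (* valid UTF-8: safe to emit as a JSON string *)
  | Bytes of int list  (* fallback: one int in 0..255 per byte *)

(* Pick the representation based on whether the bytes are valid UTF-8. *)
let encode_bytes (s : string) : json_bytes =
  if String.is_valid_utf_8 s then Text s
  else Bytes (List.map Char.code (List.of_seq (String.to_seq s)))

(* Inverse direction; Char.chr raises Invalid_argument on ints outside 0..255. *)
let decode_bytes : json_bytes -> string = function
  | Text s -> s
  | Bytes codes -> String.of_seq (Seq.map Char.chr (List.to_seq codes))
```

With this, any OCaml string round-trips, at the cost of the JSON shape depending on the content, which readers in other languages would have to know about.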

Should Mod_j functions have the option of failing earlier if an input string is not valid UTF-8? (I guess that would mean having default or first-class-citizen validator entries? -j-pp seems to only work in one direction).
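
As a rough illustration of "failing earlier", something like the following could run before serialization. This is only a sketch and not an existing atdgen feature: `first_invalid_utf8` and `checked_serialize` are hypothetical names, the `to_json` argument stands in for whatever generated writer would be wrapped, and it assumes OCaml >= 4.14 for `String.get_utf_8_uchar`.

```ocaml
(* Hypothetical sketch of an early UTF-8 check; assumes OCaml >= 4.14. *)

(* Byte offset of the first invalid UTF-8 sequence, or None if s is valid. *)
let first_invalid_utf8 (s : string) : int option =
  let len = String.length s in
  let rec loop i =
    if i >= len then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then loop (i + Uchar.utf_decode_length d)
      else Some i
  in
  loop 0

(* Wrap a writer so that it refuses non-UTF-8 input instead of emitting
   invalid JSON. to_json is a stand-in for a generated string writer. *)
let checked_serialize (to_json : string -> string) (s : string)
  : (string, string) result =
  match first_invalid_utf8 s with
  | None -> Ok (to_json s)
  | Some i -> Error (Printf.sprintf "invalid UTF-8 at byte offset %d" i)
```

Reporting the offset rather than a plain boolean would make the error easier to act on when the bad string is buried in a large record.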

Does it make sense to add a byte-array core type to ATD?

Many tools already just don't care; should this simply be documented properly somewhere?

Right now the ATD definition doc just says “Sequence of bytes or characters” …