vshymanskyy/muon

Handling of strings with nul bytes in them

timando opened this issue · 8 comments

When I try to round-trip the following JSON, it doesn't work properly:
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]

It works if we use length-prefixed strings instead of NUL-terminated ones.
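A minimal sketch of why the round trip fails with a terminator but works with a length prefix. This is not Muon's actual encoder or wire format: the function names and the 4-byte little-endian length prefix are assumptions made only for illustration.

```python
# Hypothetical helpers, not part of Muon: they only contrast the two framings.

def encode_nul_terminated(s: str) -> bytes:
    # The string bytes are followed by a single NUL terminator.
    return s.encode("utf-8") + b"\x00"

def decode_nul_terminated(buf: bytes) -> str:
    # The decoder stops at the FIRST NUL, so an embedded NUL truncates the string.
    end = buf.index(b"\x00")
    return buf[:end].decode("utf-8")

def encode_length_prefixed(s: str) -> bytes:
    # Assumed framing: 4-byte little-endian byte count, then the raw bytes.
    data = s.encode("utf-8")
    return len(data).to_bytes(4, "little") + data

def decode_length_prefixed(buf: bytes) -> str:
    n = int.from_bytes(buf[:4], "little")
    return buf[4:4 + n].decode("utf-8")

original = "zero\u0000things"
nul_result = decode_nul_terminated(encode_nul_terminated(original))
len_result = decode_length_prefixed(encode_length_prefixed(original))
```

Here `nul_result` comes back as just `"zero"`, while `len_result` equals the original string.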

Currently the encoder doesn't take this into account, but the format itself is capable of representing such values as fixed-length strings. I will implement this soon.

Fixed. This is also related to #3

In this simple example, the resulting Muon file is 48 bytes, compared to 74 bytes of minified JSON, so it is about 35% smaller.

dgl commented

This is interesting to consider for someone who wants to write a high-performance (but safe, validating) encoder. Using NUL as the terminator creates an extra edge case to deal with, so there are three cases: the string is valid UTF-8, the string is valid UTF-8 but contains a NUL, or it's not valid UTF-8.

Using, say, 0xFF as both the tag pad and the string terminator would mean there are only two cases: either the string is valid UTF-8 (the encoder can stream it on the fly and only needs to look at the byte stream once), or it is not valid and the encoder errors out. It does mean strings wouldn't be consistent with typed arrays. (Maybe that is another route to a high-performance encoder: allow chunked strings via the idea implied in #3, which has the benefit of keeping C-string compatibility, which I assume is the point of the non-length encoding?)
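To illustrate the two-case idea, here is a small sketch under stated assumptions. The 0xFF terminator is only the proposal from this comment, not something Muon uses, and the function names are hypothetical. It relies on the fact that the byte 0xFF never occurs in well-formed UTF-8, so validation alone decides whether the input can be written as-is.

```python
TERMINATOR = b"\xff"  # proposed in this comment only; NOT Muon's actual terminator

def encode_ff_terminated(raw: bytes) -> bytes:
    # Case 2: invalid UTF-8 raises UnicodeDecodeError and the encoder errors out.
    raw.decode("utf-8")
    # Case 1: valid UTF-8 (embedded NULs included) is written unchanged.
    return raw + TERMINATOR

def decode_ff_terminated(buf: bytes) -> bytes:
    # 0xFF cannot appear inside valid UTF-8, so the first 0xFF is the terminator.
    return buf[:buf.index(TERMINATOR)]

def utf8_never_contains_ff() -> bool:
    # Spot-check code points at the boundaries of each UTF-8 encoded length.
    samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
    return all(0xFF not in chr(cp).encode("utf-8") for cp in samples)

payload = "multi\u0000\u0000zero\u0000things".encode("utf-8")
round_tripped = decode_ff_terminated(encode_ff_terminated(payload))
```

Python's strict decoder does a separate validation pass here; a real streaming encoder would validate and copy in the same pass.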

One slightly crazy case here is Perl's extended UTF-8, which will actually use 0xFF on the wire, but I'm perfectly fine with that needing to be encoded as binary instead.

Edited to add: The more I think about this, the more I think it's fine as it is (both this and #3). I wasn't thinking about C-string compatibility, and keeping that seems valuable. The length-based encoding (without the NUL termination idea) only has an overhead for longer strings anyway. I also don't think chunked encoding for strings is a good idea, as it loses the nice property that strings currently have of being as-is on the wire.

Potentially related: could we simply chain (concatenate, as cat does) Muon on-the-wire data and feed the result into a Muon reader/parser?

(In the past I had fun thinking about a somewhat similar idea, but without very convincing results 😉)

@dgl Zero termination is used specifically for the convenience of languages that use zero-terminated strings. Overall, Unicode strings containing zeroes are (or should be) very rare. When optimizing Muon for speed, large strings should always be encoded as fixed-size. For small strings, it's easy to check whether they contain a null byte.
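The strategy described above could be sketched roughly as follows. This is an assumption-laden illustration, not the reference encoder: `FIXED_TAG`, `SMALL_STRING_LIMIT`, and the use of unsigned LEB128 for the length are placeholders, so check the Muon specification for the real tag value and length encoding.

```python
FIXED_TAG = b"\x82"        # placeholder tag for a fixed-length string (assumption)
SMALL_STRING_LIMIT = 512   # hypothetical cutoff between "small" and "large" strings

def uleb128(n: int) -> bytes:
    # Unsigned LEB128: 7 bits per byte, high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    # Small strings: the NUL scan is cheap, so use zero termination when it is safe.
    if len(data) < SMALL_STRING_LIMIT and b"\x00" not in data:
        return data + b"\x00"
    # Large strings, or any string with an embedded NUL: tag + length + raw bytes.
    return FIXED_TAG + uleb128(len(data)) + data

plain = encode_string("stuff")
with_nul = encode_string("zero\u0000things")
large = encode_string("a" * 1000)
```

With these placeholders, `plain` stays C-string compatible, while `with_nul` and `large` both take the fixed-length path, and the large string is never scanned for NUL bytes.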