getty-zig/json

Slice visitor's `visitString` method doesn't handle sentinel-terminated slices

Closed this issue · 3 comments

Description

Deserializing into a sentinel-terminated slice from a string does not work (a compiler error is raised).

The problem is because the slice visitor's visitString method always returns a non-terminated u8 slice (e.g., []u8 or []const u8, I forget). Those slices can't cast into a a sentinel-terminated one, hence the error.

How to Reproduce the Bug

const std = @import("std");
const json = @import("json");

pub fn main() !void {
    _ = try json.fromSlice(std.heap.page_allocator, [:0]u8, "\"foobar\"");
}

Additional Context

No response

Namek commented

So the case is similar to https://github.com/ziglang/zig/blob/master/lib/std/json.zig#L1689-L1693 however getty splits that code into 2 domains:

  1. Deserializer.deserializeString() which PRE-allocates memory, doesn't know (or does it?) the type of slice, and calls visitor.visitString()
  2. visitor.visitString() which knows the type (Value const field) and takes pre-allocated memory

The issue is that the 2nd would lack 1 byte for sentinel when a sentinel slice is returned conditionally.

I have not found a way to get info about sentinel in the 1st layer. If it was easily possible then it could conditionally call either allocator.dupe() or allocator.dupeZ().

One solution might be to delegate duplicataction and/or unescaping string inside of the Visitor instead of its caller.

Okay, I've decided to try shifting the allocation responsibility to the visitor. That would:

  1. Let us fix this problem since the visitor would be able to tell whether the final type is sentinel-terminated and thus allocate appropriately.
  2. Make things a bit simpler for deserializer methods (I think). They wouldn't need to allocate anything. But if they do (e.g., intermediate values), just always deallocate it before returning.

Now, Getty actually worked like that before. But I switched from it for reasons I can't seem to remember... I think it had to do with deserializing aggregate types (e.g., sequences, maps). However, I can't think of any reasons why this shouldn't work now though, so I'll give it a try and see what happens.

edit: I remember one reason: performance. The visitor has to copy the string to its allocated buffer, even though the string might already be allocated (say, for strings that needs escaping). Moving escaping into the visitor wouldn't work though, since they're supposed to be data format-agnostic and data formats each have their own escaping rules.

edit: I remember one reason: performance.

We can avoid this pretty easily with a deserializer-defined DB, so now I really can't think of any reason to not use this new allocation model. I really do need to document it somewhere though...