creationix/nibs

[Proposal] Add back in some of the extended types.

creationix opened this issue · 6 comments

The nibs spec was reduced to the bare minimum to help implementors get started while we figured out which types were useful. This proposal adds some of the simpler types back in:

The proposed new type enum would be:

enum Type {

    // Inline types.
    Integer       = 0,  // big = num
    NegInt        = 1,  // big = -num
    FloatingPoint = 2,  // big = binary encoding of float
    Simple        = 3,  // big = subtype
    Ref           = 4,  // big = index into runtime references table    
    // 5-6 reserved for future inline types.
    // decoders can assume the nibs pair is all that needs skipping

    // Tag Type - This is always followed by another nibs value
    Tag           = 7,  // big = index into runtime custom types table
    // decoders must read the next value recursively when skipping

    // Prefixed length types.
    Bytes         = 8,  // big = len
    String        = 9,  // big = len
    Tuple         = 10, // big = len
    Map           = 11, // big = len
    Array         = 12, // big = len
    // 13-15 reserved for future prefixed length native types
    // decoders can assume they will always have big = len
};
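For reference, here is a minimal Python sketch of decoding a single nibs pair against this enum. It assumes the pair encoding from the existing spec: the top nibble is the type (small), the bottom nibble is the big value when it is 0-11, and bottom-nibble values 12-15 select a trailing u8/u16/u32/u64. The little-endian byte order is an assumption here and should be checked against the spec.

```python
def decode_pair(data: bytes, offset: int = 0):
    """Decode one nibs pair, returning (small, big, bytes_consumed)."""
    head = data[offset]
    small = head >> 4       # top nibble: the type enum above
    big = head & 0x0F       # bottom nibble: inline big value when 0-11
    if big < 12:
        return small, big, 1
    # 12, 13, 14, 15 select a trailing u8, u16, u32, u64 respectively
    size = 1 << (big - 12)
    big = int.from_bytes(data[offset + 1 : offset + 1 + size], "little")
    return small, big, 1 + size
```

Under these assumptions, `decode_pair(bytes([0x4D, 0x2C, 0x01]))` yields `(4, 300, 3)`: a Ref whose index 300 needed the u16 extension.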

Refs

Refs are simply numerical references into some out-of-band table of values. They allow compression when the code reading the nibs structure has context about what the references are.

In the proposed text format these are encoded as &[0-9]+. In the binary format these get a new type field.

In both cases, the data is a single u64 (decimal in text or big half of nibs pair in binary).

Libraries are expected to have an API where clients can register values to indices and when decoding these refs the values are looked up automatically.
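A minimal sketch of that registration API (the `RefTable`, `register`, and `lookup` names are hypothetical, not part of any existing nibs library):

```python
class RefTable:
    """Hypothetical client API for the out-of-band refs table."""

    def __init__(self):
        self._values = []

    def register(self, value) -> int:
        """Register a value and return the ref index an encoder would emit."""
        self._values.append(value)
        return len(self._values) - 1

    def lookup(self, index: int):
        """Resolve a decoded Ref's big value back to the registered value."""
        return self._values[index]


refs = RefTable()
idx = refs.register("content-type")   # encoder would write this as &0
assert refs.lookup(idx) == "content-type"
```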

Tags

Tags are custom userspace types. They are encoded similarly to refs, except they prefix another existing nibs value (including another tag, recursively, if desired).

In the proposed text format, these are encoded as @[0-9]+. In the binary format these get a new type field.

In both cases, the data is a single u64 (decimal in text or big half of nibs pair in binary).

Libraries are expected to have an API where clients can register custom data types as encoder and decoder functions.

For example, a custom type could be a bloom filter built from a tag and a large byte array. The custom decoder could return a map-like interface that answers true or false when indexed by a key, but exposes no keys if iterated and no length if queried.
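A sketch of what such an encoder/decoder registry could look like (the `register_tag` and `decode_tagged` names are made up for illustration):

```python
# Hypothetical registry mapping tag indices to (encode, decode) functions.
custom_types = {}

def register_tag(index, encode, decode):
    custom_types[index] = (encode, decode)

# Example: tag 0 wraps a set, stored on the wire as a sorted tuple.
register_tag(
    0,
    encode=lambda s: tuple(sorted(s)),
    decode=lambda payload: set(payload),
)

def decode_tagged(index, payload):
    """Called by a decoder after reading Tag(index) plus the inner value."""
    _, decode = custom_types[index]
    return decode(payload)
```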

Arrays

An array is a variant of the existing List type, but after the nibs pair and before the list of values there is another nibs pair header and an array of fixed-size offset pointers.

This allows a decoder indexing the Nth item in the array to jump directly to the offset pointer, then directly to the value, giving O(1) lookups instead of the normal O(n) for lists.

The header nibs pair has:

  • small is the byte width of the pointers
  • big is the count of items in the index

There is no point in having an index in the proposed textual representation, but we do want to preserve the type information, so these are encoded as #[...] instead of the normal [] JSON syntax. When an encoder converts these back to nibs, it needs to generate and insert the array index.

In the binary format a new top-level type is allocated next to the other containers.
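With the layout above, the O(1) lookup reduces to one fixed-width read from the pointer array. A sketch, assuming little-endian pointers and offsets relative to the start of the values region (both details are assumptions, not confirmed by the proposal):

```python
def array_index_lookup(index_bytes: bytes, width: int, count: int, n: int) -> int:
    """Return the byte offset of the nth value.

    index_bytes: the flat pointer array following the index header pair,
    whose header gave us width (small) and count (big).
    """
    assert 0 <= n < count, "index out of range"
    start = n * width
    return int.from_bytes(index_bytes[start : start + width], "little")


# Four values with 1-byte pointers at offsets 0, 3, 7, 12:
ptrs = bytes([0, 3, 7, 12])
assert array_index_lookup(ptrs, 1, 4, 2) == 7
```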

jjg commented

One question that comes to mind is if Tag could be used to provide any of these types dynamically (instead of being baked-in), but given their functionality I'm not sure if that's an option.

Yes, Tag can be used to implement any type in userspace, but I prefer Array in core since it's simple and very useful.

I'm considering adding back in the HAMT as well, but not the hashing. The input is just a 64-bit integer, and the full integer is stored in the payload.

edit: nevermind, this is too complex for core for now.

While working through the text format, I think it would make sense to rename list to tuple to match the proposed (...) syntax.

Ok, made some progress on this in the PR. The current changes fall into two main categories:

  • Smaller wire size for typical string heavy JSON documents.

    • New Ref and RefScope types to enable only storing repeated values once.
    • New HexString type to compress storage of hex strings.
    • New StringChain type to allow parts of a string to be stored differently (for example, mixing refs, strings, and hex strings).
  • Indexes for faster lookups in large container values

    • New Array type that gives O(1) indexing into lists using a flat pointer array index.
    • New Trie type that gives O(log n) indexing into maps using a HAMT index.

I decided to not add Tag at this time.

I'm also considering a couple of new types that are a more efficient version of Trie.

The problem with the current Trie design is that you still have to store the full keys in the map portion. There are use cases where this isn't needed and a more compact representation would be preferred. One idea is a new SparseArray type where each array pointer also has an integer index associated with it. These can be hashes of the keys. Finding a value can then be done in O(log n) time using binary search through the array of hashes and pointers. It's not clear how this would map to the text format or client language semantics.
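The O(log n) lookup described above could look like this sketch, with the (hash, pointer) rows already decoded into tuples sorted by hash:

```python
import bisect

def sparse_lookup(rows, key_hash):
    """Binary-search rows of (hash, pointer) pairs sorted by hash.

    Returns the pointer for key_hash, or None if absent. O(log n).
    """
    hashes = [h for h, _ in rows]
    i = bisect.bisect_left(hashes, key_hash)
    if i < len(rows) and rows[i][0] == key_hash:
        return rows[i][1]
    return None


rows = [(0x11, 0), (0x42, 9), (0x9A, 18), (0xFE, 27)]
assert sparse_lookup(rows, 0x9A) == 18
assert sparse_lookup(rows, 0x50) is None
```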

So for the SparseArray type, a value would need extra parameters like Array and Trie do, but we actually need 3 numbers: pointer-width, array-count, and key-width. The pointer-width and array-count work exactly like in the Array type, but next to each fixed-width pointer is a fixed-width integer that is the sparse index for that entry. This index may have a different width than the offsets.

There may be enough room in the 4-bit small number of the nibs pair for both the pointer width and the index width if we encode them differently.

For example, consider a simple case with 8-bit pointers, 64-bit hashes, and 4 entries. The index data would be written in rows 9 bytes wide (H = hash byte, P = pointer byte):

HHHHHHHHP
HHHHHHHHP
HHHHHHHHP
HHHHHHHHP

We could limit widths to powers of 2 and need fewer bits for them. Pointers only need to be 1, 2, 4, or 8 bytes wide. Hashes probably want more range, but we could make do with the same range if needed. If the sparse indexes (aka hashes) had the same range, then we would only need 2 bits for each width.

00 00 // hash: 1 pointer: 1
00 01 // hash: 1 pointer: 2
00 10 // hash: 1 pointer: 4
00 11 // hash: 1 pointer: 8
01 00 // hash: 2 pointer: 1
01 01 // hash: 2 pointer: 2
01 10 // hash: 2 pointer: 4
01 11 // hash: 2 pointer: 8
10 00 // hash: 4 pointer: 1
10 01 // hash: 4 pointer: 2
10 10 // hash: 4 pointer: 4
10 11 // hash: 4 pointer: 8
11 00 // hash: 8 pointer: 1
11 01 // hash: 8 pointer: 2
11 10 // hash: 8 pointer: 4
11 11 // hash: 8 pointer: 8
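A sketch of packing and unpacking those 2+2 bits, where each 2-bit field b maps to a width of 1 << b bytes (matching the table above):

```python
def pack_widths(hash_width: int, pointer_width: int) -> int:
    """Pack two power-of-two byte widths (1, 2, 4, or 8) into 4 bits:
    high 2 bits select the hash width, low 2 bits the pointer width."""
    return (hash_width.bit_length() - 1) << 2 | (pointer_width.bit_length() - 1)

def unpack_widths(small: int) -> tuple:
    """Invert pack_widths: recover (hash_width, pointer_width)."""
    return 1 << (small >> 2), 1 << (small & 0b11)


assert pack_widths(8, 1) == 0b1100   # row "11 00 // hash: 8 pointer: 1"
assert unpack_widths(0b0111) == (2, 8)
```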

We could even bias or customize the hash widths to support a larger range, since 8-bit hashes aren't worth much in practice. Common hash sizes are:

  • 64 bit (xxhash64)
  • 128 bit (MD5, xxhash128)
  • 160 bit (SHA1)
  • 256 bit (SHA256 and lots of modern hashes)