[Proposal] Add back in some of the extended types.
creationix opened this issue · 6 comments
The nibs spec was reduced to the bare minimum to help implementors get started while we figured out which types were useful. This proposal adds some of the simpler types back in.
The proposed new type enum would be:
```
enum Type {
  // Inline types.
  Integer = 0,       // big = num
  NegInt = 1,        // big = -num
  FloatingPoint = 2, // big = binary encoding of float
  Simple = 3,        // big = subtype
  Ref = 4,           // big = index into runtime references table
  // 5-6 reserved for future inline types.
  // Decoders can assume the nibs pair is all that needs skipping.

  // Tag type - this is always followed by another nibs value.
  Tag = 7,           // big = index into runtime custom types table
  // Decoders must read the next value recursively when skipping.

  // Prefixed-length types.
  Bytes = 8,         // big = len
  String = 9,        // big = len
  Tuple = 10,        // big = len
  Map = 11,          // big = len
  Array = 12,        // big = len
  // 13-15 reserved for future prefixed-length native types.
  // Decoders can assume they will always have big = len.
};
```
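To make the layout concrete, here is a Python sketch mirroring the enum plus a hypothetical pair encoder. The extension-byte scheme (small values 12-15 signaling 1/2/4/8 trailing little-endian bytes) is my assumption for illustration, not something this proposal specifies:

```python
from enum import IntEnum

# Mirror of the proposed type enum (values taken from the proposal above).
class Type(IntEnum):
    INTEGER = 0
    NEG_INT = 1
    FLOAT = 2
    SIMPLE = 3
    REF = 4
    TAG = 7
    BYTES = 8
    STRING = 9
    TUPLE = 10
    MAP = 11
    ARRAY = 12

def encode_pair(typ: int, big: int) -> bytes:
    """Hypothetical nibs-pair encoder: high nibble = type, low nibble
    either holds `big` directly or signals 1/2/4/8 extension bytes.
    The exact extension scheme here is an assumption, not the spec."""
    if big < 12:
        return bytes([(typ << 4) | big])
    for code, width in ((12, 1), (13, 2), (14, 4), (15, 8)):
        if big < 1 << (8 * width):
            return bytes([(typ << 4) | code]) + big.to_bytes(width, "little")
    raise ValueError("big value out of u64 range")
```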
Refs
Refs are simply numerical references into some out-of-band table of values. This allows compression when the code reading the nibs structure has context about what the references are.
In the proposed text format these are encoded as `&[0-9]+`. In the binary format they get a new type field. In both cases, the data is a single u64 (decimal in text, or the big half of the nibs pair in binary).
Libraries are expected to have an API where clients can register values at indices; when decoding these refs, the values are looked up automatically.
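A minimal sketch of what that client API could look like (the `RefTable` name and methods are hypothetical, not part of any existing nibs library):

```python
# Hypothetical client API for refs: the application registers shared
# values at indices out of band; decoding `&N` just looks up slot N.
class RefTable:
    def __init__(self):
        self._slots = {}

    def register(self, index: int, value):
        self._slots[index] = value

    def resolve(self, index: int):
        try:
            return self._slots[index]
        except KeyError:
            raise KeyError(f"no value registered for ref &{index}")

refs = RefTable()
refs.register(0, "content-type")
refs.register(1, "application/json")
# A decoder hitting the binary equivalent of `&1` would return:
assert refs.resolve(1) == "application/json"
```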
Tags
Tags are custom userspace types. They are encoded similarly to refs, except they prefix another existing nibs value (including another tag, recursively, if desired).
In the proposed text format, these are encoded as `@[0-9]+`. In the binary format they get a new type field. In both cases, the data is a single u64 (decimal in text, or the big half of the nibs pair in binary).
Libraries are expected to have an API where clients can register custom data types as encoder and decoder functions.
For example, a custom type could be a bloom filter using a tag and a large byte array. The custom decoder can return a map-like interface that returns true or false when indexed by a key, but reports no keys if iterated and no length if queried.
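A sketch of such a registry, assuming an API where clients map tag indices to encode/decode functions (the `register_tag` name and the rational-number example are illustrative only):

```python
# Hypothetical tag registry: clients register an encoder and a decoder
# per tag index; decoding `@N <value>` runs the decoder for N on the
# inner value. Tag 0 here is an invented example: a rational number
# carried as a (numerator, denominator) tuple.
from fractions import Fraction

encoders = {}
decoders = {}

def register_tag(index, encode, decode):
    encoders[index] = encode
    decoders[index] = decode

register_tag(0,
             encode=lambda f: (f.numerator, f.denominator),
             decode=lambda pair: Fraction(*pair))

inner = encoders[0](Fraction(3, 4))  # what an encoder would emit after `@0`
assert inner == (3, 4)
assert decoders[0](inner) == Fraction(3, 4)
```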
Arrays
An array is a variant of the existing `List` type, but after the nibs pair and before the list of values there is another nibs pair header and an array of fixed-width offset pointers.
This allows a decoder indexing the Nth item in the array to jump directly to the offset pointer and then directly to the value, giving O(1) lookups instead of the normal O(n) for lists.
The header nibs pair has:
- `small` is the byte width of the pointers
- `big` is the count of items in the index
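The O(1) jump could be sketched as follows; it assumes the index is `big` fixed-width little-endian offsets measured from the start of the values region (both the base and the endianness are assumptions, not spec):

```python
# Sketch of the O(1) array lookup described above: read the nth pointer
# from the flat index, then seek straight to that offset in the values.
def array_get_offset(index_bytes: bytes, width: int, n: int) -> int:
    """Read the nth fixed-width offset pointer from the flat index."""
    start = n * width
    return int.from_bytes(index_bytes[start:start + width], "little")

# 4 entries with 2-byte pointers; values start at offsets 0, 5, 14, 20.
index = b"".join(o.to_bytes(2, "little") for o in (0, 5, 14, 20))
assert array_get_offset(index, 2, 2) == 14  # jump straight to item 2
```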
There is no point in having an index in the proposed textual representation, but we do want to preserve the type information, so these are encoded as `#[...]` instead of the normal `[]` JSON syntax. When an encoder converts these back to nibs, it needs to generate and insert the array index.
In the binary format a new top-level type is allocated next to the other containers.
One question that comes to mind is whether `Tag` could be used to provide any of these types dynamically (instead of being baked in), but given their functionality I'm not sure if that's an option.
Yes, tag can be used to implement any type in userspace, but I prefer `Array` in core since it's simple and very useful.
I'm considering adding back in the HAMT as well, but not the hashing. The input is just a 64-bit integer, and the full integer is stored in the payload.
edit: nevermind, this is too complex for core for now.
While working through the text format, I think it would make sense to rename `list` to `tuple` to match the proposed `(...)` syntax.
Ok, made some progress on this in the PR. The current changes fall into two main categories:

- Smaller wire size for typical string-heavy JSON documents:
  - New `Ref` and `RefScope` types to enable only storing repeated values once.
  - New `HexString` type to compress storage of hex strings.
  - New `StringChain` type to allow parts of a string to be stored differently (for example mixing refs, strings, and hexstrings).
- Indexes for faster lookups in large container values:
  - New `Array` type that gives O(1) indexing into lists using a flat pointer array index.
  - New `Trie` type that gives O(log n) indexing into maps using a HAMT index.
I decided to not add `Tag` at this time.
I'm also considering a couple of new types that are a more efficient version of `Trie`. The problem with the current `Trie` design is that you still have to store the full keys in the map portion. There are some use cases where this isn't needed and a more compact representation would be preferred. One idea is a new `SparseArray` type where each array pointer also has an integer index associated with it. These can be hashes of the keys. Finding a value can be done in O(log n) time using binary search through the array of hashes and pointers. It's not clear how this would map to the text format or client language semantics.
So for the `SparseArray` type, a value would need extra parameters like `Array` and `Trie`, but we actually need 3 numbers: `pointer-width`, `array-count`, and `key-width`. The `pointer-width` and `array-count` work exactly like the `Array` type, but next to each fixed-width pointer is a fixed-width integer that is the sparse index for that entry. This is a different width than the offsets.
There may be enough room in the 4-bit small number of the nibs pair for both the pointer width and the index width if we encode them differently.
For example, consider a simple case with 8-bit pointers, 64-bit hashes, and 4 entries. The index data would be written in rows 9 bytes wide:
```
HHHHHHHHP
HHHHHHHHP
HHHHHHHHP
HHHHHHHHP
```
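The binary-search lookup over such an index could be sketched as follows; the row layout matches the 9-byte example above (8-byte hash, then 1-byte pointer), but both widths, the sort order, and little-endianness are assumptions:

```python
import bisect

# Sketch of the SparseArray lookup: rows of (key hash, pointer) sorted
# by hash, searched with binary search for O(log n) access.
def sparse_lookup(rows: bytes, hash_width: int, ptr_width: int, key_hash: int):
    row = hash_width + ptr_width
    count = len(rows) // row
    hashes = [int.from_bytes(rows[i * row:i * row + hash_width], "little")
              for i in range(count)]
    i = bisect.bisect_left(hashes, key_hash)  # binary search over hashes
    if i < count and hashes[i] == key_hash:
        start = i * row + hash_width
        return int.from_bytes(rows[start:start + ptr_width], "little")
    return None  # no entry with that hash

# 3 entries: (hash, pointer) pairs stored as 8-byte hash + 1-byte pointer.
rows = b"".join(h.to_bytes(8, "little") + p.to_bytes(1, "little")
                for h, p in [(10, 3), (42, 7), (99, 1)])
assert sparse_lookup(rows, 8, 1, 42) == 7
assert sparse_lookup(rows, 8, 1, 50) is None
```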
We could limit widths to powers of 2 and need fewer bits for them. Pointers only need to be 1, 2, 4, or 8 bytes wide. Hashes probably want more range, but we could make do with the same range if needed. If the sparse indexes (aka hashes) had the same range, then we would only need 2 bits for each:
```
00 00 // hash: 1 pointer: 1
00 01 // hash: 1 pointer: 2
00 10 // hash: 1 pointer: 4
00 11 // hash: 1 pointer: 8
01 00 // hash: 2 pointer: 1
01 01 // hash: 2 pointer: 2
01 10 // hash: 2 pointer: 4
01 11 // hash: 2 pointer: 8
10 00 // hash: 4 pointer: 1
10 01 // hash: 4 pointer: 2
10 10 // hash: 4 pointer: 4
10 11 // hash: 4 pointer: 8
11 00 // hash: 8 pointer: 1
11 01 // hash: 8 pointer: 2
11 10 // hash: 8 pointer: 4
11 11 // hash: 8 pointer: 8
```
We could even bias or customize the hash widths to support a larger range, since 8-bit hashes aren't worth much in practice. Common hash sizes are:
- 64 bit (xxhash64)
- 128 bit (MD5, xxhash128)
- 160 bit (SHA1)
- 256 bit (SHA256 and lots of modern hashes)