creationix/nibs

[Proposal] Add Text Format based on superset of JSON

creationix opened this issue · 21 comments

Nibs is primarily a binary format to enable fast parsing and random access. But sometimes it's really nice to have a textual way to visualize or specify the data.

This text format is a superset of JSON extended to support all of nibs's types:

  • Integer and NegativeInteger are both stored as JSON decimal integers.

  • FloatingPoint is stored as a decimal number, but always includes the decimal point with at least 1 digit on both sides of it.

  • Bytes is stored as <XXXXXX> where XXXXXX is the raw bytes encoded in hexadecimal (uppercase or lowercase allowed)

  • String is stored as normal JSON string syntax.

  • Array (proposed type) is a JSON array.

  • Map is stored as normal JSON object syntax, except that the restriction that keys can only be strings is lifted and any type can be used as a key.

  • Tuple (aka List) is stored as normal JSON array syntax, but using parentheses ( ... ).

  • False and True are stored as JSON booleans

  • Nil is stored as JSON null

  • NaN, Infinity, -Infinity are stored as nan, inf, and -inf

  • Ref (proposed type) is ampersand + decimal number &[1-9][0-9]*

  • Tag (proposed type) is at symbol + decimal number ![1-9][0-9]* value

Additionally, to make it a good format for writing configuration documents, a couple of quality-of-life enhancements are added.

  • JavaScript style line and block comments are allowed and ignored

  • trailing commas in objects and arrays are tolerated by readers

An example document:

{
  // keys can be integers
  1: "yes",
  [ 1, 2, 3 ]: "How about complex structures for keys?",
  // some string keys don't need quotes
  isCool: true,
  binary: <e9cb1ffede0347ad7b15088dcad361caf5f2487e>,
  content-type: "text/nibs",
}
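
As a rough illustration of the type rules above, here is a minimal JavaScript sketch of an encoder. The names toNibsText and numberToText, and the mapping from JavaScript types (Uint8Array for Bytes, Map for Map, safe integers and BigInt for Integer), are assumptions made for this example and are not part of the proposal. Tuple, Ref, and Tag are left out.

```javascript
// Hypothetical encoder sketch; toNibsText is an illustrative name, not a nibs API.
function toNibsText(value) {
  if (value === null || value === undefined) return "null";
  if (typeof value === "boolean") return value ? "true" : "false";
  if (typeof value === "bigint") return value.toString();
  if (typeof value === "number") return numberToText(value);
  if (typeof value === "string") return JSON.stringify(value);
  if (value instanceof Uint8Array) {
    // Bytes: two lowercase hex digits per byte, wrapped in angle brackets.
    const hex = Array.from(value, (b) => b.toString(16).padStart(2, "0"));
    return "<" + hex.join("") + ">";
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => toNibsText(item)).join(",") + "]";
  }
  if (value instanceof Map) {
    // Map keys may be any type, so they go through the same encoder.
    const entries = Array.from(value, ([k, v]) => toNibsText(k) + ":" + toNibsText(v));
    return "{" + entries.join(",") + "}";
  }
  throw new TypeError("unsupported value: " + typeof value);
}

function numberToText(n) {
  if (Number.isNaN(n)) return "nan";
  if (n === Infinity) return "inf";
  if (n === -Infinity) return "-inf";
  // Assumption: safe integers become Integer, everything else FloatingPoint.
  if (Number.isSafeInteger(n)) return String(n);
  // Floats always carry a decimal point with a digit on both sides.
  const [mantissa, exponent] = String(n).split("e");
  const withPoint = mantissa.includes(".") ? mantissa : mantissa + ".0";
  return exponent === undefined ? withPoint : withPoint + "e" + exponent;
}

const doc = new Map([
  [1, "yes"],
  [[1, 2, 3], "complex key"],
  ["binary", new Uint8Array([0xe9, 0xcb])],
]);
console.log(toNibsText(doc));
// prints {1:"yes",[1,2,3]:"complex key","binary":<e9cb>}
```

JavaScript cannot tell 1.0 from 1, so a real library would probably need an explicit float wrapper; this sketch sidesteps that.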
jjg commented

This would make it a lot easier to encourage adoption (especially around web tech). I'd like to explore ways to make the JSON output survive digestion by nibs-unaware JSON parsers but that could come in the form of additional post-processing or just optional generation flags.

So there could be a pure JSON output mode with tradeoffs in its design between preserving all semantics or producing more vanilla/compact JSON.

  • integers just work
  • booleans just work
  • null just works
  • floats just work (assuming no NaN/Infinity/-Infinity), we can keep the rule to always encode the decimal point.
  • Strings just work as long as they are always quoted
  • Binary has options:
    • store as hex string (note it can't be automatically converted back to binary)
    • store as hex string, but wrapped in a tagging object {"__BINARY__":"XXXX"}
  • List just works as JSON array
  • Map just works as JSON object if all keys are strings, otherwise we have options:
    • convert all keys to strings
    • encode as array of arrays (key/value pairs) (with optional tagging wrapping)
    • encode as array of alternating key/value pairs (with optional tagging wrapping)
  • Array can just encode as JSON array
    • it can be wrapped with a tagging object to signify it needs to be an array type

If someone were to use the JSON tagging system they would need a way to also escape valid data that happens to collide with the tags.

For now I think the encoder should default to the lossy options when encoding as JSON to keep things simple.

A library could have a .toString() method that turns a nibs value into a nibs-text encoding. But it could also have a .toJson() method that emits valid JSON using the lossy methods.
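
A minimal sketch of what such a lossy toJson could look like in JavaScript. Everything here is hypothetical: the function names, the tagBinary option, and the choice to turn non-string keys into their JSON text are assumptions for illustration. It does not handle NaN/Infinity (JSON.stringify turns them into null), BigInt, or the always-include-a-decimal-point rule for floats.

```javascript
// Hypothetical sketch of a lossy JSON export; not an actual nibs API.
function toLossy(value, options = {}) {
  if (value instanceof Uint8Array) {
    const hex = Array.from(value, (b) => b.toString(16).padStart(2, "0")).join("");
    // Optional tagging object so a nibs-aware reader can recover the type.
    return options.tagBinary ? { __BINARY__: hex } : hex;
  }
  if (Array.isArray(value)) {
    return value.map((item) => toLossy(item, options));
  }
  if (value instanceof Map) {
    // Null-prototype object so a key like "__proto__" stays plain data.
    const obj = Object.create(null);
    for (const [key, val] of value) {
      // Assumption: non-string keys are replaced by their JSON text.
      const name = typeof key === "string" ? key : JSON.stringify(toLossy(key, options));
      obj[name] = toLossy(val, options);
    }
    return obj;
  }
  return value;
}

function toJson(value, options) {
  return JSON.stringify(toLossy(value, options));
}

const doc = new Map([
  [1, "yes"],
  ["bin", new Uint8Array([0xde, 0xad, 0xbe, 0xef])],
]);
console.log(toJson(doc));
// prints {"1":"yes","bin":"deadbeef"}
```

Note that the keys 1 and "1" collide after conversion, which is one of the ways this mode loses information.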

Another flag that might be useful is ASCII mode for the text encoding. I recently learned that if you give AWS S3 meta fields non-ASCII data, it will be encoded using RFC 2047. This is best avoided, since JSON already has a method for escaping Unicode characters.

In this proposed ASCII-only encoding, any non-ASCII characters will be encoded as \uXXXX in JSON. Any code point too large to fit in 16 bits can be encoded using a surrogate pair.

But by default the nibs text format should leave unicode characters in native utf-8 encoding and not use JSON escaping.
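
A small sketch of the ASCII mode for strings, assuming JavaScript. The name toAsciiJson is made up for this example. Because JavaScript strings are sequences of UTF-16 code units, escaping each non-ASCII code unit separately yields surrogate pairs for characters above U+FFFF without extra work.

```javascript
// Hypothetical helper: JSON-encode a string using only ASCII characters.
function toAsciiJson(str) {
  // JSON.stringify handles quotes, backslashes, and control characters.
  // The regex (no "u" flag) then matches every remaining non-ASCII UTF-16
  // code unit, so astral characters become two \uXXXX escapes.
  return JSON.stringify(str).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

console.log(toAsciiJson("é"));  // prints "\u00e9"
console.log(toAsciiJson("😀")); // prints "\ud83d\ude00"
```

Any standard JSON parser reads the result back to the original string, so the flag only changes the bytes on the wire.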

What's the difference between binary data and strings?

jjg commented

Another flag that might be useful is ASCII mode for the text encoding.

I hadn't considered this but I completely agree. I have selfish reasons for wanting this but from an interop perspective I think it's very practical and I also think it aligns with the purpose of the text format itself.

jjg commented

A library could have a .toString() method that turns a nibs value into a nibs-text encoding. But it could also have a .toJson() method that emits valid JSON using the lossy methods.

I think this is a good way to "nudge" consumers toward using the higher-fidelity text format without surprising anyone looking for parsable JSON.

What's the difference between binary data and strings?

In the binary encoding, the only difference is a different type tag. In the text encoding they are very different. This type tag is very useful for languages that have different types for binary and strings since most languages have some sort of unicode capability in strings.

For example, in JavaScript, strings can be normal String values, but binary can be represented as ArrayBuffer, Uint8Array, or Node.js Buffer, depending on what's common to that library's usage.

Even in Lua, where strings are technically 8-bit binary data, it's a good convention to only use strings for textual data and always encode it as UTF-8 in the binary Lua string. Then binary data can be represented using a LuaJIT cdata like uint8_t[?], which is essentially a fixed byte array.

In both JS and Lua, strings are interned and immutable values, but binary is non-interned and mutable. They are very different types from the language's point of view.

Technically you could store arbitrary binary data in JSON strings by simply encoding the 8-bit values as matching unicode code points. In the early node.js days we called this hack "raw" encoding. You would just have to know that a unicode string is actually binary data and do the conversion when you need the raw bytes.

These extended values from 128-255 can be encoded as normal UTF-8 in the JSON string or if you're encoding in ASCII mode they can use \uXXXX encoding.

For example, the nibs-text value <deadbeef> can be encoded as a "raw" string, which can be represented as either ASCII or UTF-8 JSON.

> rawAscii = '"\\u00de\\u00ad\\u00be\\u00ef"'
'"\\u00de\\u00ad\\u00be\\u00ef"'
> rawAscii.length
26
> a = JSON.parse(rawAscii)
'Þ­¾ï'
> a.length
4
> a.charCodeAt(0).toString(16)
'de'
> a.charCodeAt(1).toString(16)
'ad'
> a.charCodeAt(2).toString(16)
'be'
> a.charCodeAt(3).toString(16)
'ef'
> Buffer.from(a) // While its length is 4, it's actually 8 bytes when encoded as UTF-8
<Buffer c3 9e c2 ad c2 be c3 af>
> rawUtf8 = JSON.stringify(a) // by default, JSON.stringify leaves non-ASCII characters unescaped
'"Þ­¾ï"'
> rawUtf8.length
6
> Buffer.byteLength(rawUtf8)
10

Note that the nibs-text encoding for binary is ASCII safe so it only costs 2x to encode. The ASCII safe version of JSON strings if using raw encoding costs 6x for the \uXXXX format. The utf-8 JSON encoding is slightly better than hex since lower values only cost one byte and higher values cost two bytes in utf-8, but the output is very ugly and dangerous to copy-paste.

Binary could also be encoded in JSON as hex strings or base64 strings. In all cases the consumer would be missing the type tag and would need to know if it's supposed to be interpreted as binary which is why the nibs-text format is preferred when possible.
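
For reference, a sketch of the round trips mentioned above in Node.js. It relies on Buffer's built-in "latin1", "hex", and "base64" encodings; the helper names bytesToRawJson and rawJsonToBytes are invented for this example. In every case the decoder has to know out of band that the string holds binary data.

```javascript
// Hypothetical helpers for the "raw" hack: one code point (0-255) per byte.
function bytesToRawJson(bytes) {
  return JSON.stringify(Buffer.from(bytes).toString("latin1"));
}

function rawJsonToBytes(json) {
  // Nothing in the JSON says this is binary; the caller must already know.
  return Buffer.from(JSON.parse(json), "latin1");
}

const bytes = Buffer.from("deadbeef", "hex");
const rawJson = bytesToRawJson(bytes);
console.log(rawJsonToBytes(rawJson).equals(bytes)); // prints true

// The hex and base64 alternatives are plain ASCII JSON strings.
console.log(JSON.stringify(bytes.toString("hex")));    // prints "deadbeef"
console.log(JSON.stringify(bytes.toString("base64"))); // prints "3q2+7w=="
```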

Extended string encoding as a railroad diagram:
[image]

Float is the same as JSON, except the fractional part is not optional:
[image]

Integer is any integer in decimal, hex, octal, or binary:
[image]

Binary is simple:
[image]

Here is the full proposed syntax as a railroad diagram. I'm not happy about the amount of duplication between integer and float. This will require some state and/or lookahead in parsers.

Also note the change in list vs array, where the JSON array syntax maps to nibs array and nibs list is renamed tuple since it's using parentheses.

[image: value]
[image: whitespace]

Hmm, this is still not good enough. The strings without quotes conflict with the keyword-based values true, false, null, nan, and inf. Maybe they can use a $ prefix or just be dropped?

Also I forgot nan, inf, and -inf in the diagram.

This is what it looks like with the $ added in (and the missing floats added). At this point I don't see enough value in strings without quotes and should probably remove them. The other option is that the spec could be like JavaScript and allow any string that's not a keyword?

[image: value]

String without quotes is a pita, half the yaml has quotes anyway

Yeah, let's just remove it. Less is more.

I also removed the hex/binary/octal encoding and was able to merge the two number types.

[image: combined]
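
A sketch of how a reader might split the merged number syntax back into integer and float. It assumes, and the thread does not confirm this, that a token with a fractional part is a float, a token without one is an integer, and an exponent is only allowed after a fraction. classifyNumber is a made-up name.

```javascript
// Assumed grammar: JSON number syntax where the exponent requires a fraction.
const NUMBER = /^-?(?:0|[1-9][0-9]*)(\.[0-9]+(?:[eE][+-]?[0-9]+)?)?$/;

// Hypothetical classifier: returns "integer", "float", or null for a token.
function classifyNumber(token) {
  if (token === "nan" || token === "inf" || token === "-inf") return "float";
  const match = NUMBER.exec(token);
  if (match === null) return null;
  // Capture group 1 is the fractional part (plus optional exponent).
  return match[1] === undefined ? "integer" : "float";
}

console.log(classifyNumber("42"));     // prints integer
console.log(classifyNumber("1.5e-3")); // prints float
console.log(classifyNumber("1."));     // prints null
```

With a single pattern like this the parser no longer needs separate integer and float states.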

Proposed spec in PR #8

We should also have a text format for disassembled nibs to enable tools like this https://geraintluff.github.io/cbor-debug/

It should be possible to convert between the 3 formats.

  • text (JSON-like, best for humans, but doesn't encode indices and so loses some information)
  • binary (best for computers)
  • assembly (disassembled binary, preserves exact indices and types, and displays them in human-readable form)