[Proposal] Add Text Format based on superset of JSON
creationix opened this issue · 21 comments
Nibs is primarily a binary format to enable fast parsing and random access. But sometimes it's really nice to have a textual way to visualize or specify the data.
This text format is a superset of JSON extended to support all of nibs's types:
- `Integer` and `NegativeInteger` are both stored as JSON decimal integers.
- `FloatingPoint` is stored as a decimal, but always includes the decimal point and at least one digit on both sides of it.
- `Bytes` is stored as `<XXXXXX>`, where `XXXXXX` is the raw bytes encoded in hexadecimal (uppercase or lowercase allowed).
- `String` is stored as normal JSON string syntax.
- `Array` (proposed type) is a JSON array.
- `Map` is stored as normal JSON object syntax, except that the restriction that keys can only be strings is lifted and any type can be used as a key.
- `Tuple` (aka `List`) is stored as normal JSON array syntax, but using parentheses `(` ... `)`.
- `False` and `True` are stored as JSON booleans.
- `Nil` is stored as JSON null.
- `NaN`, `Infinity`, `-Infinity` are stored as `nan`, `inf`, and `-inf`.
- `Ref` (proposed type) is ampersand + decimal number `&[1-9][0-9]*`.
- `Tag` (proposed type) is at symbol + decimal number `![1-9][0-9]* value`.

Additionally, to make it a good format for writing configuration documents, a couple of quality-of-life enhancements are added:

- JavaScript-style line and block comments are allowed and ignored.
- Trailing commas in objects and arrays are tolerated by readers.
An example document:
{
// keys can be integers
1: "yes",
[ 1, 2, 3 ]: "How about complex structures for keys?",
// some string keys don't need quotes
isCool: true,
binary: <e9cb1ffede0347ad7b15088dcad361caf5f2487e>,
content-type: "text/nibs",
}
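To make the type mapping above concrete, here is a rough sketch of an encoder. The function name `toNibsText` and the JavaScript-side mapping (bigint for `Integer`, number for `FloatingPoint`, `Uint8Array` for `Bytes`, `Map` for `Map`) are assumptions made for illustration, not part of the proposal. It always quotes string keys, skips `Tuple`, `Ref`, and `Tag`, and leaves exponent notation exactly as JavaScript prints it.

```javascript
// Hypothetical helper: encode a JavaScript value as nibs-text.
// Assumed mapping: bigint -> Integer, number -> FloatingPoint,
// Uint8Array -> Bytes, Array -> Array, Map -> Map (any key type).
function toNibsText(value) {
  if (value === null || value === undefined) return "null";
  if (typeof value === "boolean") return value ? "true" : "false";
  if (typeof value === "bigint") return value.toString();
  if (typeof value === "number") {
    if (Number.isNaN(value)) return "nan";
    if (value === Infinity) return "inf";
    if (value === -Infinity) return "-inf";
    const text = String(value);
    // Whole numbers get ".0" so they still read as floats.
    return /^-?\d+$/.test(text) ? text + ".0" : text;
  }
  if (typeof value === "string") return JSON.stringify(value);
  if (value instanceof Uint8Array) {
    const hex = Array.from(value, (b) => b.toString(16).padStart(2, "0"));
    return "<" + hex.join("") + ">";
  }
  if (Array.isArray(value)) {
    return "[" + value.map(toNibsText).join(",") + "]";
  }
  if (value instanceof Map) {
    const entries = Array.from(
      value,
      ([k, v]) => toNibsText(k) + ":" + toNibsText(v)
    );
    return "{" + entries.join(",") + "}";
  }
  throw new TypeError("unsupported value in this sketch");
}

const doc = new Map([
  [1n, "yes"],
  [[1n, 2n, 3n], "complex key"],
  ["binary", new Uint8Array([0xde, 0xad])],
]);
console.log(toNibsText(doc));
```

Using bigint for integers is only a way to keep `1` and `1.0` distinct in JavaScript; a real library might pick a different convention.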
This would make it a lot easier to encourage adoption (especially around web tech). I'd like to explore ways to make the JSON output survive digestion by nibs-unaware JSON parsers but that could come in the form of additional post-processing or just optional generation flags.
So there could be a pure JSON output mode with tradeoffs in its design between preserving all semantics or producing more vanilla/compact JSON.
- integers just work
- booleans just work
- null just works
- floats just work (assuming no NaN/Infinity/-Infinity), we can keep the rule to always encode the decimal point.
- Strings just work as long as they are always quoted
- Binary has options:
  - store as hex string (note it can't be automatically converted back to binary)
  - store as hex string, but wrapped in a tagging object `{"__BINARY__":"XXXX"}`
- List just works as JSON array
- Map just works as JSON object if all keys are strings, otherwise we have options:
  - convert all keys to strings
  - encode as array of arrays (key/value pairs) (with optional tagging wrapping)
  - encode as array of alternating key/value pairs (with optional tagging wrapping)
- Array can just encode as JSON array
  - it can be wrapped with a tagging object to signify it needs to be an array type
If someone were to use the JSON tagging system they would need a way to also escape valid data that happens to collide with the tags.
For now I think the encoder should default to the lossy options when encoding as JSON to keep things simple.
A library could have a `.toString()` method that turns a nibs value into a nibs-text encoding. But it could also have a `.toJson()` method that emits valid JSON using the lossy methods.
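As a rough illustration of the lossy defaults, the sketch below picks "hex string" for binary and "convert all keys to strings" for maps. The names `toLossyJsonValue` and `toJson`, and the choice to stringify non-string keys as their JSON text, are assumptions for illustration only. It also does not keep the "always encode the decimal point" rule, since `JSON.stringify` prints `1.0` as `1`.

```javascript
// Hypothetical lossy conversion from a decoded nibs value to plain JSON.
// Assumes maps are JavaScript Map objects and bytes are Uint8Array.
function toLossyJsonValue(value) {
  // Integers above 2^53 lose precision here, which is part of being lossy.
  if (typeof value === "bigint") return Number(value);
  if (value instanceof Uint8Array) {
    // Binary becomes a bare hex string; the type tag is lost.
    return Array.from(value, (b) => b.toString(16).padStart(2, "0")).join("");
  }
  if (Array.isArray(value)) return value.map(toLossyJsonValue);
  if (value instanceof Map) {
    const out = {};
    for (const [k, v] of value) {
      // Assumption: non-string keys are replaced by their JSON text.
      const key =
        typeof k === "string" ? k : JSON.stringify(toLossyJsonValue(k));
      out[key] = toLossyJsonValue(v);
    }
    return out;
  }
  return value; // strings, booleans, null, and finite numbers pass through
}

function toJson(value) {
  // JSON.stringify turns NaN and Infinity into null on its own.
  return JSON.stringify(toLossyJsonValue(value));
}

console.log(toJson(new Map([[[1n, 2n], "k"]])));
```

A consumer reading this output cannot tell the hex string from a normal string or the stringified key from a real string key, which is exactly the information the tagging options would preserve.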
Another flag that might be useful is ASCII mode for the text encoding. I recently learned that if you give AWS S3 meta fields non-ASCII data, it will be encoded using RFC 2047. This is best avoided since JSON already has a method for escaping Unicode characters.
In this proposed ASCII-only encoding, any non-ASCII characters will be encoded as `\uXXXX` in JSON. Any character higher than fits in the 16-bit index can be encoded using surrogate pairs.
But by default the nibs text format should leave Unicode characters in native UTF-8 encoding and not use JSON escaping.
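A minimal sketch of what ASCII mode could look like for strings, assuming it is applied as a post-processing step on top of `JSON.stringify`. The function name `toAsciiJsonString` is hypothetical.

```javascript
// Hypothetical helper: JSON-encode a string using only ASCII characters.
function toAsciiJsonString(text) {
  // JSON.stringify already handles quotes, backslashes, and control characters.
  const json = JSON.stringify(text);
  // Without the `u` flag the regex matches UTF-16 code units, so a character
  // above U+FFFF is escaped as its two surrogate halves.
  return json.replace(
    /[\u0080-\uffff]/g,
    (unit) => "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

console.log(toAsciiJsonString("caf\u00e9 \u{1F600}"));
```

The output is still valid JSON, so any standard parser turns it back into the original string.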
What's the difference between binary data and strings?
Another flag that might be useful is ASCII mode for the text encoding.
I hadn't considered this but I completely agree. I have selfish reasons for wanting this but from an interop perspective I think it's very practical and I also think it aligns with the purpose of the text format itself.
A library could have a `.toString()` method that turns a nibs value into a nibs-text encoding. But it could also have a `.toJson()` method that emits valid JSON using the lossy methods.
I think this is a good way to "nudge" consumers toward using the higher-fidelity text format without surprising anyone looking for parsable JSON.
What's the difference between binary data and strings?
In the binary encoding, the only difference is a different type tag. In the text encoding they are very different. This type tag is very useful for languages that have different types for binary and strings since most languages have some sort of unicode capability in strings.
For example, in JavaScript, strings can be normal `String` values, but binary can be represented as `ArrayBuffer` or `Uint8Array` or node `Buffer` depending on what's common to that library's usage.
Even in Lua, where strings are technically 8-bit binary data, it's a good convention to only use strings for textual data and always encode it as UTF-8 in the binary Lua `string`. Then binary data can be represented using a LuaJIT cdata like `uint8_t[?]`, which is essentially a fixed byte array.
In both JS and Lua, strings are interned and immutable values, but binary is non-interned and mutable. They are very different types from the language's point of view.
Technically you could store arbitrary binary data in JSON strings by simply encoding the 8-bit values as matching unicode code points. In the early node.js days we called this hack "raw" encoding. You would just have to know that a unicode string is actually binary data and do the conversion when you need the raw bytes.
These extended values from 128-255 can be encoded as normal UTF-8 in the JSON string or, if you're encoding in ASCII mode, they can use `\uXXXX` encoding.
For example, the nibs-text value `<deadbeef>` can be encoded as a "raw" string, which can be represented as either ASCII or UTF-8 JSON.
> rawAscii = '"\\u00de\\u00ad\\u00be\\u00ef"'
'"\\u00de\\u00ad\\u00be\\u00ef"'
> rawAscii.length
26
> a = JSON.parse(rawAscii)
'Þ¾ï'
> a.length
4
> a.charCodeAt(0).toString(16)
'de'
> a.charCodeAt(1).toString(16)
'ad'
> a.charCodeAt(2).toString(16)
'be'
> a.charCodeAt(3).toString(16)
'ef'
> Buffer.from(a) // While its length is 4, it's actually 8 bytes when encoded as UTF-8
<Buffer c3 9e c2 ad c2 be c3 af>
> rawUtf8 = JSON.stringify(a) // by default, JSON.stringify uses utf8-encoding
'"Þ¾ï"'
> rawUtf8.length
6
> Buffer.byteLength(rawUtf8)
10
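The conversion in and out of "raw" strings can be sketched as a pair of helpers. The names `bytesToRaw` and `rawToBytes` are made up for illustration; they map each byte to the code point with the same value, and back.

```javascript
// Hypothetical helpers for the "raw" string hack.
function bytesToRaw(bytes) {
  let raw = "";
  for (const b of bytes) raw += String.fromCharCode(b);
  return raw;
}

function rawToBytes(raw) {
  const bytes = new Uint8Array(raw.length);
  for (let i = 0; i < raw.length; i++) {
    const code = raw.charCodeAt(i);
    // Anything above 255 means this was not a raw string to begin with.
    if (code > 0xff) throw new RangeError("not a raw string at index " + i);
    bytes[i] = code;
  }
  return bytes;
}

const raw = bytesToRaw(new Uint8Array([0xde, 0xad, 0xbe, 0xef]));
console.log(raw.length, rawToBytes(raw));
```

Nothing in the string itself says it holds bytes, so the caller still has to know when to call `rawToBytes`.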
Note that the nibs-text encoding for binary is ASCII safe so it only costs 2x to encode. The ASCII safe version of JSON strings if using raw encoding costs 6x for the `\uXXXX` format. The UTF-8 JSON encoding is slightly better than hex since lower values only cost one byte and higher values cost two bytes in UTF-8, but the output is very ugly and dangerous to copy-paste.
Binary could also be encoded in JSON as hex strings or base64 strings. In all cases the consumer would be missing the type tag and would need to know if it's supposed to be interpreted as binary which is why the nibs-text format is preferred when possible.
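To compare the costs side by side, here is a small sketch that measures each encoding in bytes, including the surrounding `<>` or quotes. The function `encodedSizes` is hypothetical, and it relies on the global `btoa` and `TextEncoder`, which recent Node versions and browsers provide.

```javascript
// Hypothetical helper: output size in bytes for each way of writing binary.
function encodedSizes(bytes) {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  // Spread is fine for small inputs; large buffers would need chunking.
  const raw = String.fromCharCode(...bytes);
  const utf8Json = JSON.stringify(raw);
  const asciiJson = utf8Json.replace(
    /[\u0080-\uffff]/g,
    (unit) => "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
  return {
    nibsText: hex.length + 2, // <hex>
    rawAsciiJson: asciiJson.length, // "\uXXXX..." is pure ASCII
    rawUtf8Json: new TextEncoder().encode(utf8Json).length,
    base64Json: btoa(raw).length + 2, // "base64"
  };
}

console.log(encodedSizes(new Uint8Array([0xde, 0xad, 0xbe, 0xef])));
```

For the four bytes of `deadbeef` this gives 26 bytes for the ASCII raw form and 10 for the UTF-8 raw form, matching the REPL session above.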
Here is the full proposed syntax as a railroad diagram. I'm not happy about the amount of duplication between `integer` and `float`. This will require some state and/or lookahead in parsers.
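The lookahead in question can be sketched like this: scan the shared integer prefix, then peek one character past a `.` to decide whether the token is a float. The function `scanNumber` and its return shape are hypothetical, and the sketch assumes no exponent syntax and handles `nan`, `inf`, and `-inf` elsewhere.

```javascript
// Hypothetical scanner for one number token starting at `start`.
const isDigit = (c) => c !== undefined && c >= "0" && c <= "9";

function scanNumber(text, start = 0) {
  let i = start;
  if (text[i] === "-") i++;
  const digitsStart = i;
  while (isDigit(text[i])) i++;
  if (i === digitsStart) throw new SyntaxError("expected digit at " + i);
  // Lookahead: only a "." followed by a digit turns this into a float.
  if (text[i] === "." && isDigit(text[i + 1])) {
    i++;
    while (isDigit(text[i])) i++;
    return { type: "float", value: Number(text.slice(start, i)), end: i };
  }
  return { type: "integer", value: BigInt(text.slice(start, i)), end: i };
}

console.log(scanNumber("-1.5]"), scanNumber("42,"));
```

Sharing the prefix scan this way avoids duplicating the digit loop, at the cost of not knowing the token type until the prefix ends.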
Also note the change in `list` vs `array`, where the JSON array syntax maps to nibs array and nibs `list` is renamed `tuple` since it's using parentheses.
Hmm, this is still not good enough. The strings without quotes conflict with the keyword-based values `true`, `false`, `null`, `nan`, `inf`. Maybe they can use a `$` prefix or just be dropped?
Also I forgot `nan`, `inf`, and `-inf` in the diagram.
String without quotes is a pita, half the yaml has quotes anyway
Yeah, let's just remove it. Less is more.
Initial stab at text spec here https://github.com/creationix/nibs/blob/add-text-format/docs/text-format.md
Proposed spec in PR #8
We should also have a text format for disassembled nibs to enable tools like this https://geraintluff.github.io/cbor-debug/
The 3 formats should be able to be converted between each other.
- text (json-like, best for humans, but doesn't encode indices and so loses some information)
- binary (best for computers)
- assembly (disassembled binary, preserves exact indices types and displays human readable)