/canonicaljson-spec

Specification of canonical-form JSON for equivalence comparison.

Primary LanguageAwk

JSON Canonical Form

RFC 7159 defines JSON as "a text format for the serialization of structured data", but allows many distinct serializations to describe the same data. Such human-friendly flexibility can hinder machine treatment of JSON text, particularly when it is used as input for cryptographic hash functions that are expected to yield identical results for logically equivalent input (as is the case in computation of digital signatures). This specification defines a unique canonical form for every JSON value, the result being safe for comparison (in that logically equivalent structured data are guaranteed to have the same canonical form).

Table of contents

Definition

JSON text in canonical form:

  1. MUST be encoded in UTF-8
  2. MUST NOT include insignificant (i.e., inter-token) whitespace (defined in section 2 of RFC 7159)
  3. MUST order the members of all objects lexicographically by the UCS (Unicode Character Set) code points of their names
    1. preserving and utilizing the code points in U+D800 through U+DFFF (inclusive) for all lone surrogates
  4. MUST represent all integer numbers (those with a zero-valued fractional part)
    1. without a leading minus sign when the value is zero, and
    2. without a decimal point, and
    3. without an exponent, and
    4. without insignificant leading zeroes (as already required of all JSON numbers)
  5. MUST represent all non-integer numbers in exponential notation
    1. including a nonzero single-digit significand integer part, and
    2. including a nonempty significand fractional part, and
    3. including no trailing zeroes in the significand fractional part (other than as part of a ".0" required to satisfy the preceding point), and
    4. including a capital "E", and
    5. including no plus sign in the exponent, and
    6. including no insignificant leading zeroes in the exponent
  6. MUST represent all strings (including object member names) in their minimal-length UTF-8 encoding
    1. avoiding escape sequences for characters except those otherwise inexpressible in JSON (U+0022 QUOTATION MARK, U+005C REVERSE SOLIDUS, and ASCII control characters U+0000 through U+001F) or UTF-8 (U+D800 through U+DFFF), and
    2. avoiding escape sequences for combining characters, variation selectors, and other code points that affect preceding characters, and
    3. using two-character escape sequences where possible for characters that require escaping:
      • \b U+0008 BACKSPACE
      • \t U+0009 CHARACTER TABULATION ("tab")
      • \n U+000A LINE FEED ("newline")
      • \f U+000C FORM FEED
      • \r U+000D CARRIAGE RETURN
      • \" U+0022 QUOTATION MARK
      • \\ U+005C REVERSE SOLIDUS ("backslash"), and
    4. using six-character \u00xx uppercase hexadecimal escape sequences for control characters that require escaping but lack a two-character sequence, and
    5. using six-character \uDxxx uppercase hexadecimal escape sequences for lone surrogates

Example

{"-0":0,"-1":-1,"0.1":1.0E-1,"1":1,"10.1":1.01E1,"emoji":"😃","escape":"\u001B","lone surrogate":"\uDEAD","whitespace":" \t\n\r"}

Implementations

The following projects are known to correctly implement this specification:

If you know of any others, please submit a pull request to add them!

Validation

This repository can be used to validate any implementation.

  1. Get a local copy of this repository.
  2. Create an executable that uses a candidate implementation to output the JSON canonical form of the contents of its first argument, exiting with a status of 0 if and only if the conversion was successful.
  3. Invoke ./test.sh /path/to/executable from this repository, substituting the path to the above executable in the first argument.
  4. test.sh will provide known input and look for expected output, printing the results, exiting with a status of 0 if and only if the executable (and therefore the candidate implementation) adheres to this specification.

Prior Art

This specification updates the expired JSON Canonical Form internet draft to ensure a unique canonical representation of every JSON value.

Representation of non-integer numbers still matches the canonical float representation from section 3.2.4.2 of XML Schema Datatypes, but integer numbers now have a non-exponential representation matching integer (section 3.3.13.2) and RFC 7638 JSON Web Key (JWK) Thumbprint.

The treatment of strings generalizes section 3.3 of RFC 7638 and Keybase canonical JSON packing (both of which cryptographically hash JSON text) to cover the full range of Unicode characters.

OLPC "Canonical JSON" (which is also intended to support meaningful hashes of structured data) describes a format that is not actually JSON, because its strings are sequences of bytes rather than sequences of Unicode code points (e.g., the tab-containing string " " is conforming OLPC "Canonical JSON" but not JSON and "\t" is conforming JSON but not OLPC "Canonical JSON"). But where they overlap, this specification generalizes OLPC "Canonical JSON" to include floating point numbers and revises it for Unicode-aware string sorting.

Changelog

v1.0.0 (2017-10-17)

  • Specifed uppercase Unicode escape sequences, to match RFC 7159.

v1.0.1 (2018-07-01)

  • Updated prettyjson.awk utility for greater compatibility with non-GNU awk implementations.

v1.0.2 (2019-04-14)

  • Explicitly mentioned the prohibition of insignificant leading zeroes from RFC 7159.