/jsonsl

Embeddable, Fast, Streaming, Non-Buffering JSON Parser

Primary LanguageCMIT LicenseMIT

JSONSL

JSON Stateful (or Simple, or Stacked, or Searchable, or Streaming) Lexer

Why another (and yet another) JSON lexer?

I took inspiration from some of the uses of YAJL, which looked quite nice, but whose build system seemed unusable, source horribly mangled, and grown beyond its original design. In other words, I saw it as a bunch of cruft.

Instead of bothering to spend a few days figuring out how to use it, I came to a conclusion that the tasks I needed (simple token notifications coupled with some kind of state shift detection), I could do with a simple, small, ANSI C embeddable source file.

I am still not sure if YAJL provides the featureset of JSONSL, but I'm guessing I've got at least some innovation.

JSONSL

Inspiration was also taken from Joyent's http-parser project, which seems to use a similar, embeddable, and simple model.

Here's a quick featureset

Stateful

Maintains state about current descent/recursion/nesting level Furthermore, you can access information about 'lower' stacks as long as they are activ.

Decoupling Object Graph from Data

JSONSL abstracts the object graph from the actual (and usually more CPU-intensive) work of actually populating higher level structures such as "hashes" and "arrays" with "decoded" and "meaningful" values. Using this, one can implement an on-demand type of conversion.

Callback oriented, selectively

Invokes callbacks for all sorts of events, but you can control which kind of events you are interested in receiving without writing a ton of wrapper stubs

Non-Buffering

This doesn't buffer, copy, or allocate any data. The only allocation overhead is during the initialization of the parser, in which the initial stack structures are initialized

Simple

Just a C source file, and a corresponding header file. ANSI C.

While attempts will be made to add functionality and reduce boilerplate in your code, the core functions are simple and clearly defined.

Add-ons (see below) are available (and exist in the same jsonsl.c file)

JSONPointer search add-on

Use JSONPointer to query JSON streams as they arrive. Quite efficient, and very simple (see jpr_test.c for examples)

Unescaping utility add-on

Includes a nice little function which can flexibly unescape JSON strings to match your specifications.

The rest of this documentation needs work

Details

Terminology

Because the JSON spec is quite confusing in its terminology, especially when we want to map it to a different model, here is a listing of the terminology used here.

I will use element, object, state interchangeably. They all refer to some form of atomic unit as far as JSON is concerned.

I will use the term hash for those things which look like {"foo":"bar"}, and refer to its contents as keys and values

I will use the term list for those things which look like ["hello", "byebye"], and their contents as list elements or array elements explicitly

Model

States

A state represents a JSON element, this can be a a hash (T_OBJECT), array (T_LIST), hash key (T_HKEY), string (T_STRING), or a 'special' value (T_SPECIAL) which should be either a numeric value, or one of true, false, null.

A state comprises and maintains the following information

Type

This merely states what type it is - as one of the JSONSL_T_* constants mentioned above

Positioning

This contains positioning information mapping the location of the element as an offset relative to the input stream. When a state begins, its start position is set. Whenever control returns back to the state, its current position is updated and set to the point in the stream when the return occured

Extended Information

For non-scalar state types, information regarding the number of children contained is stored.

User Data

This is a simple void* pointer, and allows you to associate your own data with a given state

Stack

A stack consists of multiple states. When a state begins, it is pushed to the stack, and when the state terminates, it is popped from the stack and returns control to the previous stack state.

When a state is popped, the contained information regarding positioning and children is complete, and it is therefore possible to retrieve the entire element in its byte-stream.

Once a state has been popped, it is considered invalid (though it is still valid during the callback).

Below is a diagram of a sample JSON stream annotated with stack/state information.

Level 0
   {

   Level 1

       Level 2
           "ABC"
       :
       Level 2
           "XYZ"
       ,

   Level 1

       [
       Level 2

           {
           Level 3

               Level 4
               "Foo":"Bar"

           Level 3
           }
       Level 2
       ]
   Level 1
   }

USING

The header file jsonsl.h contains the API. Read it.

As an additional note, you can 'extend' the state structure (thereby eliminating the need to allocate extra pointers for the void *data field) by defining the JSONSL_STATE_USER_FIELDS macro to expand to additonal struct fields.

This is assumed as the default behavior - and should work when you compile your project with jsonsl.c directly.

If you wish to use the 'generic' mode, make sure to #define or -D the JSONSL_STATE_GENERIC macro.

Some notes regarding usage will follow:

Position and Offset Tracking

The state object contains some pos variables. These variables contain the position relative to the amount of total bytes that the jsonsl_t object has been fed since creation (or since reset) has been called. Thus, in order to make sense of these variables, you must do one of two things

This way, the offsets declared in the jsn->pos and pos_begin variables can be directly applied as offsets to the actual buffer.

Of course this is not the recommended option; since jsonsl is a streaming parser, you are likely using it because you don't want to buffer the entire stream

Note the first valid position in the existing buffer

This technique requires the user to keep track of the first valid position within the current buffer. This is useful for tracking the beginnings and ends of strings.

Typically you will need a simple function or macro and some variables which do the following:

  • Contain the minimum valid position in the buffer, e.g. min_available

    This is initially set to 0, and increases as we discard data (see later)

  • Allow callbacks to request an advancement of the position. This means that your context object contains a "min_needed" variable. For example, one might have a PUSH callback for the beginning of a string. The push callback will set the min_needed variable to the position of the beginning of the string (i.e. state->pos_begin). In a corresponding POP callback, the string is read from an internal buffer (whose first valid position is no greater than state->pos_begin) with a length of jsn->pos - state->pos_begin bytes. Once the string is read, it is no longer needed, and the callback then updates the min_needed variable to the parser's pos.

  • After jsonsl_feed is called, determine if the input buffer needs to be adjusted. This means to determine whether the min_needed variable has been set to something larger than the min_available variable. If this condition is true, it means part of the buffer can be discarded. The amount of bytes to discard from the beginning will be the difference between these two variables. The length of the buffer also becomes shorted by the difference.

    Once the bytes are discarded (one can use a simple memmove), the min_available variable is set to the min_needed value.

  • To demonstrate this, let's make a sample structure:

    struct parse_context {
        size_t min_needed;
        size_t min_available;
    
        char *buffer;
        size_t buffer_len;
    }

    The buffer is the buffer which is available to the callbacks (e.g. by making this struct be the value of the data field in the jsonsl_t).

    It is possible to write a simple function which will get a slice of the buffer, given the absolute offsets from the state variable:

    void get_state_buffer(struct parse_context *ctx, struct jsonsl_state_st *state)
    {
        size_t offset = state->pos_begin - ctx->min_available;
        return ctx->buffer + offset;
    }

    Of course this function would probably like to do some error checking to ensure that for example, the state's pos_begin is not less than the min_available of the ctx.

Notes on String States

It is possible to get the length of a string by getting the difference between its state's pos_begin and the parser's pos variable. it should be noted that the pos_begin points to the position of the opening " (quote) and the pos points to the position of the closing " (quote). Thus to get the actual raw string, one must increase the buffer pointer and decrease the length.

The logic may be encapsulated in a macro

#define NORMALIZE_OFFSETS(buf, len) (buf)++; (len)++;
/* use it */
char *buf = get_state_buffer(ctx, state);
size_t len = jsn->pos - state->pos_begin;
NORMALIZE_OFFSETS(buf, len);

Note that care should be taken not to perform this on literals like numbers, booleans, and nulls.

Notes on jsonpointer

The jsonpointer implementation is designed to work with a stream and works very nicely with the callbacks. It relies on the caller incrementally providing jpr with information (e.g. via jsonsl_jpr_match_state) about each element in the JSON tree.

Interally it builds a graph based on inputs from each item in the JSON tree; relying on the fact that an item will only be a MATCH_POSSIBLE or MATCH_COMPLETE if its parent was also a MATCH_POSSIBLE.

Thus the jpr functions must be fed with hierarchical data and information.

In general, items need their key information. The key exists as the following

Object (dictionary) values are passed with their keys

This means you must buffer the keys

Array elements are passed with their indices

jsonsl_jpr_match_state does this for you automatically

Primitives without any children are not passed

jsonpointer only makes sense when searching for data under a key. Passing a primitive (i.e. boolean, number, or non-key string) does not make sense

UNICODE

While JSONSL does not support unicode directly (it does not decode \uxxx escapes, nor does it care about any non-ascii characters), you can compile JSONSL using the JSONSL_USE_WCHAR macro. This will make jsonsl iterate over wchar_t characters instead of the good 'ole char. Of course you would need to handle processing the stream correctly to make sure the multibyte stream was complete.

NaN, Infinity, -Infinity

By default, JSONSL does not consider objects like {"n": NaN}, {"n": Infinity}, or {"n": -Infinity} to be valid. Compile with JSONSL_PARSE_NAN defined to parse these non-numbers. JSONSL will then execute your POP callback with state->special_flags set to JSONSL_SPECIALf_NAN when it parses NaN, JSONSL_SPECIALf_INF for Infinity, and JSONSL_SPECIALf_INF | JSONSL_SPECIALf_SIGNED for -Infinity.

WINDOWS

JSONSL Now has a visual studio .sln and .vcxproj files in the vs directory.

If you wish to use JSONSL as a DLL, be sure to define the macro JSONSL_DLL which will properly decorate the prototypes with __declspec(dllexport).

You can also run the tests on windows using the jsonsl-tests project. You will need to manually pass in the sample input files to be tested, however. In the future, I hope to automate this process.

API AND ABI STABILITY, AND PACKAGING NOTES

JSONSL will attempt to maintain a stable API, but not a stable ABI.

The general distribution model of JSONSL (A single source file and a single header) is designed in such a manner to allow the application to embed the relevant parts in the application.

For speed benefits it may also be desirable to actually have the jsonsl.c file embedded in the same translation unit as the user-side code calling into JSONSL itself. In such use cases, the JSONSL_API macro may be defined as static inline.

AUTHOR AND COPYRIGHT

Copyright (C) 2012-2017 Mark Nunberg.

See LICENSE for license information.