Compression for serialized objects
dselman opened this issue · 1 comment
Feature Request 🛍️
Support compression of serialized Concerto objects.
Use Case
ASTs, and serialized objects in general, are verbose. They compress well due to repeated JSON properties, such as $class.
Possible Solution
Provide compress/decompress functions within Concerto core or util.
Context
- When working with large models, we hit HTTP timeouts or other storage issues.
Detailed Description
Two approaches, which may be complementary, have been explored.
Class Map
This specifically targets the $class properties within the JSON objects produced by the Serializer. The JSON tree is visited to build a Map of all $class values in the JSON. $class entries that start with the same prefix as the root $class are shortened by removing the common prefix.
This map is used to replace the $class properties with indexes into the map, resulting in a JSON object that looks like:
{
  "$class": "1",
  "models": [
    {
      "$class": "2",
      "decorators": [],
      "namespace": "test@1.0.0",
      "imports": [],
      "declarations": [
        {
          "$class": "3",
          "name": "SSN",
          "location": {
            "$class": "4",
            "start": {
              "offset": 22,
              "line": 3,
              "column": 1,
              "$class": "5"
            },
            "end": {
              "offset": 124,
              "line": 9,
              "column": 1,
              "$class": "5"
            }
          }
        }
      ]
    }
  ],
  "$version": 1,
  "$classMap": {
    "1": ".Models",
    "2": ".Model",
    "3": ".StringScalar",
    "4": ".Range",
    "5": ".Position",
    "6": ".Decorator",
    "7": ".ConceptDeclaration",
    "8": ".StringProperty",
    "9": ".DecoratorString",
    "10": ".ObjectProperty",
    "11": ".TypeIdentifier",
    "12": ".IntegerProperty",
    "13": ".MapDeclaration",
    "14": ".StringMapKeyType",
    "15": ".StringMapValueType",
    "16": ".EnumDeclaration",
    "17": ".EnumProperty"
  },
  "$prefix": "concerto.metamodel@1.0.0"
}
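For illustration, here is a minimal sketch (in TypeScript) of the Class Map transform described above. The function name and the placement of $version, $classMap and $prefix are assumptions inferred from the example output, not current Concerto API:

function compressClassMap(root: any): any {
    // Derive the common prefix (namespace + version) from the root $class,
    // e.g. "concerto.metamodel@1.0.0.Models" -> "concerto.metamodel@1.0.0".
    const prefix = typeof root.$class === 'string'
        ? root.$class.substring(0, root.$class.lastIndexOf('.'))
        : '';
    const classMap: Record<string, string> = {};
    const indexes = new Map<string, string>();

    const visit = (node: any): any => {
        if (Array.isArray(node)) {
            return node.map(visit);
        }
        if (node && typeof node === 'object') {
            const result: any = {};
            for (const [key, value] of Object.entries(node)) {
                if (key === '$class' && typeof value === 'string') {
                    // Shorten values that share the root prefix: ".Models", ".Range", ...
                    const shortName = value.startsWith(prefix)
                        ? value.substring(prefix.length)
                        : value;
                    let index = indexes.get(shortName);
                    if (index === undefined) {
                        index = String(indexes.size + 1);
                        indexes.set(shortName, index);
                        classMap[index] = shortName;
                    }
                    result[key] = index;
                } else {
                    result[key] = visit(value);
                }
            }
            return result;
        }
        return node;
    };

    const compressed = visit(root);
    compressed.$version = 1;
    compressed.$classMap = classMap;
    compressed.$prefix = prefix;
    return compressed;
}

Decompression would invert $classMap and re-attach $prefix to every shortened entry.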
LZ Compression
LZ compression is applied to the JSON object (either the source object as-is, or the object after the Class Map transform has been applied), resulting in a JSON object that looks like:
{
"compressed": "ᯡࠩƬ䌦㧤Ɛ䄣氧ァ☢㠥暠㨡㛻熤娠䷒䀠䁦ᄠ၌䛛ࠣK嚴≄ú ",
"format": "LZ_UTF16"
}
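The "LZ_UTF16" format string suggests the lz-string npm package's compressToUTF16 codec; assuming that package, the wrapper above could be produced and consumed as sketched here (the function names are hypothetical):

import LZString from 'lz-string';

// Compress a serialized object into the { compressed, format } wrapper shown
// above; recording the format lets the reader pick the matching codec later.
function compressLZ(obj: object): { compressed: string; format: string } {
    return {
        compressed: LZString.compressToUTF16(JSON.stringify(obj)),
        format: 'LZ_UTF16'
    };
}

function decompressLZ(wrapper: { compressed: string; format: string }): object {
    if (wrapper.format !== 'LZ_UTF16') {
        throw new Error(`Unsupported format: ${wrapper.format}`);
    }
    const json = LZString.decompressFromUTF16(wrapper.compressed);
    if (!json) {
        throw new Error('Decompression failed');
    }
    return JSON.parse(json);
}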
Results
- Class Map only: approximately 1.6x compression
- Class Map + LZ: approximately 12x compression
- LZ only: approximately 10x compression
For LZ compression, how is the byte stream converted into a string in your example? I think you'd want to consider two things.
- First, you may want to consider whether the encoding can produce strings that are invalid UTF-16 (e.g. strings containing unpaired surrogates), since UTF-16 is what many languages use to hold strings in memory. Invalid UTF-16 strings aren't necessarily bad, because they rarely cause actual problems, but it's worth investigating if you're going to use high Unicode characters like that.
- Second, given that UTF-8 is essentially the universal standard for text encoding in storage and transmission, you'd want to be careful that encoding the compressed bytes into a string produces Unicode code points that encode efficiently into UTF-8. Code points 0-127 encode into a single byte, wasting 12.5% of the bits; code points 128-2047 encode into two bytes, wasting 31.25%; code points 2048-65535 encode into three bytes, wasting 33.33%; and higher code points encode into four bytes, wasting about 34.4%. Basically, you'd be better off encoding everything into low ASCII, except that low ASCII has 34 characters that must be escaped in JSON (the 32 control characters plus the quote and the backslash, each taking two or six bytes instead of one), including the NUL character, which causes problems with many systems because much code treats NUL as a terminator. (Postgres, for example, can't store a string containing NUL.) So you should exclude those 34 characters from the alphabet, especially NUL. The remaining 94 characters carry log2(94) ≈ 6.55 bits each, so low ASCII wastes almost 18.1% of the bits at best, and reaching that minimum requires a sophisticated algorithm. Among simple algorithms, base-64 wastes 25% and base-85 wastes 20%. Since 20% is close enough to 18.1%, base-85 with a JSON-safe alphabet such as Z85 (which, unlike the standard Ascii85 alphabet, contains neither the quote nor the backslash) is probably your best bet; see the sketch below.
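To make that concrete, here is a sketch of packing bytes into a JSON-safe base-85 string using the Z85 character set. The truncation-based handling of trailing bytes is a simplification (standard Z85 requires input lengths that are multiples of four):

// The 85-character Z85 alphabet: contains no quote, backslash, or control
// characters, so the output never needs escaping inside a JSON string.
const Z85_ALPHABET =
    '0123456789abcdefghijklmnopqrstuvwxyz' +
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#';

function encodeBase85(bytes: Uint8Array): string {
    let out = '';
    for (let i = 0; i < bytes.length; i += 4) {
        const chunk = bytes.subarray(i, i + 4);
        const pad = 4 - chunk.length;
        // Accumulate up to 4 bytes into one 32-bit big-endian value,
        // zero-padding a short final chunk.
        let value = 0;
        for (let j = 0; j < 4; j++) {
            value = value * 256 + (chunk[j] ?? 0);
        }
        // 85^5 > 2^32, so five base-85 digits cover any 4-byte chunk:
        // 40 output bits carry 32 input bits, i.e. 20% of the bits are wasted.
        let digits = '';
        for (let j = 0; j < 5; j++) {
            digits = Z85_ALPHABET[value % 85] + digits;
            value = Math.floor(value / 85);
        }
        // Drop one output character per missing input byte (simplified padding).
        out += digits.slice(0, 5 - pad);
    }
    return out;
}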
That said, if you're considering LZ-type compression at all, you might store the result natively as binary if the storage system can handle it, rather than encoding it into a string, encoding the string into JSON, and encoding the JSON into UTF-8. Storing as binary wastes 0% of the bits. Cosmos DB supported binary attachments, but attachments are deprecated and Microsoft recommends moving to Azure Blob Storage instead, which has the downside of needing to talk to two services. You might consider a different database than Cosmos DB if you're going to store a lot of binary data.
For the class map:
- You'd almost certainly benefit more from a general string table that applies to both property names and property values.
- The string table could be represented more efficiently in JSON as an array rather than a dictionary.
- For $class properties, I'd suggest a mechanism that's probably better than a prefix in general (in many cases the prefix will be empty, or will be only "com." or similar): split the $class into namespace+version and type, and index both into the string table separately. "concerto.metamodel@1.0.0.ConceptDeclaration" might become "0.1" (where 0 is the index of "concerto.metamodel@1.0.0" and 1 is the index of "ConceptDeclaration"). Abbreviated $class values, when implemented, would be recognized by not having a period. (See the sketch after this list.)
- For indexes into the string table, it'd be a good idea to use a base-93 or so alphabet to represent the indexes rather than base-10. That will compress the indexes much better if the string table ends up becoming large.
- I can also think of a way to avoid the overhead of having a separate string table - which would eliminate the string table section entirely while increasing the benefits - but it requires preservation of property order, which can sometimes be tricky. (I don't know if Cosmos DB will preserve property order - probably not - but if you're using LZ encoding on top then you could do it.) Avoiding a separate string table also has the benefit that it enables the data to be processed in a streaming fashion; otherwise, the client has to receive the string table before it can understand the data.
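Here is a sketch of the $class splitting referenced from the third bullet above. All names are hypothetical; the string table is stored as an array, per the second bullet, and the indexes are plain base-10 (a base-93 alphabet, as suggested, would shrink them further):

function abbreviateClasses(root: any): { data: any; strings: string[] } {
    const strings: string[] = [];
    const indexOf = new Map<string, number>();

    // Intern a string into the shared table, returning its array index.
    const intern = (s: string): number => {
        let i = indexOf.get(s);
        if (i === undefined) {
            i = strings.length;
            strings.push(s);
            indexOf.set(s, i);
        }
        return i;
    };

    const visit = (node: any): any => {
        if (Array.isArray(node)) {
            return node.map(visit);
        }
        if (node && typeof node === 'object') {
            const result: any = {};
            for (const [key, value] of Object.entries(node)) {
                if (key === '$class' && typeof value === 'string') {
                    // "concerto.metamodel@1.0.0.ConceptDeclaration" splits at
                    // the last '.' into namespace+version and type name.
                    const split = value.lastIndexOf('.');
                    const ns = intern(value.substring(0, split));
                    const type = intern(value.substring(split + 1));
                    result[key] = `${ns}.${type}`; // e.g. "0.1"
                } else {
                    result[key] = visit(value);
                }
            }
            return result;
        }
        return node;
    };

    return { data: visit(root), strings };
}

Run over the metamodel example, this would intern "concerto.metamodel@1.0.0" once at index 0 and each type name once, with every subsequent occurrence shrinking to a short index pair.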