sjson (WIP)

Python: Stream JSON parser/merger, designed to work for large/many JSON files.

Stream Merging Large JSON Files

Steps:

  1. Define a JSON structure (with no actual values) that uses merging patterns as placeholders
  2. Generate an independent cache file for each merging pattern (list or dict)
  3. Read an actual JSON file and, following the pattern (path), stream its content into the cache files
  4. Repeat the above step for every input JSON file
  5. Read each cache file (as a stream), find the corresponding pattern position, and write its content (as a stream) over the pattern symbol (sketched below)
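
Step 5 can stay streaming on the output side as well. The sketch below is illustrative only (write_merged and the pattern-to-cache mapping are made up, not the project's API); it copies the structure file line by line and splices each cache file's content over its pattern symbol, chunk by chunk:

def write_merged(struct_path, out_path, pattern_to_cache):
    # Copy the structure file line by line; whenever a line contains a
    # pattern symbol, stream the matching cache file into its place.
    with open(struct_path) as struct, open(out_path, "w") as out:
        for line in struct:
            for pattern, cache_path in pattern_to_cache.items():
                if pattern in line:
                    prefix, suffix = line.split(pattern, 1)
                    out.write(prefix)
                    with open(cache_path) as cache:
                        for chunk in iter(lambda: cache.read(64 * 1024), ""):
                            out.write(chunk)  # at most one chunk in memory
                    out.write(suffix)
                    break
            else:
                out.write(line)

# e.g. write_merged("struct.json", "result.json",
#                   {'"<replace-by-list>"': "products.cache"})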

Assumptions:

  1. The exact JSON structure must be known before merging
  2. If the structure (or part of it) is unknown, the corresponding values are simply overridden by the last file read

Structure example (struct.json):

{
    "code": 200,
    "report": "Person Profile",
    "name": "<replace-by-str-override>",
    "age": "<replace-by-int-max>",
    "count": "<replace-by-int-sum>",
    "products": "<replace-by-list>",
    "properties": "<replace-by-dict>",
    "subsets": {
    	"deep-nest": "<replace-by-dict>"
	}
}
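
The pattern strings decide how the values for a key are combined across input files. A rough Python sketch of their semantics, as read from the examples below (illustrative only, not code from this repository):

def deep_merge(old, new):
    # <replace-by-dict>: union of keys; colliding lists are concatenated,
    # nested dicts are merged recursively, anything else is overridden.
    merged = dict(old)
    for key, value in new.items():
        if isinstance(merged.get(key), list) and isinstance(value, list):
            merged[key] = merged[key] + value
        elif isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

MERGE_RULES = {
    "<replace-by-str-override>": lambda old, new: new,            # the latest value wins
    "<replace-by-int-max>":      lambda old, new: max(old, new),  # keep the maximum
    "<replace-by-int-sum>":      lambda old, new: old + new,      # running total
    "<replace-by-list>":         lambda old, new: old + new,      # concatenate the lists
    "<replace-by-dict>":         deep_merge,                      # deep-merge the dicts
}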

Input file examples (a.json and b.json):

# a.json
{
    "code": 200,
    "report": "Person Profile",
    "name": "Jason",
    "age": 30,
    "count": 2,
    "products": [
        {"name": "a", "price": 1},
        {"name": "b", "price": 2}
    ],
    "properties": {
        "books": ["a", "b", "c"],
        "games": [1, 2, 3]
    },
    "subsets": {
        "deep-nest": {
            "students": ["a", "b", "c"]
        }
    },
    "override": [
        "a": "a",
        "b": "b"
    ]
}

# b.json
{
    "code": 200,
    "report": "Person Profile",
    "name": "Sam",
    "age": 40,
    "count": 3,
    "products": [
        {"name": "c", "price": 3},
        {"name": "d", "price": 4}
    ],
    "properties": {
        "books": ["d", "e", "f"],
        "games": [4, 5, 6]
    },
    "subsets": {
        "deep-nest": {
            "friends": ["d", "e", "f"]
        }
    },
    "override": [
        "c": "c",
        "d": "d"
    ]
}

Result example (result.json):

{
    "code": 200,
    "report": "Person Profile",
    "name": "Sam",
    "age": 40,
    "count": 5,
    "products": [
        {"name": "a", "price": 1},
        {"name": "b", "price": 2},
        {"name": "c", "price": 3},
        {"name": "d", "price": 4}
    ],
    "properties": {
        "books": ["a", "b", "c", "d", "e", "f"],
        "games": [1, 2, 3, 4, 5, 6]
    },
    "subsets": {
        "deep-nest": {
            "students": ["a", "b", "c"],
            "friends": ["d", "e", "f"]
        }
    },
    "override": [
        "c": "c",
        "d": "d"
    ]
}
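
To make the expected output checkable, here is a small in-memory (non-streaming) reference sketch that reproduces the result above from struct.json, a.json and b.json. It is not the project's implementation; the function names are made up for illustration:

import json

def merge_value(pattern, old, new):
    # Same per-key semantics as the MERGE_RULES sketch above, inlined so
    # that this block runs on its own.
    if pattern == "<replace-by-int-max>":
        return max(old, new)
    if pattern in ("<replace-by-int-sum>", "<replace-by-list>"):
        return old + new
    if pattern == "<replace-by-dict>":
        merged = dict(old)
        for key, value in new.items():
            if isinstance(merged.get(key), list) and isinstance(value, list):
                merged[key] = merged[key] + value
            elif isinstance(merged.get(key), dict) and isinstance(value, dict):
                merged[key] = merge_value("<replace-by-dict>", merged[key], value)
            else:
                merged[key] = value
        return merged
    return new  # <replace-by-str-override>, plain values, unknown keys

def merge(struct, old, new):
    # Walk the structure; nested dicts in struct.json carry their own patterns,
    # keys without a pattern are simply overridden.
    result = dict(old)
    for key, value in new.items():
        pattern = struct.get(key)
        if isinstance(pattern, dict):
            result[key] = merge(pattern, old.get(key, {}), value)
        elif key in old:
            result[key] = merge_value(pattern, old[key], value)
        else:
            result[key] = value
    return result

struct = json.load(open("struct.json"))
result = json.load(open("a.json"))
result = merge(struct, result, json.load(open("b.json")))
print(json.dumps(result, indent=4))  # same content as result.json above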

Python code:

import sjson

struct = open('struct.json').read()  # the structure can also be an empty '{}'
stream = sjson.Stream(struct=struct)

d1 = stream.loads(path='data-from-csv-1.json')  # ==> writes the matching content into the per-pattern cache files
d2 = stream.loads(path='data-from-csv-2.json')  # ==> writes the matching content into the per-pattern cache files

stream.dumps(path='final.json')  # ==> merges the cache files into the final output

Algorithm:

  1. Flatten struct.json into a single level (sketched below)
  2. Generate a "command" for each key in the flattened JSON
  3. Generate temporary names/files for each command
  4. Use the original structure JSON as the "base data"
  5. Traverse the key/value pairs of each input JSON
  6. Compare each key/value pair against the flattened JSON to see whether there is a command for it
  7. Follow the command for the current key and write the current value into the corresponding cache
  8. Continue the traversal until the file is finished
  9. Repeat the same operations for each file until all files are processed
  10. Replace each "command/pattern" with the content from its cache file
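
Steps 1-3 might look roughly like this (illustrative only; the dotted path keys and the cache-file naming are assumptions, not the project's actual format):

def flatten(struct, prefix=""):
    # Step 1: flatten the structure into one level, keyed by a dotted path.
    flat = {}
    for key, value in struct.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

flat = flatten({
    "count": "<replace-by-int-sum>",
    "subsets": {"deep-nest": "<replace-by-dict>"},
})
# {'count': '<replace-by-int-sum>', 'subsets.deep-nest': '<replace-by-dict>'}

# Steps 2-3: one "command" (merge rule) and one cache file for every flattened
# key that actually holds a merging pattern.
commands = {
    path: {"rule": pattern, "cache": path.replace(".", "_") + ".cache"}
    for path, pattern in flat.items()
    if isinstance(pattern, str) and pattern.startswith("<replace-by-")
}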