PanDAWMS/dkb

pyDKB: multiple JSON messages in one output file, or "do we really need JSON Array"?


In the current version of pyDKB (#22), if we run JSONProcessorStage in file output mode (putting output messages into a file), we'll get a file that looks like this:

{{{
{first JSON message}
{second JSON message}
...
}}}

But we would expect to get something like:

{{{
[
{first JSON message},
{second JSON message},
...
]
}}}

This happens due to the message-oriented architecture: it treats every message individually and is not aware of the fact that the messages are grouped together because they share the same output file (and thus must be put into a JSON array before output).

The main question here is: do we really need this JSON Array and why?

I see the following ways to solve the issue:

  1. (dirty way) By making quite a lot of the inner (semi-private) variables accessible from outside the abstract processor class

We won't really join the messages into a JSON array, but will mimic the behaviour by changing the delimiter (adding a comma between messages) and writing '[' and ']' at the beginning and the end of the file.
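For illustration, a minimal sketch of this mimicking (the ArrayMimickingWriter class below is hypothetical, not part of pyDKB):

{{{
import json

class ArrayMimickingWriter(object):
    """Write messages so that the resulting file reads as a JSON array."""

    def __init__(self, stream):
        self.stream = stream
        self.first = True
        # Opening bracket at the beginning of the file.
        self.stream.write('[\n')

    def write_message(self, msg):
        # Changed delimiter: a comma before every message but the first.
        if not self.first:
            self.stream.write(',\n')
        self.first = False
        self.stream.write(json.dumps(msg))

    def close(self):
        # Closing bracket at the end of the file.
        self.stream.write('\n]\n')
}}}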

  2. (right way) By rethinking the architecture:
  • move all the input/output functionality into new classes: InputMessages and OutputBuffer
  • make those new classes aware of the message types (JSONInputMessages, TTLInputMessages, etc.)
  • replace the JSONProcessor, TTLProcessor and JSON2TTLProcessor stages with a stage constructor, so that ProcessorStage(messageType.JSON, messageType.TTL) would return an object similar to an instance of JSON2TTLProcessor

This would allow more accurate adjustment of the input/output processes for different data types.

This is the right way, and one day it must be done (because there already are some other difficulties that this approach might simplify).
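As a rough sketch of what this constructor-based design could look like (only the names come from the list above; all interfaces here are assumptions):

{{{
class messageType(object):
    """Enum-like set of supported message types."""
    JSON = 'json'
    TTL = 'ttl'

class InputMessages(object):
    """Decodes raw input into messages of a given type."""
    def __init__(self, msg_type):
        self.msg_type = msg_type

class OutputBuffer(object):
    """Encodes and groups outgoing messages of a given type."""
    def __init__(self, msg_type):
        self.msg_type = msg_type

class ProcessorStage(object):
    """Generic stage configured by input/output message types."""
    def __init__(self, input_type, output_type):
        self.input = InputMessages(input_type)
        self.output = OutputBuffer(output_type)

# Would behave like the current JSON2TTLProcessor:
stage = ProcessorStage(messageType.JSON, messageType.TTL)
}}}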

  3. (cheat way) By rethinking the expectations

If we can agree that input JSON files are not (or not only) true JSON, but (can be) NDJSON (Newline Delimited JSON), then the issue is almost solved: NDJSON is exactly what we have in the output right now, and the only change needed is to adjust the input methods to accept NDJSON files.
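A minimal sketch of such an input method, assuming one JSON document per non-empty line (the read_ndjson helper is hypothetical, not the actual pyDKB code):

{{{
import json

def read_ndjson(stream):
    """Yield one decoded message per non-empty input line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
}}}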


To me the last way looks the best: easy to do (even easier than the first one) and "clean" in terms of the architecture.
But it means that we slightly change the format of the transitional files. This will have no effect on the global picture or on the pyDKB-using programs, but if those files are used by applications other than the Dataflow Stages, there may be a bit of a surprise.


People working with Dataflow (@maria-grigorieva, @Evildoor, @anastasiakaida), what do you think?

I'm not sure that changing input/output files from true JSON, which is widely acknowledged and supported, to something else is a good idea (at least over something as small as this). If it's a quick temporary solution that will not affect the rest of the workflow and will be replaced later, then it will do, I suppose. But if it were up to me, I would prefer the second option, unless it turned out to be way too difficult and time-consuming.

@Evildoor ,

NDJSON (http://ndjson.org/) is also quite a popular solution: https://en.wikipedia.org/wiki/JSON_Streaming

@maria-grigorieva gave a nod to try this approach; the results can be seen in commit 9eab96a (which doesn't mean it will necessarily go into the final version).

We can now try how it works and see if it has any side effects we haven't foreseen.

Again, it should have no effect on anything outside pyDKB, except that the output files will become less human-readable.

Adding the --pretty-print option is also not so easy in the current architecture, but after the changes suggested in item 2 it should become easier. If there is an urgent need for it, I'd suggest putting it into a separate issue.
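For reference, what such an option would boil down to is the indent argument of the standard JSON encoder (a sketch, not pyDKB code; the option itself does not exist yet):

{{{
import json

msg = {'first': 1, 'second': 2}
print(json.dumps(msg))             # compact: one message per line
print(json.dumps(msg, indent=2))   # pretty-printed: multi-line
}}}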
I'm afraid I won't be able to rethink the library's inner structure, implement it the new way and test it properly before the so-called "Dataflow Deadline" (June 30), but I will definitely do it later.