kernelci/kcidb

JQ dependency is too heavy for some setups

Opened this issue · 9 comments

Depending on the jq library makes it difficult to install kcidb in restricted environments, particularly in AWS Lambda, which e.g. Tuxsuite uses.

Consider other options, e.g.:

  • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but it might be rather slow on big datasets.
  • Create separate tools with streaming support (e.g. kcidb-submit-stream, kcidb-db-dump-stream, kcidb-db-load-stream, and so on), and move them to a separate package, along with jq dependency.
  • Use a library able to print out / parse JSON objects incrementally, and input/output complete JSON objects instead.

Scratch the last option. We won't be able to interleave object types this way, unfortunately. Unless we change the schema, embedding type data into objects themselves, that is.

Also, check if the stock PyPI package is easy to install in AWS Lambda. If it is, try to get stream parsing support merged upstream again. Consider implementing parsing of file objects instead of iterators/generators, as the maintainer requested.

  • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but it might be rather slow on big datasets.

@spbnick
I'm thinking of picking this issue up, specifically using this option.

Is there extra information I need to know?

@mrbazzan, that would be fun to do indeed! If you want to do that, here are some of the requirements:

  • Do not require loading the whole source JSON text into memory for parsing at once, only chunks of limited (and optionally specified) size, e.g. 4KB-4MB (see the sketch below).
  • Be no more than 3x (or thereabouts) slower than JQ at parsing our current loads: gigabytes of JSON, split into objects with, say, 10K report objects each. I can provide a sample to run tests on. The tests for this would need to be running continuously along with development, starting from the moment parsing our input becomes possible. And yes, "3x" is picked based on a feeling of what could be acceptable; there are no other real requirements at the moment.

You can work on that in your own repo, making your own package, etc., but I would need to review the solution before using it and accepting the dependency. The likelihood of success would be increased with early feedback, though.
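
For illustration, here is a minimal sketch of the kind of chunked, standard-library-only parser the first requirement describes: it reads a stream of concatenated JSON values in bounded chunks and yields them one at a time. The function name, defaults, and file name are made up for the example; this is one possible approach, not kcidb code.

```python
import json


def parse_json_stream(file, chunk_size=4 * 1024 * 1024):
    """Yield JSON values from a text stream of concatenated values,
    reading at most chunk_size characters at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            # Skip the whitespace separating concatenated values
            buf = buf.lstrip()
            if not buf:
                break
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # The buffered value is incomplete (or invalid):
                # read another chunk and retry. Invalid input is
                # only reported once EOF is reached.
                break
            yield value
            buf = buf[end:]
    buf = buf.strip()
    if buf:
        raise json.JSONDecodeError(
            "Truncated or invalid JSON at end of stream", buf, 0)


# Example use (hypothetical file name):
# with open("kcidb-dump.json") as stream:
#     for obj in parse_json_stream(stream):
#         ...  # handle one report object at a time
```

Note that a single value larger than chunk_size still ends up buffered in full before it can be decoded, so the memory bound holds per value, not per file.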

@spbnick Okay.

A pure Python implementation for binding jq, rather than having to use jq.py, right?

Also, please kindly provide sample data to run tests on.

Also, I'm still pretty confused. I went through the jq.py repo and ...

I'm really interested in this project, and I would appreciate further guidance.

A pure Python implementation for binding jq, rather than having to use jq.py, right?

We can't really use a pure-Python implementation for binding jq, since jq itself is written in C.

So we might need to write a pure-Python (using only the standard library) parser for JSON object streams. That's all we need from JQ - the ability to parse a sequence of JSON objects without loading the whole file into memory.

Another option is maybe to work further with upstream on incorporating our changes (I got to the point of the author ignoring me 😬), or to make our own binding for jq, just for stream parsing. In either case we would need a compiled package on PyPI, and verification that e.g. AWS Lambda can handle it.

Here's a release tag of our fork of jq.py, if you're interested in that: https://github.com/kernelci/jq.py/tree/1.2.1.post1

Also, please kindly provide sample data to run tests on.

You can start with the sample I already provided, just cat all the files together, and once you get to performance tests, I can give you a larger one.
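
Once the samples are concatenated, a rough timing harness along these lines could keep the "3x" target in view during development. Everything here is illustrative: json_stream/parse_json_stream refer to the sketch earlier in this thread, sample.json stands for the concatenated sample files, and the jq command-line tool is used as a crude baseline (it also re-serializes its output, and kcidb goes through the jq Python bindings rather than the CLI, so the ratio is only a rough indication).

```python
import subprocess
import time

# Hypothetical module holding the parse_json_stream() sketch from above
from json_stream import parse_json_stream


def time_pure_python(path, chunk_size=4 * 1024 * 1024):
    """Count and time parsing a concatenated JSON dump with the pure-Python sketch."""
    start = time.perf_counter()
    count = 0
    with open(path) as stream:
        for _ in parse_json_stream(stream, chunk_size):
            count += 1
    return count, time.perf_counter() - start


def time_jq(path):
    """Time the jq CLI reading the same stream of JSON values, as a crude baseline."""
    start = time.perf_counter()
    with open(path) as stream:
        subprocess.run(["jq", "-c", "."], stdin=stream,
                       stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    sample = "sample.json"  # hypothetical: all the sample files cat'ed together
    count, py_seconds = time_pure_python(sample)
    jq_seconds = time_jq(sample)
    print(f"{count} values: pure Python {py_seconds:.1f}s, "
          f"jq {jq_seconds:.1f}s, ratio {py_seconds / jq_seconds:.1f}x")
```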

We can't really use a pure-Python implementation for binding jq, since jq itself is written in C.

So we might need to write a pure-Python (using only the standard library) parser for JSON object streams. That's all we need from JQ - the ability to parse a sequence of JSON objects without loading the whole file into memory.

Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory), but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory), but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

Yep.

Or, if compiled PyPI packages work in AWS after all, either work with upstream to integrate our changes, or make our own PyPI package binding jq just for parsing.