What is the form of the output?

Question

What is the form of the output?

papb opened this issue 3 years ago · 3 comments

Hello! I noticed that this module returns an array of numbers.

What can I expect of the sizes of each number? The largest I got was 303 on a small example, and the smallest was 34. Is there a guaranteed lower bound? And upper bound?
What can I expect of the length of this array?

Answer 1 · 2021-05-21T12:35:08.000Z

@papb - It looks like you might have confused the compression implemented in this library for something else. This library is useful for applying the popular Lempel–Ziv–Welch lossless compression algorithm on large JSON objects, so that it can be persisted into NoSQL databases or into the IndexedDB on the browser (for offline use).

As mentioned on the README, the output of the pack() call cannot be consumed directly - but if you persist this "packed" output instead of the original JSON into a NoSQL database, you should see the difference in the space utilization.

When you want to consume the "packed", large JSON object - you will then retrieve it from your persistence layer and the call unpack() from the library.

Does that help?

Answer 2 · 2021-05-21T18:57:52.000Z

Hello, thanks for the fast reply! I already knew what you said, sorry for not being clear on the question. I don't have any knowledge on how the LZW algorithm works - what I know is that it is a compression algorithm. The fact that the output of this algorithm consists of an array of numbers was something I didn't know, and learned by running this module. Now I want to know what can I expect about the bounds in these numbers and the array length, in order to think better on the persistence layer that I will use. For example, if I have a way to persist an array of numbers from 0 to 1023, can I use it? Or is it possible that this algorithm generates a number above 1023 at some point? I couldn't find this in a quick look into Wikipedia on LZW so I thought I'd ask here. I also looked at the source code to see if I could deduce something but couldn't.

Answer 3 · 2021-05-21T21:29:37.000Z

I don't think we can definitively say what the upper bound is, for the numbers in the "compressed" array - and the reason it is like that is because the numbers will depend on the specific input passed into the core LZW compress function (which takes a random string and converts it into an array).

At the core (as can be seen on the function), the LZW algorithm starts with the ASCII character set loaded into a dictionary/map. And during the processing of a given input, it will start accumulating string segments that are repeated within.

Here is an example input of "awesome awesome", that is being processed by this core compress function along with the data structure within the function visualized as it is processed: Link to JavaScript Visualizer in PythonTutor.com

As you can see, for this input - the numbers go all the way up to 260 - hopefully this gives some insight into what you are dealing with.