lossless data compression

Question

lossless data compression

deathfromheaven opened this issue 3 years ago · 4 comments

Hello，I want to know the name of the lossless data compression you use in the OpenHistorian and how to find it.

Answer 1 · 2022-03-03T15:32:23.000Z

Technically the openHistorian, being based on SnapDB, can use any compression algorithm.

The default one is designed to do a good job with a variety of streaming time-series data sources, we casually refer to this as Time-Series Special Compression (or TSSC) - but there are variations of this, e.g., the one used in STTP. Most of the compression algorithms were developed by @StevenChisholm.

For the the current implementation in openHistorian, you can find the code here, check the Encode and Decode methods:

C#: HistorianStreamEncoding.cs
Python: historianKeyValueEncoder.py

Answer 2 · 2022-03-21T08:16:27.000Z

Where can I find the specific information of the TSSC algorithm or how can I learn this algorithm. Is there any literature which you recommend about this algorithm？

Answer 3 · 2022-03-21T14:07:08.000Z

openHistorian can archive any time-series type data in a streaming fashion, so the goal is to apply general purpose streaming compression rules to specific data elements, longitudinally, and based on the nature of the data, produce good compression ratios with minimal CPU impact.

The TSSC algorithm used by openHistorian is more simple than the one used by STTP in the number of measurement states that is maintained. When the IEEE 2664 standard is released, it will contain a section and an appendix describing TSSC in greater detail.

In general, TSSC takes each of the elements of a time-series measurement and handles each data type with separate compression algorithms, creating parallel compression streams for each data element in the measurement. The nature of the data element being compressed then infers the necessary compression algorithm tuning to produce the best results.

For archived data, timestamps will be near each other, normally varying by no more than a few seconds. For the 64-bit timestamps, this means the data variation may only occur in the bottom 16 of the total 64 bits of the timestamp. With the bulk of the bits repeating invariably, the total bit set needs to be only archived once or on substantial change, then only the changing bits need to be archived.

Additionally, if the timestamps vary less, the algorithm can automatically adjust and archive even fewer bits. This same type of pattern works well for identification numbers, which are finite in number, and state flags, which vary little. Data values, however, need special attention.

Data value elements for a given measurement can be rather random, however, many values change slowly over time, from measurement to measurement. For example, a measured frequency value tends to change only incrementally over several data measurements; in fact, other frequencies in the same subscription may only differ by just a few bits. With this knowledge a few extra opcodes can be maintained that represent the unvarying bits of many types of measurements. Now only the changed bit values need to be encoded into the archive so that the stream can be reinflated without loss upon reception.

Answer 4 · 2022-03-24T07:54:17.000Z

Thank you very much！I got it.