NVIDIA/nvcomp

[FEA] New Delta encoding scheme

ser-mk opened this issue · 2 comments

While solving a task of mine, I came across an interesting compression approach for integer sequences with large periodic gaps. It is easier to explain with a plot of the data (my_data):

[40, 56, 41, 60, 88, 90, 82, 2, …]

However, I was unable to find any information on this technique. If someone knows more about the topic, could you perhaps point me to some resources to learn more about it?

The idea is simple: a delta calculation in which overflow wraps around the interval maximum and underflow wraps around the interval minimum.

You get the following integer sequence if you apply general delta encoding to the data above.

(figure: interval A deltas)

The encoded values of interval A have a high bitwidth (8 bits) and can’t be compressed effectively by the BitPacking algorithm. The values [-85, +108, -106] inflate the bitwidth of interval A.
(figure: points P1–P4)

Let’s imagine that values larger than the interval A maximum (the P3 value) overflow like an unsigned char integer:
P3 + 1 = P2, just as 255 + 1 = 0

And values smaller than the interval A minimum (the P2 value) underflow like an unsigned char integer:
P2 - 1 = P3, just as 0 - 1 = 255
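To spell the analogy out, this is exactly the wrap-around behaviour an unsigned char already has in C/C++:

```cpp
#include <cstdio>

int main()
{
  unsigned char minimum = 0;    // analogue of P2, the interval minimum
  unsigned char maximum = 255;  // analogue of P3, the interval maximum

  unsigned char over  = maximum + 1;  // overflows:  255 + 1 wraps to 0
  unsigned char under = minimum - 1;  // underflows: 0 - 1 wraps to 255

  std::printf("%u %u\n", (unsigned)over, (unsigned)under);  // prints "0 255"
  return 0;
}
```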

Therefore, we can recalculate the large deltas (-85, +108, -106) using this property and choose the smaller (wrapped) delta:
delta P1 to P2 = (P3 - P2) - (P1 - P2) + 1 = 108 - 85 + 1 = 24

delta P2 to P3 = (P3 - P2) - (P3 - P2) - 1 = -1 (here we subtract (P3 - P2) + 1 because the plain delta is positive: P3 > P2)

delta P3 to P4 = (P3 - P2) - (P3 - P4) + 1 = 108 - 106 + 1 = 3

(figure: recalculate-min-max)
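To make the recalculation concrete, here is a minimal, hypothetical sketch of the encoder-side delta (an illustration only, not the draft commit linked below). The concrete values for P1..P4 and the interval bounds are assumed; they are chosen only so that the general deltas match the -85 / +108 / -106 example:

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch of the M2 ("MinMax") delta described above. lo/hi are the minimum
// and maximum of the chunk being encoded. A delta that leaves [lo, hi] is
// wrapped around the interval, and we keep whichever representative has the
// smaller magnitude.
static int m2_delta(int prev, int cur, int lo, int hi)
{
  const int span  = hi - lo + 1;   // number of representable values in the interval
  const int plain = cur - prev;    // general delta
  // The wrapped delta differs from the plain delta by exactly one span.
  const int wrapped = (plain > 0) ? plain - span : plain + span;
  return (std::abs(wrapped) < std::abs(plain)) ? wrapped : plain;
}

int main()
{
  // Assumed values, chosen so that the general deltas match the example:
  // P1 -> P2 = -85, P2 -> P3 = +108, P3 -> P4 = -106 (lo = P2, hi = P3).
  const int lo = 0, hi = 108;
  const std::vector<int> pts = {85, 0, 108, 2};   // P1, P2, P3, P4

  for (size_t i = 1; i < pts.size(); ++i)
    std::printf("delta P%zu->P%zu = %d\n", i, i + 1,
                m2_delta(pts[i - 1], pts[i], lo, hi));
  // Prints 24, -1, 3 -- the recalculated deltas from the walkthrough.
  return 0;
}
```

Keeping whichever representative has the smaller magnitude is what keeps the stored deltas inside a narrow range.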

Merging the recalculated deltas [+24, -1, +3] with the other, already small deltas [16, -6, 21, 10, 3, 15, -2, 17, 6] gives:

[16, -6, 21, 10, 3, 15, -2, 17, 6, 24, -1, 3]

These deltas compress much better under bitpacking: after offsetting by the minimum delta (-6), the values span only 31 distinct levels, so the bitwidth is just 5 bits (versus 8 bits before, where the deltas spanned -106 to +108).
This approach improves the cascaded compression of my dataset several times over compared with general delta encoding.
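For completeness, decoding is the inverse: add the stored delta to the previous value and fold the result back into the interval. This works regardless of which representative the encoder picked, because the plain and wrapped deltas are congruent modulo the interval size. The sketch below assumes the chunk's min and max are available to the decoder as metadata; the linked draft may handle this differently:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical decoder for the M2 delta: add the delta, then fold the result
// back into [lo, hi]. Assumes the chunk min/max (lo/hi) are stored alongside
// the deltas as metadata.
static int m2_undelta(int prev, int delta, int lo, int hi)
{
  const int span = hi - lo + 1;
  int v = prev + delta;
  if (v > hi) v -= span;   // overflowed past the maximum: wrap to the low end
  if (v < lo) v += span;   // underflowed past the minimum: wrap to the high end
  return v;
}

int main()
{
  const int lo = 0, hi = 108;
  const std::vector<int> deltas = {24, -1, 3};  // encoder output from the sketch above
  int v = 85;                                   // first value (P1) stored verbatim
  for (int d : deltas) {
    v = m2_undelta(v, d, lo, hi);
    std::printf("%d\n", v);                     // prints 0, 108, 2 (P2, P3, P4)
  }
  return 0;
}
```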

I implemented this algorithm for the NVCOMP library. It is just a draft, without optimizations:
ser-mk@fb6b66b

Also, I compared my delta encoding against general delta encoding on a dataset derived from Fannie Mae’s Single-Family Loan Performance Data, and on some datasets derived from Kaggle:
https://www.kaggle.com/datasets/xhlulu/cpc-codes
https://www.kaggle.com/datasets/giovamata/airlinedelaycauses
https://www.kaggle.com/datasets/gauravduttakiit/covid-19

I ran 63 tests with all of the following option combinations. Each row below lists, in order: the number of Run Length Encodings to perform, the number of Delta Encodings to perform, whether or not to bitpack the final layers, and the size of each chunk of data to decompress independently with Cascaded compression. (A small sketch for enumerating these combinations follows the list.)
0 1 1 512
0 1 1 1024
0 1 1 2048
0 1 1 4096
0 1 1 8192
0 1 1 12288
0 1 1 16384
0 2 1 512
0 2 1 1024
0 2 1 2048
0 2 1 4096
0 2 1 8192
0 2 1 12288
0 2 1 16384
0 3 1 512
0 3 1 1024
0 3 1 2048
0 3 1 4096
0 3 1 8192
0 3 1 12288
0 3 1 16384
1 1 1 512
1 1 1 1024
1 1 1 2048
1 1 1 4096
1 1 1 8192
1 1 1 12288
1 1 1 16384
1 2 1 512
1 2 1 1024
1 2 1 2048
1 2 1 4096
1 2 1 8192
1 2 1 12288
1 2 1 16384
1 3 1 512
1 3 1 1024
1 3 1 2048
1 3 1 4096
1 3 1 8192
1 3 1 12288
1 3 1 16384
2 1 1 512
2 1 1 1024
2 1 1 2048
2 1 1 4096
2 1 1 8192
2 1 1 12288
2 1 1 16384
2 2 1 512
2 2 1 1024
2 2 1 2048
2 2 1 4096
2 2 1 8192
2 2 1 12288
2 2 1 16384
2 3 1 512
2 3 1 1024
2 3 1 2048
2 3 1 4096
2 3 1 8192
2 3 1 12288
2 3 1 16384
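For reference, the 63 combinations above are just the Cartesian product of 3 RLE counts, 3 delta counts, and 7 chunk sizes, with bitpacking always enabled. A minimal sketch for enumerating them (the struct is a local stand-in for illustration, not nvcomp's actual options type):

```cpp
#include <cstdio>
#include <vector>

// Local stand-in for the cascaded options used in the tests; the real nvcomp
// options type may differ between library versions.
struct CascadedTestOpts {
  int num_RLEs;     // number of Run Length Encodings to perform
  int num_deltas;   // number of Delta Encodings to perform
  int use_bp;       // whether or not to bitpack the final layers
  int chunk_size;   // size of each chunk of data handled independently
};

int main()
{
  std::vector<CascadedTestOpts> tests;
  const int chunk_sizes[] = {512, 1024, 2048, 4096, 8192, 12288, 16384};
  for (int rle = 0; rle <= 2; ++rle)
    for (int delta = 1; delta <= 3; ++delta)
      for (int chunk : chunk_sizes)
        tests.push_back({rle, delta, /*use_bp=*/1, chunk});

  std::printf("%zu combinations\n", tests.size());  // 3 * 3 * 7 = 63
  return 0;
}
```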

Each combination was run with general Delta Encoding and with Delta Encoding using my algorithm, which I call MinMax Delta mode, or M2 delta mode for short.
The results were very curious. Here is the top 10:

| Dataset | Average difference ratio, compressed M2 delta vs. general delta (all 63 tests) | Maximum difference ratio (all 63 tests) |
| --- | --- | --- |
| CooperativePatent_csv_5 | -18.12% | -54.88% |
| Performance_2001Q3_txt_1_0_26 | -15.92% | -51.22% |
| Performance_2000Q2_txt_0_26 | -16.22% | -42.01% |
| DelayedFlights_csv_28_add_1 | -8.43% | -37.85% |
| DelayedFlights_csv_28_add_0 | -8.14% | -33.04% |
| Performance_2001Q3_txt_0_1_26 | -13.28% | -32.68% |
| true_car_listings_csv_7 | -19.15% | -32.28% |
| Performance_2001Q2_txt_0_1_26 | -8.18% | -30.38% |
| Performance_2001Q3_txt_1_1_26 | -11.60% | -28.01% |
| Performance_2001Q4_txt_0_1_26 | -10.18% | -26.44% |

The full test log is here: https://github.com/ser-mk/nvcomp_datasets/blob/main/logs/results-2-pretty.csv

- There are datasets with a maximum difference ratio of more than 50%.
- 11 datasets have a maximum difference ratio of more than 25%.
- ~30% of the datasets have a maximum difference ratio of more than 10%.
- ~60% of the datasets have a maximum difference ratio of more than 2%.

I didn’t run the library's benchmark tests because the algorithm implementation is not optimized. If the community is interested in this approach, I will prepare an improved version. But I want to point out that M2 delta is likely to be slower than general delta encoding.

P. S.
I’m relatively new to data compression and GPUs, so I would like to know your opinion on this approach.

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.