szcompressor/cuSZp

Will the new GSZ compressor be available for testing?

Closed this issue · 9 comments

I saw the link regarding the new GSZ that will be presented at SC24. It looks very useful, so I wanted to ask whether the code will be posted here for use and testing. I'm working on compressing LLM activations, and it seems GSZ might be an ideal fit.
Thanks for any info!

Hi, sorry for this late reply, and thanks for your interest.

I just pushed the code of GSZ in this repo [LINK]. Also, "GSZ" is a temporary name used in the paper submission (due to the SC24 double-blind policy). After discussion, we decided to keep the name cuSZp for this compressor (in the camera-ready version of the SC24 paper it will be renamed cuSZp2 to distinguish it from the other baseline compressors). As a result, a more complete version, including a README and explanations of the cuSZp2/GSZ code, will be pushed to this repository.

Here is some information about the repository: [LINK].

Differences between cuSZp2/GSZ and cuSZp

  • Higher Throughput: cuSZp2/GSZ has much higher throughput than cuSZp. Especially when compressing floating-point data larger than 1 GB, cuSZp2/GSZ can achieve around 300~500 GB/s throughput for both compression and decompression. This comes from optimized memory access patterns and latency control. BTW, based on my testing and the algorithm, cuSZp2/GSZ should give very promising results for sparse datasets (lots of 0 data points).
  • Higher Compression Ratio: cuSZp2/GSZ supports two lossless encoding modes, plain-fixed-length-encoding and outlier-fixed-length-encoding, whereas cuSZp only supports plain-fixed-length-encoding.
    • For plain-fixed-length-encoding, it will have the same compression ratios as cuSZp.
    • For outlier-fixed-length-encoding, it will have higher compression ratios than plain-fixed-length-encoding, at the cost of a slight decrease in throughput (but still much faster than cuSZp). Based on my testing, the compression ratio gain will be higher if the datasets are smooth (i.e. lots of values are consecutive, such as an array with value [..., 1.4, 1.6, 1.7, 2.5, 1.8, ....]).
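To make the difference between the two modes a bit more concrete, here is a minimal C++ sketch. It is not taken from the cuSZp2/GSZ source: the quantization step (2 × errorBound), the delta against the previous value, and the rule of treating the single largest-magnitude delta in a block as the outlier are all assumptions made for illustration, and every function name is hypothetical. It only compares how many magnitude bits each non-outlier value would need.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <vector>

// Number of bits needed to store a magnitude (0 needs 0 bits).
int bitWidth(uint32_t m) {
    int bits = 0;
    while (m != 0) { ++bits; m >>= 1; }
    return bits;
}

// ASSUMPTION: quantize with step 2*errorBound, then store each value as the
// difference from the previous quantized value (the first one against 0).
std::vector<int32_t> quantizedDeltas(const std::vector<float>& block, double errorBound) {
    assert(errorBound > 0.0);
    std::vector<int32_t> deltas;
    deltas.reserve(block.size());
    int32_t prev = 0;
    for (float v : block) {
        int32_t q = static_cast<int32_t>(
            std::lround(static_cast<double>(v) / (2.0 * errorBound)));
        deltas.push_back(q - prev);
        prev = q;
    }
    return deltas;
}

// Magnitudes of all deltas, largest first.
std::vector<uint32_t> sortedMagnitudes(const std::vector<int32_t>& deltas) {
    std::vector<uint32_t> mags;
    for (int32_t d : deltas) mags.push_back(static_cast<uint32_t>(std::abs(d)));
    std::sort(mags.begin(), mags.end(), std::greater<uint32_t>());
    return mags;
}

// Plain mode (sketch): every value uses the width of the largest magnitude.
int plainBits(const std::vector<int32_t>& deltas) {
    std::vector<uint32_t> mags = sortedMagnitudes(deltas);
    return mags.empty() ? 0 : bitWidth(mags[0]);
}

// Outlier mode (sketch, ASSUMPTION): the largest magnitude is stored
// separately, so the rest use the width of the second-largest magnitude.
int outlierBits(const std::vector<int32_t>& deltas) {
    std::vector<uint32_t> mags = sortedMagnitudes(deltas);
    return mags.size() < 2 ? 0 : bitWidth(mags[1]);
}
```

For a hypothetical block {10.0, 10.1, 10.2, 10.1, 10.3} with error bound 0.05, the deltas are {100, 1, 1, -1, 2}, so plainBits gives 7 while outlierBits gives 2. In this toy model the overall saving also depends on how the outlier itself is stored, which the sketch does not cover.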

To compile and use cuSZp2/GSZ as executable binary

This can be found in README in this repo [LINK]. The compression/decompression kernel implementations can be found in file /src/GSZ.cu.

To use cuSZp2/GSZ as an internal API.

  1. Include header file ./include/GSZ_entry.h
  2. Compression and Decompression function:
void GSZ_compress_deviceptr(float* d_oriData, unsigned char* d_cmpBytes, size_t nbEle, size_t* cmpSize, float errorBound, cudaStream_t stream = 0);
void GSZ_decompress_deviceptr(float* d_decData, unsigned char* d_cmpBytes, size_t nbEle, size_t cmpSize, float errorBound, cudaStream_t stream = 0);

Note that these two functions assume original and compressed data arrays are both on GPU.
  3. An example of using them can be found in file ./examples/example_gpu_api.cpp, and the key part is shown below (where timer_GPU denotes a strict end-to-end measurement using CUDA events):

// GSZ compression.
timer_GPU.StartCounter(); // set timer
GSZ_compress_deviceptr(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, stream);
float cmpTime = timer_GPU.GetCounter();
        
// GSZ decompression.
timer_GPU.StartCounter(); // set timer
GSZ_decompress_deviceptr(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, stream);
float decTime = timer_GPU.GetCounter();
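From the values in the snippet above, the compression ratio and throughput can be derived with a couple of host-side helpers. This is a sketch under two assumptions that are not stated in this thread: that cmpSize is the compressed size in bytes, and that timer_GPU.GetCounter() returns milliseconds (as cudaEventElapsedTime does). The helper names are hypothetical and are not part of the GSZ API.

```cpp
#include <cassert>
#include <cstddef>

// Ratio of original bytes to compressed bytes.
// ASSUMPTION: cmpSize is a byte count.
double compressionRatio(std::size_t nbEle, std::size_t cmpSize) {
    assert(cmpSize > 0);
    return static_cast<double>(nbEle * sizeof(float)) / static_cast<double>(cmpSize);
}

// End-to-end throughput in GB/s (1 GB = 1e9 bytes) over the original data.
// ASSUMPTION: elapsedMs is in milliseconds.
double throughputGBps(std::size_t nbEle, double elapsedMs) {
    assert(elapsedMs > 0.0);
    double gigabytes = static_cast<double>(nbEle * sizeof(float)) / 1e9;
    return gigabytes / (elapsedMs / 1e3);
}
```

With the variables from the example, this would be called as compressionRatio(nbEle, cmpSize) and throughputGBps(nbEle, cmpTime) or throughputGBps(nbEle, decTime).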

How to use outlier-fixed-length-encoding mode.

In this repo [LINK], the README only covers running the program with plain-fixed-length-encoding. If you'd like to try outlier-fixed-length-encoding, what you can do is:

  1. Go to file: /src/GSZ_entry.cu.
  2. Replace line 162 with GSZ_compress_kernel_outlier<<<gridSize, blockSize, sizeof(unsigned int)*2, stream>>>(d_oriData, d_cmpBytes, d_cmpOffset, d_locOffset, d_flag, errorBound, nbEle); (that is, just change "plain" to "outlier" in the kernel name).
  3. Replace line 209 with GSZ_decompress_kernel_outlier<<<gridSize, blockSize, sizeof(unsigned int)*2, stream>>>(d_decData, d_cmpBytes, d_cmpOffset, d_locOffset, d_flag, errorBound, nbEle); (again, just change "plain" to "outlier" in the kernel name).
  4. Recompile and run the binary gpu_api_GSZ following the instructions in the README.

Later actions about cuSZp2/GSZ code.

Right now cuSZp2/GSZ is in the SC24 AD/AE procedure, so this repository may be updated from time to time. Even if this repo [LINK] receives future commits, the instructions above apply to the 07.21.2024 commit. In all, what I will do for cuSZp2/GSZ in the future is listed below:

  • Within July this year, more detailed descriptions and execution support (e.g. double-precision support) will be added to the repository [LINK].
  • Within Aug this year, the complete code with a detailed manual will be pushed to this repo (i.e. cuSZp). In other words, the compressor cuSZp2/GSZ will be merged into cuSZp.

Thanks again for your interest, and sorry for the confusion about the "compressor name" I caused lol. Please let me know if there are more questions/bugs while using cuSZp2/GSZ :)

Hi @hyfshishen - thanks very much for the link and update!
I was able to try running cuSZp2 today but found that, at least on H100, all compression attempts resulted in:

./gsz_p vy.f32 1e-3
Segmentation fault (core dumped)

I tried all 6 HACC files, as well as jetin, to the same effect.
I'm using CUDA 12.2 for the build and, as noted, running on H100. No errors during the build process.

I think I'll try building it with outlier encoding next, just as an alternative. I'm not sure whether I should raise an issue on the other (main) repo and track it there?
I guess my main question is whether you have had a chance to run gsz_p on H100 or only on A100? (It shouldn't really matter, but it would at least help start isolating the issue.)
Hope you are doing well!

Hi @hyfshishen - rebuilt to use outlier-fixed-length-encoding.
It seems to get a lot further before the segmentation fault, but ultimately everything ends the same way:

./gsz_p jicf_q_1408x1080x1100_float32.raw 1e-2
Segmentation fault (core dumped)

I think I'll try to analyze it a bit to see what might be going wrong here but please let me know if you have any recommendations and if you have been able to test on H100?

Hi @hyfshishen - fyi, I was able to test on A100 with Jane and we see the same segmentation fault. I now suspect this could be a CUDA version difference? Tested with 12.2 (H100) and 12.0 (A100), but I think you were running on 11.8?

Hi @lessw2020, sorry for this late reply, I was on vacation for the last several days. I just had a quick test on my side. I think the segmentation fault is due to the data padding in the kernel preparation phase. I built and executed cuSZp2/GSZ successfully with CUDA 11.2 and 11.4, but from 11.6 onwards I also hit segmentation faults / kernel execution failures. I will work on it and fix it tonight. Will provide an update very soon :)

Hi @lessw2020, I just checked repo and tried to reproduce the bug. The environment I use is on a national lab cluster, so the CUDA version is limited (only 11.2, 11.4, 11.8, and 12.3 are supported). The GPU is A100 (40GB).

On my side, 11.2, 11.4, and 11.8 all work fine. I also hit a kernel launch failure with 12.3. I checked the details, and it is due to "Kernel launch failed: the provided PTX was compiled with an unsupported toolchain". So I just made an update to the folder, including:

  • Adding "80" and "86" CUDA architectures in CMakeLists.txt.
  • Replacing PTX inline code in compression kernel with normal C-style code.

You can compile it by executing python 0-compilation.py (still within this folder). The generated executable binary can be found in install/bin/gsz-p or install/bin/gsz-o, referring to plain- and outlier- fixed-length encoding modes.

Could you please try this when you are available? Btw, I think using 11.X will very likely resolve this issue. If your environment is based on 12.X and the bug still exists, I will dig into it in more detail on my home PC (I have sudo on that). Thanks for your patience :)

Hi @hyfshishen - thanks for the fast update!
I was able to get your changes and rebuild and test with hacc - unfortunately I still get the same seg fault as before.
I did also try dropping the lower arches (i.e. only 80 and higher, and added 90 90a) but that did not help.
Would you be able to test on 12.x as I suspect that's the current blocker?
*for reference, most co's that I am aware of are all on 12.x since 12.0 was released in Dec 22, so almost two years ago.
Thanks much!

Hi @lessw2020, thanks for the reply. Sure! I will test under CUDA 12.x on my side today. Will provide an update very soon 😃

We have discussed this issue somewhere else, and I will set this issue as closed. Thanks for your interest. The arranged version of GSZ/cuSZp2 will be updated in this repo during SC'24 conference.