NVIDIA/nvcomp

[QST] Is it available BITCOMP decompression for cpu?

andreamartini opened this issue · 4 comments

In the same way of deflate, zstandard, lz4 decompression, with which it is possible to decompress on cpu something compressed via GPU and nvcomp (i.e. using low level approach), i am interested in BITCOMP cpu decompression. That is, compress on GPU using nvcomp and BITCOMP, and then decompress on CPU (using BITCOMP or, maybe, other algorithm). If yes, is there some links where i can take a look?
Thank you.

Andrea,
yes you can use bitcomp on CPU, but you'll have to use the bitcomp native API as nvcomp doesn't expose this feature (and the lossy FP compression). The native API might also save some overhead and be a little faster.

Bitcomp's native API works with a handle, the idea is to create a plan (data type, size, algorithm) and then this plan can be use over and over to compress or decompress data of the same size and type, on GPU (completely asynchronously, all the info needed to launch kernels is in the handle) or on CPU.
The CPU API is single threaded so it's going to be slow, but decompression can be easily parallelized by using the partial decompression. For example, a 1GB buffer can be parallelized on 10 threads, each thread performing a partial decompression of a 100MB chunk, with you choice of parallelism (OpenMP, pthreads, ...)

Check the API in native/bitcomp.h.

Here is an example of mixed compression and decompression on GPU and CPU using the native API.
(BTCHK and CUCHK are just macros to check the returned values for Bitcomp and CUDA calls)

    // Including the native bitcomp include file instead of nvcomp's
    #include <native/bitcomp.h>
    <...>

     // All buffers allocated as managed memory to be visible on CPU and GPU
    CUCHK(cudaMallocManaged((void **)&data, size));
    CUCHK(cudaMallocManaged((void **)&decomp, size));
    <...> // Input data init

    // Creating a bitcomp handle to compress floating point data
    // For lossless compression, we treat the FP32 data as unsigned int (same 32b size)
    bitcompHandle_t handle;
    BTCHK(bitcompCreatePlan(&handle,
                            size,
                            BITCOMP_UNSIGNED_32BIT,
                            BITCOMP_LOSSLESS,
                            BITCOMP_DEFAULT_ALGO));

    // Query the maximum output size to allocate the compressed buffer
    size_t lcomp = bitcompMaxBuflen(size);
    printf("Max compressed buffer size (%lu) = %lu\n", size, lcomp);
    void *output;
    CUCHK(cudaMallocManaged(&output, lcomp));

    // Launch the compression (asynchronous) on GPU
    // The handle could be associated with another stream if needed.
    BTCHK(bitcompCompressLossless(handle, data, output));

    // Wait for the compression to finish before querying the compressed size
    CUCHK(cudaDeviceSynchronize());

    size_t compsize;
    BTCHK (bitcompGetCompressedSize (output, &compsize));
    float ratio = (float)size / (float)compsize;
    printf ("Compressed (GPU) to %lu bytes, ratio = %.1f\n", compsize, ratio);

    // Decompress the data on CPU
    BTCHK(bitcompHostUncompress(handle, output, decomp));

    // Decompressed data should be the same as the original (lossless)
    for (int i=0; i<n; i++)
        if (decomp[i] != data[i])
        {
            printf ("Error, decompressed data does not match\n");
            return -1;
        }
    printf ("Decompressed on CPU, matches original data\n");

    // Compress the data on CPU
    BTCHK(bitcompHostCompressLossless(handle, data, output));
    BTCHK (bitcompGetCompressedSize (output, &compsize));
    ratio = (float)size / (float)compsize;
    printf ("Compressed (CPU) to %lu bytes, ratio = %.1f\n", compsize, ratio);

    // Decompress the data on GPU (asynchronous)
    BTCHK(bitcompUncompress(handle, output, decomp));
    
    // Wait for the decompression to finish
    CUCHK(cudaDeviceSynchronize());
    

Managed memory is used here to make the buffers visible on CPU and GPU, just to mix CPU and GPU for the example.

Bitcomp can compress over PCIe (or nvlink) as well, for example with the uncompressed data in GPU memory, and the output buffer in host pinned memory. It works well if the compression is high enough for the PCIe bus to not be a bottleneck.
If you want to do that, you should also call this to speed up the compression (it allocates a tiny bit of memory locally on the device to cache some metadata of the output buffer. The handle becomes "optimized" for the current device and can't be used for several compression in parallel in different streams, but that doesn't apply to CPU calls, they won't even use this cache).

   * @brief Turn on compression acceleration when the compressed output is not in the global memory
   * of the device running the compression (e.g. host pinned memory, or another device's memory)
   * This is optional and only affects the performance.
   * NOTE: This makes the handle become device-specific. A plan that has this acceleration turned on
   * should always be used on the same device.
   * 
   * @param handle (input) Bitcomp handle
   * @return Returns BITCOMP_SUCCESS if successful, or an error
   */
  bitcompResult_t bitcompAccelerateRemoteCompression(bitcompHandle_t handle);

Hi gthomascollignon, thank you for your reply. Where i can find native bitcomp repository? I'm having some trouble finding it through google engine.

I found https://github.com/boyuanzhang62/bitcomp_lossless_example/blob/master/bitcomp_example.cu
I suppose it needs cuda and graphic board cuda compliant. Isn't true?

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.