nvCOMP is a CUDA library that features generic compression interfaces to enable developers to use high-performance GPU compressors and decompressors in their applications.
nvCOMP 2.0.0 includes Cascaded, LZ4, and Snappy compression methods. It also adds support for the external Bitcomp and GDeflate methods. Cascaded compression methods demonstrate high performance with up to 500 GB/s throughput and a high compression ratio of up to 80x on numerical data from analytical workloads. Snappy and LZ4 methods can achieve up to 100 GB/s compression and decompression throughput depending on the dataset, and show good compression ratios for arbitrary byte streams.
Below are compression ratio and performance plots for three methods available in nvCOMP (Cascaded, Snappy and LZ4). Each column shows results for a single column from an analytical dataset derived from Fannie Mae’s Single-Family Loan Performance Data. The numbers were collected on an NVIDIA A100 80GB GPU (with ECC on).
nvCOMP 2.0.0 features new flexible APIs:
- The low-level API targets advanced users: metadata and chunking must be handled outside of nvCOMP. The low-level APIs perform batch compression/decompression of multiple streams, and they are lightweight and fully asynchronous.
- The high-level API is provided for ease of use: metadata and chunking are handled internally by nvCOMP, which makes it the easiest way to start using nvCOMP in applications. Some of the high-level APIs are synchronous, so for best performance and flexibility the low-level APIs are recommended.
Please note that in nvCOMP 2.0.0 some compressors are available only through the low-level API or only through the high-level API.
Below you can find instructions on how to build the library and reproduce our benchmarking results, a guide on how to integrate nvCOMP into your application, and a detailed description of the compression methods. Enjoy!
This release of nvCOMP introduces new interfaces and compression methods.
- Cascaded compression requires a large amount of temporary workspace to operate. The current workaround is to compress/decompress large datasets in pieces, reusing the temporary workspace for each piece.
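The piecewise workaround can be illustrated with a minimal sketch. This is not nvCOMP code: Python's zlib is used as a CPU stand-in for the compressor, the function names `compress_in_pieces` and `decompress_pieces` are hypothetical, and the `workspace` buffer here is only a host-side staging area that mimics how a single temporary allocation would be reused for every piece.

```python
import zlib


def compress_in_pieces(data, piece_size, workspace):
    """Compress `data` one piece at a time, staging each piece in one shared buffer.

    zlib is a stand-in compressor (assumption); nvCOMP itself is not called here.
    """
    if piece_size <= 0 or piece_size > len(workspace):
        raise ValueError("piece_size must be positive and fit in the workspace")
    view = memoryview(workspace)
    compressed = []
    for start in range(0, len(data), piece_size):
        piece = data[start:start + piece_size]
        # Reuse the same scratch buffer for every piece instead of allocating a new one.
        view[:len(piece)] = piece
        compressed.append(zlib.compress(view[:len(piece)]))
    return compressed


def decompress_pieces(pieces):
    """Decompress each piece independently and concatenate the results."""
    return b"".join(zlib.decompress(p) for p in pieces)


if __name__ == "__main__":
    data = bytes(range(256)) * 100
    workspace = bytearray(4096)  # sized for one piece, not for the whole dataset
    pieces = compress_in_pieces(data, 4096, workspace)
    assert decompress_pieces(pieces) == data
    print(len(pieces), "pieces")
```

The point of the pattern is that the workspace size is bounded by the piece size, so peak temporary memory no longer grows with the dataset.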
Pascal (sm60) or higher GPU architecture is required. Volta+ GPU architecture is recommended for best results.
To configure nvCOMP extensions, define the NVCOMP_EXTS_ROOT variable so that CMake can find the library.
First, download the nvCOMP extensions from the nvCOMP Developer Page. There are two available extensions:
- Bitcomp
- GDeflate
git clone https://github.com/NVIDIA/nvcomp.git
cd nvcomp
mkdir build
cd build
cmake -DNVCOMP_EXTS_ROOT=/path/to/nvcomp_exts/${CUDA_VERSION} ..
make -j4
nvCOMP uses CMake for building. Generally, it is best to do an out-of-source build:
git clone https://github.com/NVIDIA/nvcomp.git
cd nvcomp
mkdir build
cd build
cmake ..
make -j
If you're building with CUDA 10 or earlier, you will need to specify the path to a copy of CUB (version 1.9.10 or later) on your system.
cmake -DCUB_DIR=<path to cub repository>
To obtain TPC-H data:
- Clone and compile https://github.com/electrum/tpch-dbgen
- Run ./dbgen -s <scale factor>, then grab lineitem.tbl
To obtain Mortgage data:
- Download any of the archives from https://rapidsai.github.io/demos/datasets/mortgage-data
- Unpack and grab perf/Performance_<year><quarter>.txt, e.g. Performance_2000Q4.txt
Convert CSV files to binary files:
- benchmarks/text_to_binary.py is provided to read a .csv or text file and output a chosen column of data into a binary file
- For example, run python benchmarks/text_to_binary.py lineitem.tbl <column number> <datatype> column_data.bin '|' to generate the binary dataset column_data.bin for TPC-H lineitem column <column number> using <datatype> as the type
- Note: make sure that the delimiter is set correctly; the default is ,
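For readers who want to see what this conversion step amounts to, here is a minimal sketch of extracting one column from a delimited text file into raw binary values. It is not the implementation of benchmarks/text_to_binary.py: the function name `column_to_binary` and its parameters are hypothetical, and it uses the type codes of Python's standard `array` module (for example "q" for 64-bit signed integers) rather than the script's own datatype names.

```python
from array import array


def column_to_binary(text_path, column, typecode, out_path, delimiter=","):
    """Write one column of a delimited text file as raw binary; return the value count.

    Hypothetical helper for illustration; rows with a missing or empty field are skipped.
    """
    convert = float if typecode in ("f", "d") else int
    values = array(typecode)
    with open(text_path) as src:
        for line in src:
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) > column and fields[column] != "":
                values.append(convert(fields[column]))
    with open(out_path, "wb") as dst:
        values.tofile(dst)  # native byte order, no header
    return len(values)
```

Calling `column_to_binary("lineitem.tbl", 0, "q", "column_data.bin", "|")` would, under these assumptions, write column 0 as 64-bit integers. Passing the wrong delimiter leaves each line as a single field, which is why the delimiter note above matters.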
Run tests:
- Run ./bin/benchmark_cascaded_auto or ./bin/benchmark_lz4 with -f column_data.bin <options> to measure throughput.
Below are some example benchmark results on an RTX 3090 for the Mortgage 2000Q4 column 0:
$ ./bin/benchmark_cascaded_auto -f ../../nvcomp-data/perf/2000Q4.bin -t long
----------
uncompressed (B): 81289736
comp_size: 2047064, compressed ratio: 39.71
compression throughput (GB/s): 225.60
decompression throughput (GB/s): 374.95
$ ./bin/benchmark_lz4 -f ../../nvcomp-data/perf/2000Q4.bin
----------
uncompressed (B): 81289736
comp_size: 3831058, compressed ratio: 21.22
compression throughput (GB/s): 36.64
decompression throughput (GB/s): 118.47
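As a quick sanity check on the output above, the reported ratio appears to be the uncompressed size divided by the compressed size. The short sketch below recomputes it from the printed byte counts; the function name `compression_ratio` is hypothetical, and the formula is an assumption inferred from the numbers rather than taken from the benchmark source.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Ratio of original size to compressed size (assumed definition)."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return uncompressed_bytes / compressed_bytes


# Byte counts copied from the benchmark output above.
cascaded = compression_ratio(81289736, 2047064)
lz4 = compression_ratio(81289736, 3831058)
print(f"cascaded: {cascaded:.2f}")  # prints "cascaded: 39.71"
print(f"lz4: {lz4:.2f}")            # prints "lz4: 21.22"
```

Both values match the ratios printed by the benchmarks, which supports reading "compressed ratio" as original bytes per compressed byte.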