NVIDIA/nvcomp

[FEA] Requesting "fall back" CPU implementations of certain decompression algorithms

benkirk opened this issue · 13 comments

Is your feature request related to a problem? Please describe.
No problems...

Describe the solution you'd like
It would be very beneficial to have CPU-only implementations of nvCOMP decompression algorithms. My limited understanding is that some compressed streams created by nvCOMP are not 'compliant' byte streams, meaning, for example, that a standard CPU-based LZ4 decompressor may not be able to handle nvCOMP-compressed streams due to additional information embedded in the stream (or something along those lines).

Additional context
NCAR recently became aware of the power of nvCOMP at an Open Hackathons event. Coupled with GPUDirect Storage (GDS), this could be a very powerful tool for the Earth Sciences community. In order to deploy nvCOMP in this setting, however, it is imperative that any compressed files created through nvCOMP be usable on any machine, in particular machines with no GPUs whatsoever. Thus, an enabling capability for this use would be the ability for nvCOMP to decompress its own streams on systems without GPUs.

Describe alternatives you've considered
It could be sufficient for nvCOMP to call pre-existing CPU decompressors, after it has handled any implementation-specific stream issues.

Hi! We provide examples of using CPU LZ4 compression and decompression interoperating with the nvCOMP GPU LZ4 decompression and compression. Are these not working for you? Could you be referring to nvCOMP's high-level interface, which has a bit more metadata embedded in it? We also have examples for interoperation with CPU Deflate in both directions, and although we only seem to have an example for one direction for GDeflate, it appears to support both directions. Were you wanting other compressors to have a CPU fallback?
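
For the low-level batched path, the CPU-to-GPU direction really is just producing plain LZ4 blocks on the host. A minimal sketch with mainline liblz4 (the single-chunk setup and buffer names here are illustrative, not taken from our examples):

```cpp
// Sketch only: compress one chunk on the CPU with mainline liblz4, producing a
// raw LZ4 block of the kind a batched GPU-side LZ4 decompressor consumes.
#include <lz4.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<char> chunk(64 * 1024, 'x');  // one uncompressed chunk (illustrative)
  std::vector<char> compressed(LZ4_compressBound(static_cast<int>(chunk.size())));

  const int compressed_bytes = LZ4_compress_default(
      chunk.data(), compressed.data(),
      static_cast<int>(chunk.size()), static_cast<int>(compressed.size()));
  if (compressed_bytes <= 0) {
    std::fprintf(stderr, "LZ4 compression failed\n");
    return 1;
  }

  // compressed[0..compressed_bytes) is a plain LZ4 block. Note that a raw block
  // does not record the uncompressed size; that has to be stored separately.
  std::printf("compressed %zu -> %d bytes\n", chunk.size(), compressed_bytes);
  return 0;
}
```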

We'd definitely be interested in a CPU-only decode path through the high-level interface, if at all possible. Since the additional metadata makes the stream unusable by another decompressor, this feature could be enabling.

As for which compressors, it is a bit of an open question until we know more about the performance of nvCOMP's algorithms for our specific climate/weather data sets. After some discussions at the Hackathon, we expect Zstd and Bitcomp both to do well, but again we'd need some more testing on our side.

@benkirk I think that the appropriate place for this may be in the HDF5 filter: the filter could detect whether the nvCOMP library and a GPU are present and, if not, call the appropriate CPU method. But again, this requires that nvCOMP-compressed files can be decompressed using the CPU-based implementation.
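
Roughly what I have in mind for the dispatch, as a sketch (the gpu_decompress_chunk/cpu_decompress_chunk helpers are hypothetical stand-ins for the nvCOMP and CPU code paths, not real entry points):

```cpp
// Sketch of runtime GPU detection inside an HDF5 filter callback.
// Only the dispatch is shown; the two *_decompress_chunk helpers below are
// hypothetical placeholders for the real nvCOMP and CPU implementations.
#include <hdf5.h>
#include <cuda_runtime.h>

static size_t gpu_decompress_chunk(size_t nbytes, size_t *buf_size, void **buf) {
  (void)buf_size; (void)buf; return nbytes;  // real nvCOMP-based path would go here
}
static size_t cpu_decompress_chunk(size_t nbytes, size_t *buf_size, void **buf) {
  (void)buf_size; (void)buf; return nbytes;  // real CPU (e.g. liblz4) path would go here
}

static bool gpu_available() {
  int count = 0;
  return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}

// H5Z_func_t callback: HDF5 calls this for both compression and decompression.
static size_t nvcomp_h5_filter(unsigned flags, size_t cd_nelmts,
                               const unsigned cd_values[], size_t nbytes,
                               size_t *buf_size, void **buf) {
  (void)cd_nelmts; (void)cd_values;
  if (flags & H5Z_FLAG_REVERSE) {  // read path: decompress
    return gpu_available() ? gpu_decompress_chunk(nbytes, buf_size, buf)
                           : cpu_decompress_chunk(nbytes, buf_size, buf);
  }
  return 0;  // write path omitted from this sketch; returning 0 signals failure to HDF5
}
```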

@jedwards4b Would it not make sense to use the nvCOMP low-level API to build a batched HDF5 filter instead of using the nvCOMP high-level interface? The low-level interface does not add any additional metadata and is standards-compliant, so you could use an existing LZ4 CPU filter to read an HDF5 file that was generated using the nvCOMP LZ4 filter based on the low-level API. The main requirement here would be to have the ability to (de)compress batches of chunks.
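
Concretely, once the per-chunk compressed bytes and uncompressed sizes are stored in the file, reading a chunk back on a machine without a GPU should reduce to a single call into mainline liblz4. A sketch, assuming those sizes come from the file's metadata:

```cpp
// Sketch: CPU-side decode of one chunk stored as a raw LZ4 block (as the
// batched low-level LZ4 path produces). The compressed bytes and the original
// uncompressed size are assumed to have been read from the file's metadata.
#include <lz4.h>
#include <cstdio>
#include <vector>

std::vector<char> decode_lz4_chunk(const std::vector<char> &compressed,
                                   size_t uncompressed_size) {
  std::vector<char> out(uncompressed_size);
  const int n = LZ4_decompress_safe(compressed.data(), out.data(),
                                    static_cast<int>(compressed.size()),
                                    static_cast<int>(out.size()));
  if (n < 0 || static_cast<size_t>(n) != uncompressed_size) {
    std::fprintf(stderr, "LZ4 chunk failed to decode\n");
    out.clear();
  }
  return out;
}
```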

Yes, I think we used the high-level interface for the hackathon in the interest of expedience. We will look into using the low-level interface instead.

Thanks; please let us know if you encounter any issues with this approach. We are looking into the same approach for Zarr for weather and climate data as well, and there is probably enough overlap between the two use cases.

For the hackathon, we only had a single large HDF5 chunk. In order to use the LLIF on such a chunk, you'd need to manually chunk the HDF5 data to pass to the LLIF. You'd need to store metadata regarding the size of each chunk. You'd also need to manage compacting the chunks together after compression.

The HLIF made sense in the context of a single large chunk because it handles the metadata, chunking, and compaction for you.

The LLIF however would make a ton of sense for a batched HDF5 read/writer as Akshay mentioned. In this case there's no further chunking, and thus no compaction and no metadata. You'd just need to provide smaller chunks directly to the LLIF, 1 per HDF5 write.
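
For illustration, the bookkeeping above could be as simple as a chunk count, the per-chunk size arrays, and the compacted compressed chunks written back to back. This is a hypothetical layout, not anything nvCOMP defines:

```cpp
// Hypothetical container layout for LLIF output (not an nvCOMP format):
//   [uint64 num_chunks]
//   [uint64 uncompressed_size[num_chunks]]
//   [uint64 compressed_size[num_chunks]]
//   [chunk 0 bytes][chunk 1 bytes]...            <- chunks compacted back to back
#include <cstdint>
#include <cstdio>
#include <vector>

bool write_container(std::FILE *f,
                     const std::vector<std::vector<char>> &compressed_chunks,
                     const std::vector<uint64_t> &uncompressed_sizes) {
  const uint64_t n = compressed_chunks.size();
  if (std::fwrite(&n, sizeof(n), 1, f) != 1) return false;
  if (std::fwrite(uncompressed_sizes.data(), sizeof(uint64_t), n, f) != n) return false;
  for (const auto &c : compressed_chunks) {       // per-chunk compressed sizes
    const uint64_t sz = c.size();
    if (std::fwrite(&sz, sizeof(sz), 1, f) != 1) return false;
  }
  for (const auto &c : compressed_chunks)         // compaction step
    if (std::fwrite(c.data(), 1, c.size(), f) != c.size()) return false;
  return true;
}
```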

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This issue does still need to be addressed. I believe that the ball is in my court to rewrite the interface using the low-level routines.

Hi Jim,

We'll include CPU HLIF decompression in the upcoming 3.0 release. It's not much work and we think it will be generally useful for the community.

I'm not sure I understand your comment, though. If you switch to LLIF, you won't need this anymore, right?

With the LLIF as-is, each chunk can be decompressed using existing CPU decompressors (mainline zstd, deflate, etc.).
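
For Zstd, for example, decoding one such chunk on the CPU is a single call into mainline libzstd. A sketch, assuming the uncompressed chunk size is stored alongside the compressed bytes:

```cpp
// Sketch: CPU-side Zstd decode of a single chunk using mainline libzstd.
// The uncompressed size is assumed to come from separately stored metadata.
#include <zstd.h>
#include <cstdio>
#include <vector>

std::vector<char> decode_zstd_chunk(const std::vector<char> &compressed,
                                    size_t uncompressed_size) {
  std::vector<char> out(uncompressed_size);
  const size_t n = ZSTD_decompress(out.data(), out.size(),
                                   compressed.data(), compressed.size());
  if (ZSTD_isError(n) || n != uncompressed_size) {
    std::fprintf(stderr, "Zstd chunk failed to decode: %s\n",
                 ZSTD_isError(n) ? ZSTD_getErrorName(n) : "size mismatch");
    out.clear();
  }
  return out;
}
```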

Thanks,
Eric

I haven't tried it yet; I would like to try it and confirm.


This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.