[BUG] The low-level interface for batched deflate decompression can only work for the first block
Autumn1998 opened this issue · 6 comments
I am trying to decode some deflate blocks that are packed into a byte stream. I preprocess the data to divide it into blocks. To align each block with its start point, some blocks are shifted left by a few bits, and if the end of a block is not byte-aligned, it is padded with zeros.
Then I feed these blocks to nvCOMP, but I get wrong results (only the first block decodes correctly) when invoking:
nvcompStatus_t decomp_res = nvcompBatchedDeflateDecompressAsync(
    device_compressed_ptrs,
    device_compressed_bytes,
    device_uncompressed_bytes,
    device_actual_uncompressed_bytes,
    batch_size,
    device_decomp_temp,
    decomp_temp_bytes,
    device_uncompressed_ptrs,
    device_statuses,
    stream);
It seems that every parameter is correct, because I get the correct result with:
for (int i = 0; i < batch_size; i++)
{
    nvcompStatus_t decomp_res = nvcompBatchedDeflateDecompressAsync(
        device_compressed_ptrs + i,
        device_compressed_bytes + i,
        device_uncompressed_bytes + i,
        device_actual_uncompressed_bytes + i,
        1,
        device_decomp_temp,
        decomp_temp_bytes,
        device_uncompressed_ptrs + i,
        device_statuses + i,
        stream);
}
I printed the values in "device_statuses", and the two methods show the same results:
check the state of 0 => state:11
check the state of 1 => state:11
check the state of 2 => state:11
check the state of 3 => state:11
check the state of 4 => state:11
check the state of 5 => state:11
...
check the state of 256 => state:0
Is only the last block's state SUCCESS?
All values returned by the functions are nvcompSuccess.
Are there any possible reasons for this problem?
Thanks for your kind attention.
By the way, nvcompBatchedDeflateDecompressGetTempSize(batch_size, max_chunk_size, &decomp_temp_bytes) always sets "decomp_temp_bytes" to 0. Does that mean deflate does not need a temp buffer at all?
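For reference, this is roughly how I query and allocate the temp buffer (a sketch of what I do; error checks omitted):
// Query the workspace size needed for the whole batch (always comes back 0 for me).
size_t decomp_temp_bytes = 0;
nvcompBatchedDeflateDecompressGetTempSize(
    batch_size, max_chunk_size, &decomp_temp_bytes);

// Allocate the workspace only if one is actually required.
void* device_decomp_temp = nullptr;
if (decomp_temp_bytes > 0) {
    cudaMalloc(&device_decomp_temp, decomp_temp_bytes);
}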
Hi! Is there any way that you could provide a full example that reproduces the issue? From what you've provided, it's difficult to guess what the issue might be. We do have tests that test the Deflate compressor with multiple blocks, both compression and decompression, to ensure that the data is the same as it was before.
A wild guess: How are you synchronizing before checking the results? nvcompBatchedDeflateDecompressAsync, like the other nvCOMP functions that launch CUDA kernels, launches them asynchronously, so that other work can overlap with them and more kernels can be launched in the meantime. If you don't synchronize the stream, you could end up reading stale data. You can find an example of how to use nvcompBatchedDeflateDecompressAsync and cudaStreamSynchronize here: https://github.com/NVIDIA/nvcomp/blob/main/examples/deflate_cpu_compression.cu
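In outline, the pattern is roughly this (a sketch reusing the names from your snippet; includes and error checks omitted):
// Launch the batched decompression; this call only enqueues work on the stream.
nvcompStatus_t decomp_res = nvcompBatchedDeflateDecompressAsync(
    device_compressed_ptrs,
    device_compressed_bytes,
    device_uncompressed_bytes,
    device_actual_uncompressed_bytes,
    batch_size,
    device_decomp_temp,
    decomp_temp_bytes,
    device_uncompressed_ptrs,
    device_statuses,
    stream);

// Wait for the kernels on this stream to finish before reading any output.
cudaStreamSynchronize(stream);

// Only now is it safe to copy the per-chunk statuses (and outputs) back to the host.
std::vector<nvcompStatus_t> statuses(batch_size);
cudaMemcpy(statuses.data(), device_statuses,
           batch_size * sizeof(nvcompStatus_t), cudaMemcpyDeviceToHost);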
Also, are you splitting the data into chunks before compression? If you take compressed data and arbitrarily split it into chunks afterward, that wouldn't work, because the chunks need to correspond with the chunks originally used for compression.
Thanks for your prompt reply.
cudaStreamSynchronize(stream) is called before checking the results.
I have never used the compression functions; I only use the deflate decompression interface to decode some deflate blocks.
I get the uncompressed block lengths with:
nvcompBatchedDeflateGetDecompressSizeAsync(
    device_compressed_ptrs,
    device_compressed_bytes,
    device_uncompressed_bytes,
    batch_size,
    stream);
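After that call I synchronize and copy the sizes back before allocating the output buffers, roughly like this (a sketch; error checks omitted):
// Make sure the size query has finished before reading its results.
cudaStreamSynchronize(stream);

// Copy the per-block uncompressed sizes back to the host.
std::vector<size_t> uncompressed_bytes(batch_size);
cudaMemcpy(uncompressed_bytes.data(), device_uncompressed_bytes,
           batch_size * sizeof(size_t), cudaMemcpyDeviceToHost);

// Allocate one output buffer per block and upload the pointer table.
std::vector<void*> uncompressed_ptrs(batch_size);
for (int i = 0; i < batch_size; i++) {
    cudaMalloc(&uncompressed_ptrs[i], uncompressed_bytes[i]);
}
cudaMemcpy(device_uncompressed_ptrs, uncompressed_ptrs.data(),
           batch_size * sizeof(void*), cudaMemcpyHostToDevice);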
There are n (parameter "batch_size" in the code) deflate blocks in total. The problem is that when I handle all n blocks in one nvcompBatchedDeflateDecompressAsync call, only the first block returns the correct result.
As shown in the code, when I instead run nvcompBatchedDeflateDecompressAsync n times, I get the correct result for all blocks.
That is:
// run all N blocks in one go
nvcompBatchedDeflateDecompressAsync(
    ...
    batch_size,
    ...
)
=> only the first block gets the correct result.

// loop over each block
for (int i = 0; i < batch_size; i++)
{
    // handle 1 block
    nvcompBatchedDeflateDecompressAsync(
        ...
        1,
        ...
    )
}
=> all blocks get the correct result.
Attached are some of the deflate blocks I use. Each file stores one deflate block as a byte stream.
DeflateBlocks.zip
This is the main code used in my program:
deflate_code.txt
I check the results against a CPU library.
I would appreciate it if you could point out any issues in my usage.
@Autumn1998 That is the right behavior actually. With deflate blocks, only the entropy decoding is independent across blocks, but LZ match copies create dependencies between blocks. For example, a match copy code of the second block can reference the result of the first block’s decompressed stream. This is probably why running serially works but running them concurrently does not.
nvcomp’s deflate algorithm decompresses a whole deflate stream rather than just deflate blocks because of that LZ block dependency behavior. To use the batched decompression API, you would need to pass in multiple deflate streams and each stream may have one or more deflate blocks. But all that the nvcomp API cares about is how many streams there are.
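If you control the producer side, one way to get independent streams is to compress each chunk as its own raw deflate stream on the CPU, for example with zlib (a sketch only, not the nvcomp example code; negative windowBits gives raw deflate with no zlib header/trailer):
#include <zlib.h>
#include <string.h>

// Compress one chunk into its own raw deflate stream (windowBits = -15,
// i.e. no zlib header/trailer). Returns the compressed size, or 0 on error.
size_t compress_chunk_raw_deflate(const void* src, size_t src_bytes,
                                  void* dst, size_t dst_capacity)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15 /* raw deflate */, 8, Z_DEFAULT_STRATEGY) != Z_OK) {
        return 0;
    }
    strm.next_in   = (Bytef*)src;
    strm.avail_in  = (uInt)src_bytes;
    strm.next_out  = (Bytef*)dst;
    strm.avail_out = (uInt)dst_capacity;

    // Z_FINISH flushes and terminates the stream, so the chunk is
    // self-contained: no LZ back-references can cross chunk boundaries.
    int ret = deflate(&strm, Z_FINISH);
    size_t out_bytes = (ret == Z_STREAM_END) ? strm.total_out : 0;
    deflateEnd(&strm);
    return out_bytes;
}
Each chunk compressed this way is then one entry in device_compressed_ptrs / device_compressed_bytes for the batched decompression call.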
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.