NVIDIA/nvcomp

[QST] Is the max_compressed_buffer_size in CompressionConfig correct?

XLzed opened this issue · 8 comments

XLzed commented

What is your question?
I followed the high_level_quickstart_example and tried to use the high-level API, but I found that the value returned by get_compressed_output_size() can be larger than max_compressed_buffer_size in CompressionConfig. This always happens when chunk_size is equal to input_buffer_len. When I then tried to copy the compressed data from device to host using compressed_output_size, cudaMemcpy() failed because the buffer was allocated with max_compressed_buffer_size, which is less than compressed_output_size.
So how do I allocate enough memory for the compressed buffer and get the correct size of the compressed output?

XLzed commented

The code I tested:

  const int chunk_size = 1 << 16;
  size_t input_buffer_len = 1 << 16;
  nvcompType_t data_type = NVCOMP_TYPE_CHAR;
  LZ4Manager nvcomp_manager{chunk_size, data_type, stream};
  CompressionConfig comp_config = nvcomp_manager.configure_compression(input_buffer_len);

  uint8_t* comp_buffer;
  CUDA_CHECK(cudaMalloc(&comp_buffer, comp_config.max_compressed_buffer_size));
  nvcomp_manager.compress(device_input_ptrs, comp_buffer, comp_config);

  size_t compressed_size = nvcomp_manager.get_compressed_output_size(comp_buffer);
  std::cout << "max_compressed_buffer_size: " << comp_config.max_compressed_buffer_size << std::endl
            << "actual compressed_buffer_size: " << compressed_size << std::endl;

  // Copy the compressed data back to the host (pinned host allocation).
  uint8_t* comp_buffer_cpu;
  CUDA_CHECK(cudaMallocHost(&comp_buffer_cpu, compressed_size));
  CUDA_CHECK(cudaMemcpy(comp_buffer_cpu, comp_buffer, compressed_size, cudaMemcpyDeviceToHost));

Outputs:

max_compressed_buffer_size: 65888
actual compressed_buffer_size: 65890
Failure

The actual compressed size is larger than max_compressed_buffer_size, which made the cudaMemcpy fail.

The function that calculates the max compressed output size:

  size_t calculate_max_compressed_output_size(CompressionConfig& comp_config) final override
  {
    const size_t comp_buffer_size = max_comp_chunk_size * comp_config.num_chunks;

    const size_t chunk_offsets_size = sizeof(ChunkStartOffset_t) * comp_config.num_chunks;
    const size_t chunk_sizes_size = sizeof(uint32_t) * comp_config.num_chunks;
    // *2 for decomp and comp checksums
    const size_t checksum_size = sizeof(Checksum_t) * comp_config.num_chunks * 2;

    return sizeof(CommonHeader) + sizeof(FormatSpecHeader) + 
        chunk_offsets_size + chunk_sizes_size + checksum_size + comp_buffer_size;
  }

The function do_compress() in class BatchManager:

    // Pad so that the comp chunk offsets are properly aligned
    compress_args.comp_chunk_offsets = roundUpToAlignment<size_t>(comp_buffer);
    compress_args.comp_chunk_sizes = compress_args.comp_chunk_offsets + comp_config.num_chunks;    
    uint32_t* comp_chunk_checksums = reinterpret_cast<uint32_t*>(compress_args.comp_chunk_sizes + comp_config.num_chunks);
    uint32_t* decomp_chunk_checksums = comp_chunk_checksums + comp_config.num_chunks;
    compress_args.comp_buffer = reinterpret_cast<uint8_t*>(decomp_chunk_checksums + comp_config.num_chunks);

The function calculate_max_compressed_output_size() does not take into account the padding introduced by the address alignment.

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Could you provide more info about the data you're compressing? I put uniformly random bytes into your code snippet, and I get:

max_compressed_buffer_size: 65888
actual compressed_buffer_size: 65882

I can also confirm that the address in comp_buffer is a multiple of 256, as cudaMalloc guarantees, so it shouldn't be affected by the alignment to 8 bytes (sizeof(size_t)). It might be that a bug has been fixed since the version of nvcomp you were using.

Also, sorry for taking so long to investigate this!

XLzed commented

The data is generated by the code in high_level_quickstart_example.cpp. The example runs successfully on versions 2.4 and 2.3.3, and the output is the same as what you get. What I tested before was branch-2.2, so the bug was fixed in version 2.3.
I have a question: has the metadata of the high-level API changed since version 2.3? And does nvcomp support compressing on the CPU and decompressing on the GPU via the high-level API? On branch-2.2 I added the additional metadata required by the high-level API when compressing on the CPU, then decompressed on the GPU using the high-level API; I don't know whether this still works on version 2.4.

Ah. Good to know that it seems to be fixed. Thanks! I did see some changes in the BatchManager code since March 1st (sorry for taking so long to reply), but didn't look too closely.

I don't think the high-level API has changed much since 2.3, and I'm not sure whether the metadata changed between 2.2 and 2.3. Manually adding the metadata so that data from the CPU compressors can go into the high-level API GPU decompressors is not an intended, supported workflow, but maybe we could add something for explicitly setting up that metadata. That way, people wouldn't have to figure it out from scratch and hope that it doesn't change. However, you should still be able to use the low-level API GPU decompressors, as in examples such as https://github.com/NVIDIA/nvcomp/blob/main/examples/lz4_cpu_compression.cu
