NVIDIA/nvcomp

[BUG] Cascaded Illegal Memory Access with High Compression Ratio

NicolasDenoyelle opened this issue · 10 comments

Describe the bug

The nvcomp cascaded manager runs into an illegal memory access when reading the compressed output size after compressing highly compressible data.

Steps/Code to reproduce bug

Reproducer: nvcomp-reproducer.cpp.txt

Compile with:

g++ -g -Wall -Wextra -std=c++17 -I/usr/local/cuda/include -I<path_to_nvcomp.h> nvcomp-reproducer.cpp -o reproducer -lcudart -lnvcomp && ./reproducer

Expected behavior

The reproducer I attached compresses two buffers of the same size with the same cascaded manager.
One buffer is filled with random data and works fine.
The other buffer is filled with a sequence of {0,1,0,1,...} and triggers the error.
In the reproducer, I made sure the buffer sizes are appropriate:

  • a multiple of the nvcompType_t being compressed
  • a multiple of the cascaded interface chunk sizes
  • at least the maximum compressed size given by nvcomp prior to compression.

Environment details:

  • Bare-metal
  • Cuda Driver Version: 535.154.05
  • CUDA Version: 12.2
  • NVIDIA GeForce RTX 3050
  • nvcomp Version: 3.0.4 (no extensions)

Unfortunately, it seems like I can't access the nvcomp-reproducer.cpp.txt file you linked to anymore. I looked at it briefly a week or so ago, so it was probably there at one point. Could you post it again? I think I might have found something, but without the repro code I can't tell whether it would cause this.

Below is the code.
I tried running it again with version 3.0.6 instead of 3.0.4, and the bug appears to be fixed.

#include <cuda_runtime.h>
#include <nvcomp/shared_types.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <nvcomp.hpp>
#include <nvcomp/cascaded.hpp>
#include <nvcomp/nvcompManager.hpp>
#include <vector>

static void AbortWithError(const char* file, int line, const char* call,
                           const char* err) {
  std::cerr << file << ":" << line << " " << call << ": " << err << std::endl;
  abort();
}

#define ASSERT_CUDA_SUCCESS(call)                                                \
  do {                                                                           \
    const auto cudaResult = call;                                                \
    if (cudaResult != cudaSuccess) {                                             \
      AbortWithError(__FILE__, __LINE__, #call, cudaGetErrorString(cudaResult)); \
    }                                                                            \
  } while (0)

class DeviceBuffer {
 public:
  ~DeviceBuffer() {
    if (data_) {
      ASSERT_CUDA_SUCCESS(cudaFree(data_));
    }
  }
  DeviceBuffer() : data_(nullptr), size_(0), capacity_(0) {}
  DeviceBuffer(size_t size) : size_(size), capacity_(size) {
    ASSERT_CUDA_SUCCESS(cudaMalloc(&data_, size));
  }
  DeviceBuffer(const std::vector<char>& host_buffer) : DeviceBuffer(host_buffer.size()) {
    ASSERT_CUDA_SUCCESS(
        cudaMemcpy(data_, host_buffer.data(), size_, cudaMemcpyHostToDevice));
  }
  DeviceBuffer(const DeviceBuffer&) = delete;
  DeviceBuffer(DeviceBuffer&& other) : DeviceBuffer() { *this = std::move(other); }
  DeviceBuffer& operator=(const DeviceBuffer&) = delete;
  DeviceBuffer& operator=(DeviceBuffer&& other) {
    std::swap(data_, other.data_);
    std::swap(size_, other.size_);
    std::swap(capacity_, other.capacity_);
    return *this;
  }

  void resize(std::size_t size) {
    if (size <= capacity_) {
      size_ = size;
    } else {
      void* data;
      ASSERT_CUDA_SUCCESS(cudaMalloc(&data, size));
      ASSERT_CUDA_SUCCESS(cudaMemcpy(data, data_, size_, cudaMemcpyDeviceToDevice));
      ASSERT_CUDA_SUCCESS(cudaFree(data_));
      data_ = data;
      size_ = size;
      capacity_ = size;
    }
  }
  [[nodiscard]] size_t size() const { return size_; }
  [[nodiscard]] uint8_t* data() { return reinterpret_cast<uint8_t*>(data_); }
  [[nodiscard]] const uint8_t* data() const {
    return reinterpret_cast<const uint8_t*>(data_);
  }

  bool operator==(const DeviceBuffer& other) const {
    if (other.size_ != size_) {
      return false;
    }

    std::vector<char> lhs(size_, 0);
    std::vector<char> rhs(other.size_, 0);
    ASSERT_CUDA_SUCCESS(cudaMemcpy(lhs.data(), data_, size(), cudaMemcpyDeviceToHost));
    ASSERT_CUDA_SUCCESS(
        cudaMemcpy(rhs.data(), other.data_, other.size(), cudaMemcpyDeviceToHost));
    return lhs == rhs;
  }

 private:
  void* data_;
  size_t size_;
  size_t capacity_;
};

static void CompressionRoundTrip(nvcomp::nvcompManagerBase& manager,
                                 const DeviceBuffer& input,
                                 const size_t decomp_chunk_size) {
  auto decompressed = DeviceBuffer(input.size());
  std::cerr << "Compressing " << input.size() << " Bytes." << std::endl;

  auto compression_cfg = std::make_unique<nvcomp::CompressionConfig>(
      manager.configure_compression(input.size()));
  assert(*compression_cfg->get_status() == nvcompSuccess);

  DeviceBuffer compressed(
      std::max(decomp_chunk_size, compression_cfg->max_compressed_buffer_size));
  manager.compress(input.data(), compressed.data(), *compression_cfg);
  ASSERT_CUDA_SUCCESS(cudaStreamSynchronize(cudaStreamDefault));

  auto compressed_size = manager.get_compressed_output_size(compressed.data());
  assert(*compression_cfg->get_status() == nvcompSuccess);
  compressed.resize(compressed_size);

  std::cerr << "Compressed into " << compressed_size << " Bytes." << std::endl;

  auto decompression_cfg = std::make_unique<nvcomp::DecompressionConfig>(
      manager.configure_decompression(compressed.data()));
  assert(*decompression_cfg->get_status() == nvcompSuccess);
  manager.decompress(decompressed.data(), compressed.data(), *decompression_cfg);
  assert(decompressed == input);
}

static std::vector<char> MakeRegularInputData(const size_t size) {
  std::vector<char> data(size, 0);
  for (size_t i = 0; i < size; i++) {
    data[i] = i % 2;
  }
  return data;
}

static std::vector<char> MakeRandomInputData(const size_t size) {
  std::vector<char> data(size, 0);
  for (size_t i = 0; i < size; i++) {
    data[i] = rand() % 2;
  }
  return data;
}

int main() {
  constexpr size_t kSize = 1 << 20;
  constexpr size_t kDecompChunkSize = 4096;
  constexpr size_t kCompChunkSize = 4096;
  constexpr nvcompBatchedCascadedOpts_t opts = {kDecompChunkSize, NVCOMP_TYPE_SHORT, 2, 2,
                                                0};
  nvcomp::CascadedManager manager(kCompChunkSize, opts,
                                  static_cast<cudaStream_t>(cudaStreamDefault), 0,
                                  NoComputeNoVerify);

  CompressionRoundTrip(manager, DeviceBuffer(MakeRandomInputData(kSize)),
                       kDecompChunkSize); // Success
  CompressionRoundTrip(manager, DeviceBuffer(MakeRegularInputData(kSize)),
                       kDecompChunkSize); // Error
  return 0;
}

It looks like this isn't fixed in the latest nvcomp, and it might just be a coincidence that it happens to no longer crash for you. The bug seems to occur when there's a chunk that consists entirely of 1 repeated value and there's at least 1 RLE pass and 2 delta passes. The RLE is run first, reducing it to a single value and a run length, and outputting the run length. The first delta just outputs the one value, because it has no following value to subtract it from. On the second delta pass, there are no values left, and there's a chance that it may crash. I'll try to fix the handling of the case where it runs out of values to process.

Ok, thanks for looking into this.
I am reopening the issue so you can close it when you think the bug is addressed.
I hope you can trigger the crash on your side. 🤞

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@ndickson-nvidia I am attaching a reproducer here.
I was able to trigger the same issue again with a buffer filled with 0s and other nvcomp cascaded options.

Observed behaviour

nvcomp-reproducer-cascaded.cpp.txt
Upon execution of the reproducer you should see the following error output:

Compressing 8192 Bytes.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Encountered Cuda Error: 700: 'an illegal memory access was encountered'.
Aborted (core dumped)

The same error occurs with the following cascaded options:

{.chunk_size = 4096, .type = NVCOMP_TYPE_INT, .num_RLEs = 1, .num_deltas = 2, .use_bp = 0}
{.chunk_size = 4096, .type = NVCOMP_TYPE_INT, .num_RLEs = 1, .num_deltas = 2, .use_bp = 1}
{.chunk_size = 4096, .type = NVCOMP_TYPE_INT, .num_RLEs = 2, .num_deltas = 2, .use_bp = 0}
{.chunk_size = 4096, .type = NVCOMP_TYPE_INT, .num_RLEs = 2, .num_deltas = 2, .use_bp = 1}

Environment details:

  • Bare-metal
  • Cuda Driver Version: 535.171.04
  • CUDA Version: 12.2
  • NVIDIA GeForce RTX 3050
  • nvcomp Version: 3.0.6 (no extensions)