CUDA 10 :: Thrust sort is throwing exception for device vector of 23330 float elements for gpu architecture 'compute_61'
itczar opened this issue · 16 comments
For type float, if the number of elements is, say, 2330, there is no issue. But if the number is 23330, then thrust::sort throws an exception saying "radix sort failed on 2nd step invalid argument".
Please help.
The graphics card in use is a Quadro P2000 and the CUDA version is 10.
Hi,
This report does not have enough information to be actionable. Please read the guidelines here and provide an updated report.
Initially I thought it was a problem specifically related to memory size, but that does not seem to be the case.
The following code does not behave correctly with the compiler option -gencode arch=compute_61,code=compute_61:
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstdio>

int count = 23330;
float *d_vec;
cudaMalloc(&d_vec, sizeof(float) * count); // not populating, just for testing
cudaStreamSynchronize(gpuStream);          // gpuStream is a previously created cudaStream_t
checkCudaErrors(cudaGetLastError());
try {
    thrust::sort(thrust::system::cuda::par.on(gpuStream), d_vec, d_vec + count);
    cudaStreamSynchronize(gpuStream);
}
catch (const std::runtime_error& re) // thrust::system_error derives from std::runtime_error
{
    printf("EXCEPTION****************>>>>>>>>>>>>%s", re.what());
}
checkCudaErrors(cudaGetLastError());
When this is run with Thrust 1.9.3 (shipped with CUDA 10), passing -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61 as compiler options, it throws the exception "radix sort failed on 2nd step invalid argument".
However, when -gencode arch=compute_61,code=compute_61 is removed and only -gencode arch=compute_50,code=compute_50 is used, it works fine.
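For reference, the failing and working build lines look roughly like this (repro.cu is a placeholder file name):

nvcc -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61 repro.cu -o repro   # throws at runtime
nvcc -gencode arch=compute_50,code=compute_50 repro.cu -o repro                                            # works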
The problem seems to be related to the compute_61 option, i.e. compute capability 6.1.
This problem is 100% reproducible.
System details:
OS - Red Hat-based Linux (kernel 2.6.32-696.30.1.el6.x86_64)
GPU - Quadro P2000 (compute capability 6.1)
Language - C++
Compiler - g++, ICC
Please suggest a workaround, or a previous stable Thrust version compatible with CUDA 10, so that the related work can be unblocked.
I even ran with Thrust 1.9.2 (shipped with CUDA 9.2), i.e. CUDA 10 + Thrust 1.9.2, and it works fine without any issue for compute_61.
I have the same problem with exclusive scan. Here is the code:
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 11; vec[2] = 12;
thrust::exclusive_scan(thrust::device, vec.begin(), vec.end(), vec.begin());
and the error I am getting is:
C++ exception with description "scan failed on 2nd step: invalid argument" thrown in the test.
The thing is that this code fails randomly and sometimes succeeds. I tried CUDA 9.0, 9.2, and 10.0, all with arch 6.1 on a Titan X GPU.
Does anyone have a solution? @itczar @brycelelbach
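For anyone trying to reproduce: here is a self-contained version of the snippet above, as a sketch. It catches thrust::system_error, which is the type Thrust actually throws on backend failures (it derives from std::runtime_error).

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/system_error.h>
#include <cstdio>

int main()
{
    try {
        thrust::device_vector<int> vec(3);
        vec[0] = 10; vec[1] = 11; vec[2] = 12;
        // In-place exclusive scan; expected output: 0 10 21
        thrust::exclusive_scan(thrust::device, vec.begin(), vec.end(), vec.begin());
        for (int i = 0; i < 3; ++i)
            printf("%d ", static_cast<int>(vec[i]));
        printf("\n");
    } catch (const thrust::system_error& e) {
        printf("EXCEPTION: %s\n", e.what());
        return 1;
    }
    return 0;
}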
Also having this issue with a compute capability 6.1 GPU and CUDA 10.2. As others have said, this first appeared after CUDA 9.0.
Update: After digging into the source of the thrown exception, it is happening inside dispatch_radix_sort.cuh. I set the Thrust debug variable to true to see the invocation of the kernels.
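(For anyone who wants the same output: as far as I can tell, in the Thrust versions of this era that logging is enabled by compiling with -DTHRUST_DEBUG_SYNC, e.g.:

nvcc -DTHRUST_DEBUG_SYNC -gencode arch=compute_61,code=sm_61 repro.cu -o repro

where repro.cu is again a placeholder file name.)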
In my particular case, I am running sort_by_key which, given the inputs, selects radix sort as the function. For small input sizes, the dispatch selects the single_tile_kernel, which runs without issue.
For larger input sizes, the dispatch selects the 'normal problem size' invocation, which uses a 3-step pass: an upsweep, a scan, and then a downsweep. In this call, the very first attempt to run upsweep_kernel throws the error.
The debug output for a ~1000k input size is:
Invoking upsweep_kernel<<<140, 96, 0, 163304944>>>(), 39 items per thread, 10 SM occupancy, current bit 0, bit_grain 6
It seems that the thread count of 96 is causing the problem. When I set it to the next lowest multiple of 32 (64), the kernel executes without any issue (but the overall function eventually fails with cudaErrorIllegalAddress, likely because the expected memory addresses no longer match).
This is running on a P4000 GPU with CUDA 10.2. I believe there must be an incorrect calculation in determining the number of threads per block, which is causing this error. Any insights from the NVIDIA team @brycelelbach? In the meantime I'm going to hunt for where this calculation is done to see if I can correct it manually.
Update 2: Following what @itczar mentioned above, I compiled for different SM_XX targets (my code defaults to the GPU's maximum capability, which is 6.1). When I forced it to compile for sm_60 instead of sm_61, sort_by_key ran fine with the same kernel invocation inputs. My guess is that something is wrong with how the calculations are done for different arch compiles.
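Concretely, the difference was between build lines along these lines (file name again a placeholder):

nvcc -gencode arch=compute_61,code=sm_61 repro.cu -o repro   # sort_by_key throws
nvcc -gencode arch=compute_60,code=sm_60 repro.cu -o repro   # sort_by_key runs fine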
Is there any news on this topic? I ran into the same issue (thrust::sort_by_key), but compiling for different architectures did not result in working code. I can confirm @Cartoonman's finding that something is going on in dispatch_radix_sort.cuh.
If this cannot be fixed, is there any workaround in Thrust, or a different CUDA-based library I can use?
Windows 10, CUDA 10.2
RTX 6000 with compute capability 7.5
The call to the Thrust function looks something like this:
thrust::sort_by_key(thrust::cuda::par.on(myStream),
                    thrust::raw_pointer_cast(d_allDistances),
                    thrust::raw_pointer_cast(d_allDistances) + numberOfTargetCoordinates,
                    thrust::raw_pointer_cast(d_allIndexes),
                    thrust::less<real32_T>());
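In case it makes reproducing easier, the same call can also be written with thrust::device_ptr instead of raw pointers; a sketch, assuming d_allDistances and d_allIndexes are raw device pointers, d_allIndexes points to int, and real32_T is a typedef for float:

thrust::device_ptr<real32_T> keys(d_allDistances);
thrust::device_ptr<int>      vals(d_allIndexes);
thrust::sort_by_key(thrust::cuda::par.on(myStream),
                    keys, keys + numberOfTargetCoordinates,
                    vals, thrust::less<real32_T>());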
Thank you very much, any help is very much appreciated!
We really need a minimal test case that reproduces the problem. Please see these guidelines to understand what we're looking for. I haven't been able to reproduce this myself yet.
I have the same issue as mentioned above. When I use
thrust::sort_by_key(thrust::device, pbegin, pend, ibegin);
on a Quadro T1000, the program reports the error mentioned above, depending on the input size. If I sort 102,499 elements the program runs fine. If I instead choose to sort 102,500 elements, the program fails with the message:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): radix_sort: failed on 2nd step: cudaErrorInvalidConfiguration: invalid configuration argument
I am working on Ubuntu 20.04 and nvcc --version reports: release 10.1, V10.1.243. Does somebody know what to do in this case?
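To pin down the boundary more precisely, a small probe like the following can help; a sketch that sorts zero-initialized keys of increasing size and reports the first size that throws:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/system_error.h>
#include <cstdio>

int main()
{
    // Probe sizes around the reported boundary (102,499 works, 102,500 fails).
    for (int n = 102490; n <= 102510; ++n) {
        try {
            thrust::device_vector<float> keys(n);
            thrust::device_vector<int>   vals(n);
            thrust::sort_by_key(thrust::device, keys.begin(), keys.end(), vals.begin());
        } catch (const thrust::system_error& e) {
            printf("first failing size: %d (%s)\n", n, e.what());
            return 1;
        }
    }
    printf("all probed sizes sorted fine\n");
    return 0;
}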
We've been trying to find a repro for #936 for a while, but haven't been able to replicate / debug it to figure out what's going on. If anyone can find a thrust-only C++ minimal reproduction please share it here so we can take a look.
I suspect that this may have been fixed in CTK 11.4 (Thrust/CUB 1.12) by NVIDIA/cub@63e2ad4, which fixed a lot of overflows that may result in InvalidConfiguration errors.
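If it helps triage, you can check which Thrust you are actually compiling against via the THRUST_VERSION macro (which, if I remember the encoding right, is major * 100000 + minor * 100 + patch):

#include <thrust/version.h>
#include <cstdio>

int main()
{
    // e.g. 100903 for the Thrust 1.9.3 that ships with CUDA 10
    printf("THRUST_VERSION = %d (%d.%d.%d)\n", THRUST_VERSION,
           THRUST_VERSION / 100000, (THRUST_VERSION / 100) % 1000,
           THRUST_VERSION % 100);
    return 0;
}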
I don't have a minimal (i.e., Thrust-only) or reliable repro of this, but I did see this error somewhat frequently while working with cuML (specifically, the UMAP algorithm).
Based on what little I know, I would doubt it was because of the overflow issues mentioned in NVIDIA/cub@63e2ad4, mainly because:
- The error happens randomly, about 1 out of 20 times, but the cuML UMAP algorithm was run with a fixed PRNG seed of 0 each time (so in theory the algorithm itself should not introduce non-determinism to the process).
- The same error either does not happen at all or happens much less frequently than 1 out of 20 times when running with cuda-gdb attached, which is interesting. Could it be one of those really annoying heisenbugs? :/ Would cuda-gdb change overflow behavior or cause some overflow not to happen?
- I only have a trivial number of data points as input to the algorithm, and I'm sure the total number of bytes in the input or the output of the algorithm will not overflow an int32_t.
One other detail: this happens to both merge_sort and radix_sort in Thrust about equally frequently while I was playing around with the UMAP algorithm from cuML, i.e.,
merge_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument
radix_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument
I don't know whether this might be helpful for tracking down the issue.
If it's happening intermittently, you can try running your application through compute-sanitizer (or cuda-memcheck on older versions of CUDA) to check for various runtime issues. There may be a hidden race condition or bad memory access that causes the inconsistent failures.
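For example (./my_app is a placeholder for your binary; memcheck is compute-sanitizer's default tool):

compute-sanitizer ./my_app
compute-sanitizer --tool racecheck ./my_app
cuda-memcheck ./my_app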
@yitao-li, does it only happen when working with cuml, or are you also able to create a stand-alone reproducer? Which CUDA version are you on? What I'm reading, at least in your case, does remind me a bit of #1400 (comment)