CUDA 10 :: Thrust sort is throwing exception for device vector of 23330 float elements for gpu architecture 'compute_61'
itczar opened this issue · 16 comments
For type float, if the number of elements is, say, 2330, there is no issue. But if the number is 23330, then thrust::sort throws an exception saying "radix sort failed on 2nd step invalid argument".
Please help.
The graphics card in use is a Quadro P2000 and the CUDA version is 10.
Hi,
This report does not have enough information to be actionable. Please read the guidelines here and provide an updated report.
Initially I thought it was a problem specifically related to memory size, but that does not seem to be the case.
The following code does not behave correctly with the compiler option -gencode arch=compute_61,code=compute_61:
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstdio>

int count = 23330;
float *d_vec;
cudaMalloc(&d_vec, sizeof(float) * count); // not populating, just for testing
cudaStreamSynchronize(gpuStream);          // gpuStream is a previously created cudaStream_t
checkCudaErrors(cudaGetLastError());
try {
    thrust::sort(thrust::system::cuda::par.on(gpuStream), d_vec, d_vec + count);
    cudaStreamSynchronize(gpuStream);
}
catch (const std::runtime_error& re) // thrust::system_error derives from std::runtime_error
{
    printf("EXCEPTION****************>>>>>>>>>>>>%s", re.what());
}
checkCudaErrors(cudaGetLastError());
When this is run with Thrust 1.9.3 (shipped with CUDA 10), passing -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61 as compiler options, it throws the exception "radix sort failed on 2nd step invalid argument".
However, when -gencode arch=compute_61,code=compute_61 is removed and only -gencode arch=compute_50,code=compute_50 is used, it works fine.
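For reference, the failing and working build lines look roughly like this (repro.cu is a placeholder file name):

nvcc -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61 repro.cu -o repro   # throws at runtime
nvcc -gencode arch=compute_50,code=compute_50 repro.cu -o repro                                            # works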
The problem seems to be related to the compute_61 option, i.e. compute capability 6.1.
This problem is 100% reproducible.
System details:
OS - Red Hat-based Linux (kernel 2.6.32-696.30.1.el6.x86_64)
GPU - Quadro P2000 (compute capability 6.1)
Language - C++
Compiler - g++, ICC
Please suggest a workaround, or a previous stable Thrust version compatible with CUDA 10, so that the related work can be unblocked.
I even ran with Thrust 1.9.2 (shipped with CUDA 9.2), i.e. CUDA 10 + Thrust 1.9.2, and it works fine without any issue for compute_61.
I have the same problem with exclusive scan. Here is the code:
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 11; vec[2] = 12;
thrust::exclusive_scan(thrust::device, vec.begin(), vec.end(), vec.begin());
and the error I am getting is:
C++ exception with description "scan failed on 2nd step: invalid argument" thrown in the test.
The thing is that this code fails randomly and sometimes succeeds. I tried CUDA 9.0, 9.2, and 10.0, all with arch 6.1 on a Titan X GPU.
Does anyone have a solution? @itczar @brycelelbach
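For anyone trying to reproduce: here is a self-contained version of the snippet above, as a sketch. It catches thrust::system_error, which is the type Thrust actually throws on backend failures (it derives from std::runtime_error).

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/system_error.h>
#include <cstdio>

int main()
{
    try {
        thrust::device_vector<int> vec(3);
        vec[0] = 10; vec[1] = 11; vec[2] = 12;
        // In-place exclusive scan; expected output: 0 10 21
        thrust::exclusive_scan(thrust::device, vec.begin(), vec.end(), vec.begin());
        for (int i = 0; i < 3; ++i)
            printf("%d ", static_cast<int>(vec[i]));
        printf("\n");
    } catch (const thrust::system_error& e) {
        printf("EXCEPTION: %s\n", e.what());
        return 1;
    }
    return 0;
}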
Also having this issue with a compute capability 6.1 GPU and CUDA 10.2. As others have said, this first appeared after CUDA 9.0.
Update: After digging into the source of the thrown exception, it is happening inside dispatch_radix_sort.cuh. I set the Thrust debug variable to true to see the invocation of the kernels.
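(For anyone who wants the same output: as far as I can tell, in the Thrust versions of this era that logging is enabled by compiling with -DTHRUST_DEBUG_SYNC, e.g.:

nvcc -DTHRUST_DEBUG_SYNC -gencode arch=compute_61,code=sm_61 repro.cu -o repro

where repro.cu is again a placeholder file name.)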
In my particular case, I am running sort_by_key which, given the inputs, selects radix sort as the function. For small input sizes, the dispatch selects the single_tile_kernel, which runs without issue.
For larger input sizes, the dispatch selects the 'normal problem size' invocation, which uses a 3-step pass: an upsweep, a scan, and then a downsweep. In this call, the very first attempt to run upsweep_kernel throws the error.
The debug output for a ~1000k input size is:
Invoking upsweep_kernel<<<140, 96, 0, 163304944>>>(), 39 items per thread, 10 SM occupancy, current bit 0, bit_grain 6
It seems that the thread count of 96 is causing the problem. When I set it to the next lowest multiple of 32 (64), the kernel executes without any issue (but the overall function eventually fails with cudaErrorIllegalAddress, likely because the expected memory addresses no longer match).
This is running on a P4000 GPU with CUDA 10.2. I believe there must be an incorrect calculation in determining the number of threads per block, which is causing this error. Any insights from the NVIDIA team @brycelelbach? In the meantime I'm going to hunt for where this calculation is done to see if I can correct it manually.
Update 2: Following what @itczar mentioned above, I compiled for different SM_XX targets (my code defaults to the GPU's maximum capability, which is 6.1). When I forced it to compile for sm_60 instead of sm_61, sort_by_key ran fine with the same kernel invocation inputs. My guess is that something is wrong with how the calculations are done for different arch compiles.
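Concretely, the difference was between build lines along these lines (file name again a placeholder):

nvcc -gencode arch=compute_61,code=sm_61 repro.cu -o repro   # sort_by_key throws
nvcc -gencode arch=compute_60,code=sm_60 repro.cu -o repro   # sort_by_key runs fine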
Is there any news on this topic? I ran into the same issue (thrust::sort_by_key), but compiling for different architectures did not result in working code. I can confirm @Cartoonman's finding that something is going on in dispatch_radix_sort.cuh.
If this cannot be fixed, is there any workaround in Thrust, or a different CUDA-based library I can use?
Windows 10, CUDA 10.2
RTX 6000 with compute capability 7.5
The call to the Thrust function looks something like this:
thrust::sort_by_key(thrust::cuda::par.on(myStream),
                    thrust::raw_pointer_cast(d_allDistances),
                    thrust::raw_pointer_cast(d_allDistances) + numberOfTargetCoordinates,
                    thrust::raw_pointer_cast(d_allIndexes),
                    thrust::less<real32_T>());
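In case it makes reproducing easier, the same call can also be written with thrust::device_ptr instead of raw pointers; a sketch, assuming d_allDistances and d_allIndexes are raw device pointers, d_allIndexes points to int, and real32_T is a typedef for float:

thrust::device_ptr<real32_T> keys(d_allDistances);
thrust::device_ptr<int>      vals(d_allIndexes);
thrust::sort_by_key(thrust::cuda::par.on(myStream),
                    keys, keys + numberOfTargetCoordinates,
                    vals, thrust::less<real32_T>());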
Thank you very much, any help is very much appreciated!
We really need a minimal test case that reproduces the problem. Please see these guidelines to understand what we're looking for. I haven't been able to reproduce this myself yet.
I have the same issue as mentioned above. When I use
thrust::sort_by_key(thrust::device, pbegin, pend, ibegin);
on a Quadro T1000, the program reports the error mentioned above, depending on the input size. If I sort 102,499 elements the program runs fine. If I instead choose to sort 102,500 elements, the program fails with the message:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): radix_sort: failed on 2nd step: cudaErrorInvalidConfiguration: invalid configuration argument
I am working on Ubuntu 20.04 and nvcc --version reports: release 10.1, V10.1.243. Does somebody know what to do in this case?
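To pin down the boundary more precisely, a small probe like the following can help; a sketch that sorts zero-initialized keys of increasing size and reports the first size that throws:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/system_error.h>
#include <cstdio>

int main()
{
    // Probe sizes around the reported boundary (102,499 works, 102,500 fails).
    for (int n = 102490; n <= 102510; ++n) {
        try {
            thrust::device_vector<float> keys(n);
            thrust::device_vector<int>   vals(n);
            thrust::sort_by_key(thrust::device, keys.begin(), keys.end(), vals.begin());
        } catch (const thrust::system_error& e) {
            printf("first failing size: %d (%s)\n", n, e.what());
            return 1;
        }
    }
    printf("all probed sizes sorted fine\n");
    return 0;
}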
We've been trying to find a repro for #936 for a while, but haven't been able to replicate / debug it to figure out what's going on. If anyone can find a thrust-only C++ minimal reproduction please share it here so we can take a look.
I suspect that this may have been fixed in CTK 11.4 (Thrust/CUB 1.12) by NVIDIA/cub@63e2ad4, which fixed a lot of overflows that may result in InvalidConfiguration errors.
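If it helps triage, you can check which Thrust you are actually compiling against via the THRUST_VERSION macro (which, if I remember the encoding right, is major * 100000 + minor * 100 + patch):

#include <thrust/version.h>
#include <cstdio>

int main()
{
    // e.g. 100903 for the Thrust 1.9.3 that ships with CUDA 10
    printf("THRUST_VERSION = %d (%d.%d.%d)\n", THRUST_VERSION,
           THRUST_VERSION / 100000, (THRUST_VERSION / 100) % 1000,
           THRUST_VERSION % 100);
    return 0;
}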
I don't have a minimal (i.e., Thrust-only) or reliable repro of this, but I did see this error somewhat frequently while working with cuML (specifically, the UMAP algorithm).
Based on what little I know, I would doubt it was because of the overflow issues mentioned in NVIDIA/cub@63e2ad4, mainly because:
- The error happens randomly, about 1 out of 20 times, but the cuML UMAP algorithm was run with a fixed PRNG seed of 0 each time (so in theory the algorithm itself should not introduce non-determinism to the process).
- The same error either does not happen at all or happens much less frequently than 1 out of 20 times when running with cuda-gdb attached, which is interesting. Could it be one of those really annoying heisenbugs? :/ Would cuda-gdb change overflow behavior or cause some overflow not to happen?
- I only have a trivial number of data points as input to the algorithm, and I'm sure the total number of bytes in the input or the output of the algorithm will not overflow an int32_t.
One other detail: this happens to both merge_sort and radix_sort in Thrust about equally frequently while I was playing around with the UMAP algorithm from cuML, i.e.,
merge_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument
radix_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument
I don't know whether this might be helpful for tracking down the issue.
If it's happening intermittently, you can try running your application through compute-sanitizer (or cuda-memcheck on older versions of CUDA) to check for various runtime issues. There may be a hidden race condition or bad memory access that causes the inconsistent failures.
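For example (./my_app is a placeholder for your binary; memcheck is compute-sanitizer's default tool):

compute-sanitizer ./my_app
compute-sanitizer --tool racecheck ./my_app
cuda-memcheck ./my_app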
@yitao-li, does it only happen when working with cuml, or are you also able to create a stand-alone reproducer? Which CUDA version are you on? What I'm reading, at least in your case, does remind me a bit of #1400 (comment)