FLAMEGPU/FLAMEGPU2

Linux Debug errors with CUDA 11.0

ptheywood opened this issue · 1 comment

CUDA 11.0 Debug builds on Linux are reporting errors at runtime.

This is not an issue in 11.1 or 11.3 (unsure on 11.2 right now).

This occurs for the spatial 3D boids example; unsure if others are affected (I assume so).

This has been confirmed to occur on 2 separate machines (mavericks, blackmass).

```
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address
terminate called after throwing an instance of 'CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address
```

This occurs during validateIDCollisions, in CUDA 11.0 only, for sufficiently large populations (for circles_spatial3d, somewhere between 4000 and 4300 agents on a Titan V?).

It is caught by the first CUDA error check after a call to cub::DeviceReduce::Sum, so it may just be a CUB issue, or an nvcc issue?

CUB in CUDA 11.0 should be 1.9.9, and in 11.1 it is 1.9.10, while we are explicitly using 1.10.0 for the improved CMake support.

According to cuda-memcheck, it's an Invalid __global__ write of size 16 in:

```
cub::DeviceReduceKernel<cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600, unsigned int*, unsigned int*, int, cub::Sum>(unsigned int, int, cub::Sum, cub::GridEvenShare<int>, cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600)
```

Via cuda-gdb, this occurs at agent_reduce.cuh:259, which is during a vectorised read.

This continues to occur if I bump the Thrust/CUB version to 1.13.0 via CMake.

vec_items is the variable being written to. It is a reinterpret_cast to VectorT* of an InputT[ITEMS_PER_THREAD], which is a local memory / register array depending on the compiler. So this feels like a compiler bug (especially as it is not an issue in more recent CUDA versions).

CUDA 10.0 and 10.2 don't experience this issue, so I'm calling it a CUDA 11.0-specific bug.