apt-sim/AdePT

RaytraceBenchmark segfaults in debug builds

Closed this issue · 6 comments

Here is the output log:

$ cuda-gdb --args BuildProducts/bin/RaytraceBenchmark -on_gpu 1
NVIDIA (R) CUDA Debugger
11.3 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from BuildProducts/bin/RaytraceBenchmark...
(cuda-gdb) run
Starting program: /home/amadio/src/adept/build/BuildProducts/bin/RaytraceBenchmark -on_gpu 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
INFO: using default trackML.gdml for option -gdml_name
INFO: using default 0 for option -cache_depth
(II) vgdml::Frontend::Load: VecGeom millimeter is 1
[Detaching after fork from child process 1468676]
[Detaching after fork from child process 1468683]
[New Thread 0x7fffdee06000 (LWP 1468684)]
[New Thread 0x7fffde605000 (LWP 1468685)]
[New Thread 0x7fffdde04000 (LWP 1468686)]
INFO: using default 1840 for option -px
INFO: using default 512 for option -py
INFO: using default 2 for option -model
INFO: using default 1 for option -view
INFO: using default 0 for option -reflection
INFO: using default 3.5 for option -zoom
INFO: using default -5000 for option -screenx
INFO: using default 0 for option -screeny
INFO: using default 0 for option -screenz
INFO: using default 0 for option -upx
INFO: using default 1 for option -upy
INFO: using default 0 for option -upz
INFO: using default -128 for option -bkgcol
INFO: using default 0 for option -use_tiles
INFO: using default 8 for option -block_size

Thread 1 "RaytraceBenchma" received signal SIGSEGV, Segmentation fault.
0x0000555555706e30 in adept::SparseVectorInterface<cuda::Ray_t>::SparseVectorInterface 
(this=0x7fff54000000, totalCapacity=4194304)
    at /home/amadio/src/adept/base/inc/AdePT/SparseVector.h:344
344	  __host__ __device__ SparseVectorInterface(size_t totalCapacity)

The configuration command I used was:

$ cmake -DCMAKE_BUILD_TYPE="Debug" -DCMAKE_CUDA_ARCHITECTURES="75" \
  -DCMAKE_CXX_STANDARD="17" -DCMAKE_PREFIX_PATH="/tmp/vecgeom" ~/src/adept

I reproduced the error only when I used the following configuration command:

cmake -DCMAKE_BUILD_TYPE="Debug" -DCMAKE_CUDA_ARCHITECTURES="75" -DVecCore_DIR=<path_veccore_installation>/lib/cmake/VecCore/ -DCMAKE_PREFIX_PATH="<path_vecgeom_installation>" ..

After I either added the VecCore installation path to CMAKE_PREFIX_PATH or set the VecCore_DIR and VecGeom_DIR variables, it worked.

Hi, can you clarify what you mean by "it worked"? The code builds for me, but running the example crashes. Do you mean that you reproduced the crash? Judging from the error message, I don't think it's related to VecCore. I am using a VecCore installed under /usr/include, so I don't need to add it to CMAKE_PREFIX_PATH.

When I used the configuration command from my last response, I received a segmentation fault as well.

I installed VecCore in /usr/local. I don't get a segmentation fault if I use the following configuration command:

cmake -DCMAKE_BUILD_TYPE="Debug" -DCMAKE_PREFIX_PATH="<vecgeom_path>" -DCMAKE_CUDA_FLAGS="--generate-code=arch=compute_75,code=[compute_75,sm_75]" ..

FWIW, I can reproduce this crash on my machine. One important detail I'd like to point out is that it's crashing on the host:

Thread 1 "RaytraceBenchma" received signal SIGSEGV, Segmentation fault.
0x00000000004a3a28 in adept::SparseVectorInterface<cuda::Ray_t>::SparseVectorInterface (this=0x7fff80000000, totalCapacity=4194304) at /home/jhahnfel/AdePT/src/base/inc/AdePT/SparseVector.h:344
344       __host__ __device__ SparseVectorInterface(size_t totalCapacity)
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-5.1.el8.x86_64 libstdc++-8.3.1-5.1.el8.x86_64 nvidia-driver-cuda-libs-460.32.03-1.el8.x86_64 xerces-c-3.2.2-3.el8.x86_64
(gdb) bt
#0  0x00000000004a3a28 in adept::SparseVectorInterface<cuda::Ray_t>::SparseVectorInterface (this=0x7fff80000000, totalCapacity=4194304) at /home/jhahnfel/AdePT/src/base/inc/AdePT/SparseVector.h:344
#1  0x00000000004a3535 in adept::SparseVector<cuda::Ray_t, 4194304u>::SparseVector (this=0x7fff80000000, totalCapacity=4194304) at /home/jhahnfel/AdePT/src/base/inc/AdePT/SparseVector.h:698
#2  0x00000000004a3024 in adept::SparseVector<cuda::Ray_t, 4194304u>::MakeInstanceAt (addr=0x7fff80000000) at /home/jhahnfel/AdePT/src/base/inc/AdePT/SparseVector.h:726
#3  0x00000000004a2025 in InitRTdata<(copcore::BackendType)1> (rtdata=0x7fffb4004000, volume_container=0x7fffb4000000, no_generations=1, logicalvolumes=std::vector of length 145, capacity 145 = {...})
    at /home/jhahnfel/AdePT/src/examples/Raytracer_Benchmark/RaytraceBenchmark.hpp:57
#4  0x00000000004a0d20 in runSimulation<(copcore::BackendType)1> (volume_container=0x7fffb4000000, world=0x7fffec3be850, logicalvolumes=std::vector of length 145, capacity 145 = {...}, argc=1, argv=0x7fffffffe038)
    at /home/jhahnfel/AdePT/src/examples/Raytracer_Benchmark/RaytraceBenchmark.hpp:153
#5  0x000000000049db76 in executePipelineGPU (volume_container=0x7fffb4000000, world=0x7fffec3be850, logicalvolumes=std::vector of length 145, capacity 145 = {...}, argc=1, argv=0x7fffffffe038)
    at /home/jhahnfel/AdePT/src/examples/Raytracer_Benchmark/RaytraceBenchmark.cu:216
#6  0x000000000041ed6a in main (argc=1, argv=0x7fffffffe038) at /home/jhahnfel/AdePT/src/examples/Raytracer_Benchmark/RaytraceBenchmark.cpp:85

If I look at InitRTdata, the logic seems seriously flawed to me when it is called with copcore::BackendType::CUDA:

template <copcore::BackendType backend>
void InitRTdata(RaytracerData_t *rtdata, const MyMediumProp *volume_container, int no_generations,
                std::vector<vecgeom::cxx::LogicalVolume *> logicalvolumes)
{
  Vector_t *x;
  if (backend == copcore::BackendType::CUDA) {
    initiliazeCudaWorld((RaytracerData_t *)rtdata, volume_container, logicalvolumes);
    COPCORE_CUDA_CHECK(cudaMalloc(&x, no_generations * sizeof(Vector_t)));
  } else {
    vecgeom::NavStateIndex vpstate;
    LoopNavigator::LocatePointIn(rtdata->fWorld, rtdata->fStart, vpstate, true);
    rtdata->fVPstate = vpstate;
    // COPCORE_CUDA_CHECK(cudaMallocManaged(&x, no_generations * sizeof(Vector_t)));
    x = (Vector_t *)malloc(no_generations * sizeof(Vector_t));
  }
  for (int i = 0; i < no_generations; ++i) {
    Vector_t::MakeInstanceAt(&x[i]);
  }
  rtdata->sparse_rays = x;
}

  1. For the CUDA backend it calls cudaMalloc on line 45, which allocates GPU memory for x.
  2. But line 57 then calls Vector_t::MakeInstanceAt(&x[i]) on the host, which eventually fails with a SEGV in this assembly code:
(gdb) disassemble 
Dump of assembler code for function adept::SparseVectorInterface<cuda::Ray_t>::SparseVectorInterface(unsigned long):
   0x00000000004a3a14 <+0>:     push   %rbp
   0x00000000004a3a15 <+1>:     mov    %rsp,%rbp
   0x00000000004a3a18 <+4>:     sub    $0x10,%rsp
   0x00000000004a3a1c <+8>:     mov    %rdi,-0x8(%rbp)
   0x00000000004a3a20 <+12>:    mov    %rsi,-0x10(%rbp)
   0x00000000004a3a24 <+16>:    mov    -0x8(%rbp),%rax
=> 0x00000000004a3a28 <+20>:    movq   $0x0,(%rax)
   0x00000000004a3a2f <+27>:    mov    -0x8(%rbp),%rax
   0x00000000004a3a33 <+31>:    movq   $0x0,0x8(%rax)
   0x00000000004a3a3b <+39>:    mov    -0x8(%rbp),%rax

I read this code (and the reported source location, which is the first line of the constructor) as zero-initializing the member variables. The fault makes sense: the host cannot write to device memory allocated with cudaMalloc by simply dereferencing the pointer.
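For completeness, the failure mode can be demonstrated outside AdePT. Below is a minimal sketch (not taken from the AdePT sources; Payload is a made-up stand-in for SparseVectorInterface) of a host-side placement new on memory obtained from cudaMalloc, which faults in exactly this way:

#include <cuda_runtime.h>
#include <new>

// Stand-in for SparseVectorInterface: the constructor zero-initializes a member.
struct Payload {
  size_t fCapacity = 0;
};

int main()
{
  Payload *ptr = nullptr;
  cudaMalloc(&ptr, sizeof(Payload)); // device allocation, like x in InitRTdata
  new (ptr) Payload{};               // host-side placement new: the CPU stores
                                     // through a device pointer -> SIGSEGV
  return 0;
}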

What I have no idea about (and am not interested enough to dig deeper into) is why it works on certain machines, and more interestingly why it works in Release mode. Maybe the compiler optimizes the bad code away, I don't know.

After discussing with @antoniopetre, I agree with him that calling MakeInstanceAt is actually fine. It might be that the called function is_device_pointer violates the one-definition rule and the linker picked the wrong definition. @amadio could you test whether #123 also fixes the issue for you?
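For reference, a host-side device-pointer check along the lines of is_device_pointer could look as sketched below using the CUDA runtime API; this is only an illustration of the idea (the helper name is made up), not necessarily AdePT's actual implementation:

#include <cuda_runtime.h>

// Sketch of a device-pointer query (hypothetical helper, not AdePT code).
inline bool looks_like_device_pointer(const void *ptr)
{
  cudaPointerAttributes attr;
  // Plain host allocations never seen by the CUDA runtime make
  // cudaPointerGetAttributes fail on older toolkits; treat that as "host".
  if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
    cudaGetLastError(); // clear the sticky error state
    return false;
  }
  return attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged;
}

If two translation units end up with different definitions of such a helper (for example one compiled host-only and one compiled by nvcc), the linker is free to keep either one, which would explain the wrong code path being taken.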

Yes, this does seem to fix the problem. Please go ahead and merge; we can deal with moving implementations to a source file (or not) later.