sandialabs/omega_h

cuda 11.4 compiler errors in thrust::transform_inclusive_scan

cwsmith opened this issue · 4 comments

Building omega_h 9.33.4 with cuda 11.4 and gcc 10.2 results in what appear to be thrust::transform_inclusive_scan errors.

Attempting to compile the regression test code from the 11.2 issue with nvcc -ccbin=g++-10 -arch=sm_75 dan.cu -o dan results in what appear to be the same errors (see below).

The thrust release notes (https://github.com/NVIDIA/thrust/releases) for the version included with CUDA 11.4, 1.12.0 mention "New asynchronous thrust::async:exclusive_scan and inclusive_scan algorithms have been added, and the synchronous versions of these have been updated to use cub::DeviceScan directly.". The error message is in cub code...

If you don't see anything obviously wrong with my attempt to compile then I'm guessing this should be posted as an issue in the Thrust repo. The issue in the thrust repo is here NVIDIA/thrust#1506.

CUDA/nvcc version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0

regression code compilation error

/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/../../agent/agent_scan.cuh(294): error: const variable "items" requires an initializer
          detected during:
            instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeTile<IS_LAST_TILE>(OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::AgentScanPolicy<128, 12, const LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, OutputIteratorT=LO *, ScanOpT=thrust::maximum<LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t, IS_LAST_TILE=false]" 
(355): here 
            instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeRange(OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &, int) [with AgentScanPolicyT=cub::AgentScanPolicy<128, 12, const LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, OutputIteratorT=LO *, ScanOpT=thrust::maximum<LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t]" 
/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_scan.cuh(131): here 
            instantiation of "void cub::DeviceScanKernel<ScanPolicyT,InputIteratorT,OutputIteratorT,ScanTileStateT,ScanOpT,InitValueT,OffsetT>(InputIteratorT, OutputIteratorT, ScanTileStateT, int, ScanOpT, InitValueT, OffsetT) [with ScanPolicyT=cub::AgentScanPolicy<128, 12, const LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, OutputIteratorT=LO *, ScanTileStateT=cub::ScanTileState<const LO, true>, ScanOpT=thrust::maximum<LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t]" 
/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_scan.cuh(396): here 
            instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Invoke<ActivePolicyT>() [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, OutputIteratorT=LO *, ScanOpT=thrust::maximum<LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t, SelectedPolicy=cub::DeviceScanPolicy<const LO>, ActivePolicyT=cub::DeviceScanPolicy<const LO>::Policy350]" 
/opt/cuda/bin/../targets/x86_64-linux/include/cub/block/../iterator/../util_device.cuh(706): here 
            instantiation of "cudaError_t cub::ChainedPolicy<PTX_VERSION, PolicyT, PolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=350, PolicyT=cub::DeviceScanPolicy<const LO>::Policy350, FunctorT=cub::DispatchScan<thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, LO *, thrust::maximum<LO>, cub::NullType, thrust::detail::int32_t, cub::DeviceScanPolicy<const LO>>]" 
/opt/cuda/bin/../targets/x86_64-linux/include/cub/block/../iterator/../util_device.cuh(689): here 
            [ 2 instantiation contexts not shown ]
            instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, OutputIteratorT=LO *, ScanOpT=thrust::maximum<LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t, SelectedPolicy=cub::DeviceScanPolicy<const LO>]" 
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/scan.h(78): here 
            instantiation of "OutputIt thrust::cuda_cub::detail::inclusive_scan_n_impl(thrust::cuda_cub::execution_policy<Derived> &, InputIt, Size, OutputIt, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, Size=std::ptrdiff_t, OutputIt=LO *, ScanOp=thrust::maximum<LO>]" 
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/scan.h(228): here 
            instantiation of "OutputIt thrust::cuda_cub::inclusive_scan_n(thrust::cuda_cub::execution_policy<Derived> &, InputIt, Size, OutputIt, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=thrust::cuda_cub::transform_input_iterator_t<const LO, LO *, thrust::identity<LO>>, Size=std::ptrdiff_t, OutputIt=LO *, ScanOp=thrust::maximum<LO>]" 
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/transform_scan.h(74): here 
            instantiation of "OutputIt thrust::cuda_cub::transform_inclusive_scan(thrust::cuda_cub::execution_policy<Derived> &, InputIt, InputIt, OutputIt, TransformOp, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=LO *, OutputIt=LO *, TransformOp=thrust::identity<LO>, ScanOp=thrust::maximum<LO>]" 
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/detail/transform_scan.inl(47): here 
            instantiation of "OutputIterator thrust::transform_inclusive_scan(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, InputIterator, OutputIterator, UnaryFunction, AssociativeOperator) [with DerivedPolicy=thrust::cuda_cub::par_t, InputIterator=LO *, OutputIterator=LO *, UnaryFunction=thrust::identity<LO>, AssociativeOperator=thrust::maximum<LO>]" 
dan.cu(23): here 

Omega_h CMake command

cmake ../omega_h \
  -DCMAKE_BUILD_TYPE=DEBUG \
  -DCMAKE_INSTALL_PREFIX=$PWD/install \
  -DBUILD_TESTING=on  \
  -DOmega_h_USE_CUDA=on \
  -DOmega_h_USE_MPI=off \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_CUDA_FLAGS='-arch=sm_75' \
  -DCMAKE_CXX_COMPILER=g++-10 \
  -DCMAKE_CUDA_HOST_COMPILER=g++-10

Omega_h compilation error

Note, this is just the initial portion of the full error that goes on for many pages (the 100 error limit is reached).

/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=g++-10 -Domega_h_EXPORTS -I/home/cwsmith/develop/omega_h/src -I/home/cwsmith/develop/buildOmegah/src -I/home/cwsmith/develop/omega_h/tpl -arch=sm_75 -g --generate-code=arch=compute_75,code=[compute_75,sm_75] -Xcompiler=-fPIC --compiler-options -W,-Wall,-Wextra,-Werror,-Wno-noexcept-type --Werror cross-execution-space-call,deprecated-declarations --expt-extended-lambda -std=c++14 -MD -MT src/CMakeFiles/omega_h.dir/Omega_h_int_scan.cpp.o -MF CMakeFiles/omega_h.dir/Omega_h_int_scan.cpp.o.d -x cu -c /home/cwsmith/develop/omega_h/
src/Omega_h_int_scan.cpp -o CMakeFiles/omega_h.dir/Omega_h_int_scan.cpp.o                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/../../agent/agent_scan.cuh(294): error: const variable "items" requires an initializer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
          detected during:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
            instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeTile<IS_LAST_TILE>(OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::AgentScanPolicy<128, 12, const Omega_h::LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const Omega_h::LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identit
y<Omega_h::LO>>, OutputIteratorT=Omega_h::LO *, ScanOpT=thrust::maximum<Omega_h::LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t, IS_LAST_TILE=false]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
(355): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
            instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeRange(OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &, int) [with AgentScanPolicyT=cub::AgentScanPolicy<128, 12, const Omega_h::LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const Omega_h::LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, Outpu
tIteratorT=Omega_h::LO *, ScanOpT=thrust::maximum<Omega_h::LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_scan.cuh(131): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
            instantiation of "void cub::DeviceScanKernel<ScanPolicyT,InputIteratorT,OutputIteratorT,ScanTileStateT,ScanOpT,InitValueT,OffsetT>(InputIteratorT, OutputIteratorT, ScanTileStateT, int, ScanOpT, InitValueT, OffsetT) [with ScanPolicyT=cub::AgentScanPolicy<128, 12, const Omega_h::LO, cub::BLOCK_LOAD_DIRECT, cub::LOAD_LDG, cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED, cub::BLOCK_SCAN_RAKING, cub::MemBoundScaling<128, 12, const Omega_h::LO>>, InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, OutputIteratorT=Omega_h::LO *, ScanTileStateT=cub::ScanTi
leState<const Omega_h::LO, true>, ScanOpT=thrust::maximum<Omega_h::LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
/opt/cuda/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_scan.cuh(396): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
            instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Invoke<ActivePolicyT>() [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, OutputIteratorT=Omega_h::LO *, ScanOpT=thrust::maximum<Omega_h::LO>, InitValueT=cub::NullType, OffsetT=thrust::detail::int32_t, SelectedPolicy=cub::DeviceScanPolicy<const Omega_h::LO>, ActivePolicyT=cub::DeviceScanPolicy<const Omega_h::LO>::Policy350]"                                                                                       
/opt/cuda/bin/../targets/x86_64-linux/include/cub/block/../iterator/../util_device.cuh(706): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
            instantiation of "cudaError_t cub::ChainedPolicy<PTX_VERSION, PolicyT, PolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=350, PolicyT=cub::DeviceScanPolicy<const Omega_h::LO>::Policy350, FunctorT=cub::DispatchScan<thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, Omega_h::LO *, thrust::maximum<Omega_h::LO>, cub::NullType, thrust::detail::int32_t, cub::DeviceScanPolicy<const Omega_h::LO>>]"                                                                                                                                                                        
/opt/cuda/bin/../targets/x86_64-linux/include/cub/block/../iterator/../util_device.cuh(689): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
            [ 3 instantiation contexts not shown ]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
            instantiation of "OutputIt thrust::cuda_cub::detail::inclusive_scan_n_impl(thrust::cuda_cub::execution_policy<Derived> &, InputIt, Size, OutputIt, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, Size=std::ptrdiff_t, OutputIt=Omega_h::LO *, ScanOp=thrust::maximum<Omega_h::LO>]"                                                                                                                                                                                                                                    
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/scan.h(228): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
            instantiation of "OutputIt thrust::cuda_cub::inclusive_scan_n(thrust::cuda_cub::execution_policy<Derived> &, InputIt, Size, OutputIt, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=thrust::cuda_cub::transform_input_iterator_t<const Omega_h::LO, Omega_h::LO *, thrust::identity<Omega_h::LO>>, Size=std::ptrdiff_t, OutputIt=Omega_h::LO *, ScanOp=thrust::maximum<Omega_h::LO>]"                                                                                                                                                                                                                                                 
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/transform_scan.h(74): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
            instantiation of "OutputIt thrust::cuda_cub::transform_inclusive_scan(thrust::cuda_cub::execution_policy<Derived> &, InputIt, InputIt, OutputIt, TransformOp, ScanOp) [with Derived=thrust::cuda_cub::par_t, InputIt=Omega_h::LO *, OutputIt=Omega_h::LO *, TransformOp=thrust::identity<Omega_h::LO>, ScanOp=thrust::maximum<Omega_h::LO>]"                                                                                                                                                                                                                                                                                                   
/opt/cuda/bin/../targets/x86_64-linux/include/thrust/detail/transform_scan.inl(47): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
            instantiation of "OutputIterator thrust::transform_inclusive_scan(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, InputIterator, OutputIterator, UnaryFunction, AssociativeOperator) [with DerivedPolicy=thrust::cuda_cub::par_t, InputIterator=Omega_h::LO *, OutputIterator=Omega_h::LO *, UnaryFunction=thrust::identity<Omega_h::LO>, AssociativeOperator=thrust::maximum<Omega_h::LO>]"                                                                                                                                                                                                                      
/home/cwsmith/develop/omega_h/src/Omega_h_scan.hpp(84): here                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
            instantiation of "OutputIterator Omega_h::transform_inclusive_scan(InputIterator, InputIterator, OutputIterator, BinaryOp, UnaryOp) [with InputIterator=Omega_h::LO *, OutputIterator=Omega_h::LO *, BinaryOp=Omega_h::maximum<Omega_h::LO>, UnaryOp=Omega_h::identity<Omega_h::LO>]"                                                                                                                                                                                                                                                                                                                                                          
/home/cwsmith/develop/omega_h/src/Omega_h_int_scan.cpp(32): here                      

@ibaned It looks like this issue persists in cuda 14.4 still when it seemed like it was fixed. The 11.2 issue references the thrust regression test here. Although curiously no template parameter is used in the regression test here for thrust::identity<>{}.

I gave this a try in Omega_h with my fork here which fixed the compile. However the amr_test fails with the error:

assertion edge_leaves == e failed at ~/omega_h/src/amr_test.cpp +188

I am not familiar with this test so I assume that the assertion should be true. If that's the case then something is broke in the thrust updates.

I just tried with the fork with Cuda 11.1 and it gives the error:

~/omega_h/src/Omega_h_reduce.hpp(64): error: too few arguments for class template "thrust::identity"

I am going to put together a minimal example to run by the thrust team to see what is going on.

I found another way around the issue that should be backward compatible with earlier Cuda versions. The new code is here. It is definitely not ideal (doing a negate transform before doing a negate transform/max scan as an alternative to using the identity transform/max scan only).

Although we still have an error with the amr_test which may not necessarily be due to this. I dug a bit further, and it looks like something beginning in Cuda 11.2.0 breaks the amr_test. I put print statements here to show the failing edge leaves in 3D (post-refinement) which gives the output below:

Failing values:
i, edge_leaves[i], e[i]
59, 1, 0
60, 0, 1
62, 1, 0
63, 0, 1
69, 1, 0
71, 0, 1
assertion edge_leaves == e failed at ~/omega_h/src/amr_test.cpp +195

I am not really sure how to go about debugging this without knowledge of the amr_test, but considering that the test works fine with Cuda 11.1 with the double negate transform scan, it appears that something in Cuda 11.2 has changed.

I wonder if directly using the CUB device scan methods would work:
https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

CUB github repo: https://github.com/NVlabs/cub