AMReX-Codes/amrex

AMReX CUDA issues with CUDA 11.6

zingale opened this issue · 14 comments

If you build amrex-tutorials/ExampleCodes/Amr/Advection_AmrCore with CUDA 11.6 and run it as:

./main2d.gnu.MPI.CUDA.ex inputs geometry.is_periodic=1 1

we get:

amrex::Abort::0::GPU last error detected in file /home/zingale/development/amrex//Src/Base/AMReX_GpuLaunchFunctsG.H line 834: invalid device function !!!
SIGABRT
See Backtrace.0 file for details

A similar issue is happening with Castro and the latest CUDA. This does not appear to be an issue with CUDA 11.4.

This happens for Quokka as well. It works fine with CUDA 11.4 and 11.5. For me, the failure appears to occur when the physical boundary functor is called, which only happens when the simulation is not fully periodic (quokka-astro/quokka#21). So it might be failing on a different device function.

@zingale Could you try to compile and run without MPI?

The test works for me with CUDA 11.6.0, with and without MPI. (I printed out __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ to make sure it was using 11.6.)
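
For reference, the check was just a print of those two macros; a minimal sketch (the macros are predefined by nvcc, so this must be compiled as CUDA source, e.g. nvcc -x cu check.cpp):

// sketch: verify at compile time which nvcc built the binary
#include <cstdio>

int main ()
{
    std::printf("Compiled with CUDA %d.%d\n",
                __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__);
    return 0;
}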

I made a mistake above: you need to test with geometry.is_periodic=0 0.

I get the error both with and without MPI.

Now I get the error too.

I can reproduce. I get this backtrace, which suggests it is also failing to compile the boundary functor correctly:

 0: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x707a6c]
    amrex::BLBackTrace::print_backtrace_info(_IO_FILE*)
/home/bwibking/amrex/Src/Base/AMReX_BLBackTrace.cpp:175:25

 1: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x707649]
    amrex::BLBackTrace::handler(int)
/home/bwibking/amrex/Src/Base/AMReX_BLBackTrace.cpp:85:7

 2: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x51242f]
    amrex::ParallelDescriptor::Abort(int, bool)
/home/bwibking/amrex/Src/Base/AMReX_ParallelDescriptor.cpp:206:21

 3: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x49ef53]
    amrex::Abort_host(char const*)
/home/bwibking/amrex/Src/Base/AMReX.cpp:246:1

 4: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x49ed33]
    amrex::Abort(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
/home/bwibking/amrex/Src/Base/AMReX.cpp:198:1

 5: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x414cd6]
    amrex::Gpu::ErrorCheck(char const*, int)
/home/bwibking/amrex/Src/Base/AMReX_GpuError.H:54:240

 6: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x481610]
    std::enable_if<amrex::MaybeDeviceRunnable<__nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<AmrCoreFill>::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &(void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)), 2u>, amrex::Box*, int const, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, AmrCoreFill const, amrex::GeometryData const, double const, int const>, void>::value, void>::type amrex::ParallelFor<long, __nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<AmrCoreFill>::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &(void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)), 2u>, amrex::Box*, int const, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, AmrCoreFill const, amrex::GeometryData const, double const, int const>, void>(amrex::Gpu::KernelInfo const&, long, __nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<AmrCoreFill>::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &(void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)), 2u>, amrex::Box*, int const, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, AmrCoreFill const, amrex::GeometryData const, double const, int const>&&)
/home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:834:23

 7: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x474a1d]
    void amrex::ParallelFor<long, __nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<AmrCoreFill>::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &(void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)), 2u>, amrex::Box*, int const, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, AmrCoreFill const, amrex::GeometryData const, double const, int const>, void>(long, __nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<AmrCoreFill>::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &(void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)), 2u>, amrex::Box*, int const, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, AmrCoreFill const, amrex::GeometryData const, double const, int const>&&)
/home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1270:1

 8: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x452681]
    void amrex::GpuBndryFuncFab<AmrCoreFill>::ccfcdoit<amrex::FilccCell>(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&)
/home/bwibking/amrex/Src/Base/AMReX_PhysBCFunct.H:393:24

 9: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x48052d]
    amrex::GpuBndryFuncFab<AmrCoreFill>::operator()(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int)
/home/bwibking/amrex/Src/Base/AMReX_PhysBCFunct.H:204:1

10: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x473e16]
    amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill> >::operator()(amrex::MultiFab&, int, int, amrex::IntVect const&, double, int)
/home/bwibking/amrex/Src/Base/AMReX_PhysBCFunct.H:177:1

11: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x44e824]
    std::enable_if<amrex::IsFabArray<amrex::MultiFab, void>::value, void>::type amrex::FillPatchSingleLevel<amrex::MultiFab, amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill> > >(amrex::MultiFab&, amrex::IntVect const&, double, amrex::Vector<amrex::MultiFab*, std::allocator<amrex::MultiFab*> > const&, amrex::Vector<double, std::allocator<double> > const&, int, int, int, amrex::Geometry const&, amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill> >&, int)
/home/bwibking/amrex/Src/AmrCore/AMReX_FillPatchUtil_I.H:159:1

12: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x43c320]
    std::enable_if<amrex::IsFabArray<amrex::MultiFab, void>::value, void>::type amrex::FillPatchSingleLevel<amrex::MultiFab, amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill> > >(amrex::MultiFab&, double, amrex::Vector<amrex::MultiFab*, std::allocator<amrex::MultiFab*> > const&, amrex::Vector<double, std::allocator<double> > const&, int, int, int, amrex::Geometry const&, amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill> >&, int)
/home/bwibking/amrex/Src/AmrCore/AMReX_FillPatchUtil_I.H:42:21

13: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x42a6da]
    AmrCoreAdv::FillPatch(int, double, amrex::MultiFab&, int, int)
..//Source/AmrCoreAdv.cpp:414:28

14: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x40f5b8]
    AmrCoreAdv::AdvancePhiAtLevel(int, double, double, int, int)
..//Source/AdvancePhiAtLevel.cpp:42:18

15: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x42bbed]
    AmrCoreAdv::timeStepWithSubcycling(int, double, int)
..//Source/AmrCoreAdv.cpp:579:14

16: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x42804c]
    AmrCoreAdv::Evolve()
..//Source/AmrCoreAdv.cpp:112:29

17: ./main2d.gnu.DEBUG.MPI.CUDA.ex() [0x49e091]
    main
..//Source/main.cpp:34:31

This seems to be a CUDA bug. The issue appears to involve a functor like the one below:

struct AmrCoreFill
{
    AMREX_GPU_DEVICE
    void operator() (const amrex::IntVect& /*iv*/, amrex::Array4<amrex::Real> const& /*data*/,
                     const int /*dcomp*/, const int /*numcomp*/,
                     amrex::GeometryData const& /*geom*/, const amrex::Real /*time*/,
                     const amrex::BCRec* /*bcr*/, const int /*bcomp*/,
                     const int /*orig_comp*/) const
    {
        // do something for external Dirichlet (BCType::ext_dir)
    }
};
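
For context, this functor is handed to AMReX's GPU boundary machinery roughly as follows (a sketch based on the tutorial; geom and bcs stand in for the level's Geometry and Vector<BCRec>):

// sketch: wiring the no-op functor into AMReX's GPU boundary filler
amrex::GpuBndryFuncFab<AmrCoreFill> gpu_bndry_func(AmrCoreFill{});
amrex::PhysBCFunct<amrex::GpuBndryFuncFab<AmrCoreFill>> physbc(geom, bcs, gpu_bndry_func);
// physbc(mf, 0, mf.nComp(), mf.nGrowVect(), time, 0) then fills the domain
// boundary cells of mf; this is the call chain that aborts, per the backtrace above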

Is there a bug report open with NVIDIA for this?

@maxpkatz Do you have any comments? I have tried to come up with a small reproducer. The best I have is the following.

$ git clone -b cuda16 https://github.com/WeiqunZhang/amrex.git
$ cd amrex/Tests/Amr/Advection_AmrCore/Exec/
$ make -j
$ ./main2d.gnu.CUDA.ex inputs

It produces:

Initializing CUDA...
CUDA initialized with 1 GPU
AMReX (20.08-1153-g0f96c2b0f591) initialized
Successfully read inputs file ... 

Coarse STEP 1 starts ...
[Level 0 step 1] ADVANCE with time = 0 dt = 0.01093983378

xxxxx void amrex::GpuBndryFuncFab<F>::operator()(const amrex::Box&, amrex::FArrayBox&, int, int, const amrex::Geometry&, amrex::Real, const amrex::Vector<amrex::BCRec>&, int, int) [with F = AmrCoreFill; amrex::Real = double]: 211


xxxxx void amrex::GpuBndryFuncFab<F>::operator()(const amrex::Box&, amrex::FArrayBox&, int, int, const amrex::Geometry&, amrex::Real, const amrex::Vector<amrex::BCRec>&, int, int) [with F = AmrCoreFill; amrex::Real = double]: 222


xxxxx void amrex::GpuBndryFuncFab<F>::operator()(const amrex::Box&, amrex::FArrayBox&, int, int, const amrex::Geometry&, amrex::Real, const amrex::Vector<amrex::BCRec>&, int, int) [with F = AmrCoreFill; amrex::Real = double]: 230


xxxxx void amrex::GpuBndryFuncFab<F>::ccfcdoit(const amrex::Box&, amrex::FArrayBox&, int, int, const amrex::Geometry&, amrex::Real, const amrex::Vector<amrex::BCRec>&, int, int, FF&&) [with FF = amrex::FilccCell; F = AmrCoreFill; amrex::Real = double]: 376


xxxxx void amrex::GpuBndryFuncFab<F>::ccfcdoit(const amrex::Box&, amrex::FArrayBox&, int, int, const amrex::Geometry&, amrex::Real, const amrex::Vector<amrex::BCRec>&, int, int, FF&&) [with FF = amrex::FilccCell; F = AmrCoreFill; amrex::Real = double]: 387

amrex::Abort::0::GPU last error detected in file ../../../../Src/Base/AMReX_GpuLaunchFunctsG.H line 834: invalid device function !!!
SIGABRT
See Backtrace.0 file for details

It dies at https://github.com/WeiqunZhang/amrex/blob/cuda16/Src/Base/AMReX_PhysBCFunct.H#L391. This looks like a bug in CUDA 11.6. A few observations:

  1. The functor is a no-op: void operator() (const amrex::Real) const {}. https://github.com/WeiqunZhang/amrex/blob/cuda16/Src/Base/AMReX_PhysBCFunct.H#L195
  2. A previous, almost identical kernel runs without errors. The only difference is that the successful kernel's lambda captures a local variable, whereas the failing kernel's lambda captures a function argument. https://github.com/WeiqunZhang/amrex/blob/cuda16/Src/Base/AMReX_PhysBCFunct.H#L383
  3. The parent function of the failing kernel is GpuBndryFuncFab<F>::ccfcdoit, which is called from GpuBndryFuncFab<F>::operator() (https://github.com/WeiqunZhang/amrex/blob/cuda16/Src/Base/AMReX_PhysBCFunct.H#L203). The identical code there, right before the call to ccfcdoit, does not produce any errors.
  4. We have seen similar errors in WarpX as well.

@WeiqunZhang This is the minimal reproducer:

#include <iostream>

template <class F>
__global__ void kernel (F f) { f(); }

struct s1
{
    __device__
    void operator() (const double t) const {} // no-op device functor
};

struct s2 {}; // unused; left over from reducing the AMReX code

template <class F>
class c
{
public:
    c (F const& f) {}

    template <class FF>
    void
    operator() (const double t, FF&& ff)
    {
        s1 s{};

        const double t1 = t; // local copy of the function argument

        kernel<<<1, 1>>>([=] __device__ () { s(t1); }); // capturing the local copy: works
        cudaDeviceSynchronize();
        std::cout << "CUDA return value: " << cudaGetErrorString(cudaGetLastError()) << std::endl;

        kernel<<<1, 1>>>([=] __device__ () { s(t); }); // capturing the function argument: invalid device function
        cudaDeviceSynchronize();
        std::cout << "CUDA return value: " << cudaGetErrorString(cudaGetLastError()) << std::endl;
    }
};

int main()
{
    c<s2> c1(s2{});
    c1(0.0, s2{});
}
$ nvcc -run -x cu --extended-lambda --std=c++14 test.cpp
CUDA return value: no error
CUDA return value: invalid device function
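
The reproducer also suggests a workaround until the compiler is fixed: copy any by-value function parameter into a local variable and capture that instead. A sketch, reusing the definitions above (not an official AMReX patch):

    template <class FF>
    void operator() (const double t, FF&& ff)
    {
        s1 s{};
        const double t_local = t; // local copy of the parameter
        kernel<<<1, 1>>>([=] __device__ () { s(t_local); }); // no parameter is captured, so this runs under CUDA 11.6
        cudaDeviceSynchronize();
    }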

I will discuss it with the NVIDIA compiler team.

I encountered the same issue. Thanks, @maxpkatz, for confirming the problem.

@maxpkatz Do you have any updates?

Yes, this will be fixed in the next CUDA minor version release.