lammps/lammps

[BUG] Segmentation fault in Kokkos MPI/GPU

alphataubio opened this issue · 7 comments

Summary

Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x30500d680)

LAMMPS Version and Platform

Large-scale Atomic/Molecular Massively Parallel Simulator - 21 Nov 2023 - Development
Git info (cmap-fixes-for-charmm-gui / patch_21Nov2023-369-g695a81ef70)
OS: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.88.1.el7.x86_64 x86_64
Compiler: GNU C++ 11.3.0 with OpenMP 4.5
C++ standard: C++17
MPI v3.1: Open MPI v4.1.4, package: Open MPI ebuser@build-node.computecanada.ca Distribution, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
Accelerator configuration:
KOKKOS package API: CUDA OpenMP
KOKKOS package precision: double
Kokkos library version: 4.2.0
Active compile time flags:
-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
Installed packages:
COLVARS EXTRA-DUMP EXTRA-PAIR KOKKOS KSPACE MOLECULE RIGID

Expected Behavior

LAMMPS should not crash.

Actual Behavior

[gra984:11459:0:11459] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x30500d680)
BFD: DWARF error: can't find .debug_ranges section.
BFD: DWARF error: can't find .debug_ranges section.
[...]
[gra984:11458:0:11458] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x305077680)
BFD: DWARF error: can't find .debug_ranges section.
BFD: DWARF error: can't find .debug_ranges section.
BFD: DWARF error: can't find .debug_ranges section.
BFD: DWARF error: can't find .debug_ranges section.
[...]
==== backtrace (tid: 11459) ====
0 0x00000000000130f0 __funlockfile() :0
1 0x0000000000038e5c ucp_dt_contig_pack() ???:0
2 0x0000000000039709 ucp_dt_pack() ???:0
3 0x000000000005f298 ucp_tag_pack_eager_first_dt() eager_snd.c:0
4 0x00000000000184ae uct_mm_ep_am_bcopy() ???:0
5 0x000000000005e8df ucp_tag_eager_bcopy_multi() eager_snd.c:0
6 0x0000000000067bb5 ucp_tag_send_nbx() ???:0
7 0x0000000000005a52 mca_pml_ucx_send() ???:0
8 0x00000000000af2bb MPI_Send() ???:0
9 0x0000000001dc5bd5 LAMMPS_NS::Grid3dKokkos&lt;Kokkos::Cuda&gt;::reverse_comm_kspace_brick() ???:0
10 0x00000000011d0701 LAMMPS_NS::PPPMKokkos&lt;Kokkos::Cuda&gt;::compute() ???:0
=================================
[gra984:11459] *** Process received signal ***
[gra984:11459] Signal: Segmentation fault (11)
[gra984:11459] Signal code: (-6)
[gra984:11459] Failing at address: 0x2fce3100002cc3
[gra984:11459] [ 0] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libpthread.so.0(+0x130f0)[0x2afb11d5f0f0]
[gra984:11459] [ 1] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.12.1/lib/libucp.so.0(ucp_dt_contig_pack+0x4c)[0x2afb17d62e5c]
[gra984:11459] [ 2] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.12.1/lib/libucp.so.0(ucp_dt_pack+0x69)[0x2afb17d63709]

Steps to Reproduce

salloc --time=1:0:0 --mem-per-cpu=2G --gpus=p100:1 --ntasks=2 --account=<~~~>
module purge; module load arch/avx2 StdEnv/2020 gcc/11.3.0 cuda/11.8.0 openmpi/4.1.4
export OMP_PROC_BIND=spread OMP_PLACES=threads
mpirun -np 2 ~/.local/bin/lmp -k on g 1 -sf kk -in step4.0_minimization.inp

Further Information, Files, and Links

3ft6.tgz

Some hypotheses / things I'm going to try and/or investigate after lunch:

(1) compile warnings:

/lammps/src/COLVARS/colvarproxy_lammps.h(27): warning #611-D: overloaded virtual function "colvarproxy_atoms::init_atom" is only partially overridden in class "colvarproxy_lammps"
/lammps/src/COLVARS/colvarproxy_lammps.h(27): warning #611-D: overloaded virtual function "colvarproxy_atoms::check_atom_id" is only partially overridden in class "colvarproxy_lammps"
[...]

ptxas warning : Stack size for entry function '_ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_11ParallelForI16kiss_fft_functorINS_4CudaEENS_11RangePolicyIJS4_EEES4_EEEEvT_' cannot be statically determined

(2) runtime warnings:

WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)
WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:215)

(3) compile Kokkos with host arch HSW instead of BDW, even though the CPUs are "2 x Intel E5-2683 v4 Broadwell @ 2.1GHz"; maybe the "transactional mem" is the problem:

HSW | HOST | Intel Haswell CPU (AVX 2)
BDW | HOST | Intel Broadwell Xeon E-class CPU (AVX 2 + transactional mem)

(4) compile Kokkos with GPU arch PASCAL61 instead of PASCAL60 for the GPUs "2 x NVIDIA P100 Pascal (12GB HBM2 memory)":

PASCAL60 | GPU | NVIDIA Pascal generation CC 6.0 GPU
PASCAL61 | GPU | NVIDIA Pascal generation CC 6.1 GPU
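Hypotheses (3) and (4) come down to build-time Kokkos architecture flags. A hedged sketch of the corresponding CMake configure line for a LAMMPS Kokkos build (the `Kokkos_ARCH_*` and `Kokkos_ENABLE_*` names are real Kokkos/LAMMPS CMake options; the rest is an assumed minimal example, not this cluster's actual recipe):

```
# Illustrative configure for a Broadwell host + Pascal CC 6.0 P100 GPU.
# Swap in -DKokkos_ARCH_HSW=on or -DKokkos_ARCH_PASCAL61=on to test
# hypotheses (3) and (4).
cmake ../cmake \
  -DPKG_KOKKOS=on \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_OPENMP=on \
  -DKokkos_ARCH_BDW=on \
  -DKokkos_ARCH_PASCAL60=on
```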

(5) compile Kokkos with -DKokkos_ENABLE_DEBUG=on

(6) try building OpenMPI from scratch instead of loading the cluster module, or try "-pk kokkos gpu/aware off" (https://docs.lammps.org/Speed_kokkos.html):

CUDA and MPI library compatibility

Kokkos with CUDA currently implicitly assumes that the MPI library is GPU-aware. This is not always the case, especially when using pre-compiled MPI libraries provided by a Linux distribution. This is not a problem when using only a single GPU with a single MPI rank. When running with multiple MPI ranks, you may see segmentation faults without GPU-aware MPI support. These can be avoided by adding the flags -pk kokkos gpu/aware off to the LAMMPS command line or by using the command package kokkos gpu/aware off in the input file.
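The same setting can also go in the input script itself; a minimal sketch (only needed when the MPI library is not GPU-aware, and the `package` command must appear near the top of the script, before the simulation is set up):

```
# Disable GPU-aware MPI in the KOKKOS package when the MPI library
# was not built with CUDA support
package kokkos gpu/aware off
```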

Does it run OK on 1 MPI rank/GPU? Looking at the stack trace, this is almost certainly related to GPU-aware MPI.

Yes, my fault; I sincerely apologize @stanmoore1 for not RTFM. I didn't read the "CUDA and MPI library compatibility" blue box in the docs correctly. I thought the problem was having multiple GPUs per MPI rank (e.g. 2 MPI processes and 4 GPUs, ...), not the other way around.

mpirun -np 2 ~/.local/bin/lmp -in step4.0_minimization.inp -k on g 1

Segmentation fault in the openmpi/4.1.4 library

mpirun -np 1 ~/.local/bin/lmp -in step4.0_minimization.inp -k on g 1

OK

mpirun -np 2 ~/.local/bin/lmp -in step4.0_minimization.inp -k on g 2

OK

mpirun -np 2 ~/.local/bin/lmp -in step4.0_minimization.inp -k on g 1 -sf kk -pk kokkos gpu/aware off

OK

mpirun -np 2 ~/.local/bin/lmp -in step4.0_minimization.inp -k on g 2 -sf kk -pk kokkos gpu/aware off

OK

My available cluster node types are (I don't use the V100 or A100 nodes; I leave those for others):
(cedar) 2 CPUs, 24 cores, 2 threads per core --- 4 x P100, two GPUs per CPU socket connected via PCIe
(graham) 2 CPUs, 32 cores, 2 threads per core --- 2 x P100, one GPU per CPU socket connected via PCIe

salloc --time=59:00 --mem-per-cpu=2G --gpus=p100:2 --ntasks=32 --account=<~~~>
module load StdEnv/2023 cudacore/.12.2.2 nvhpc/23.9 ucx-cuda/1.14.1 openmpi/4.1.5
export LD_LIBRARY_PATH=/cvmfs/restricted.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/nvhpc/23.9/Linux_x86_64/23.9/REDIST/compilers/lib
mpirun -np 32 ~/.local/bin/lmp -in step5_production.inp -k on g 2 -sf kk

OK

What are the following warnings about?

WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)

WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:215)

Suggestion: detect this non-GPU-aware MPI library issue instead of leaving it implicit, and fail with an error message like "hey buddy please go read https://docs.lammps.org/Speed_kokkos.html".

What are the following warnings about?

WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)

WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:215)

It means you have some extra CPU <--> GPU data movement because some styles are not yet ported to use Kokkos. Your simulation will give the correct answer but may be slower than it could be without the data movement. However, exchange/border comm only happens every reneighbor, and sorting only every 1000 timesteps by default, so it should be a small effect because it is amortized over many timesteps. In other words, safe to ignore.
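The two frequencies mentioned above are controlled from the input script with standard LAMMPS commands; a hedged sketch (the values shown are illustrative and believed to match the documented defaults):

```
# Reneighboring cadence: check every step whether a rebuild is needed,
# but wait at least 10 steps between neighbor-list rebuilds
neigh_modify every 1 delay 10 check yes

# Spatial sort of atoms every 1000 steps; a bin size of 0.0 selects
# the default binning
atom_modify sort 1000 0.0
```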

Suggestion: detect this non-GPU-aware MPI library issue instead of leaving it implicit, and fail with an error message like "hey buddy please go read https://docs.lammps.org/Speed_kokkos.html".

It should give a warning:
https://github.com/lammps/lammps/blob/develop/src/KOKKOS/kokkos.cpp#L289C30-L290C67

"Turning off GPU-aware MPI since it is not detected, "
                       "use '-pk kokkos gpu/aware on' to override"

Or

https://github.com/lammps/lammps/blob/develop/src/KOKKOS/kokkos.cpp#L348-L350

"Kokkos with GPU-enabled backend assumes GPU-aware MPI is available,"
                       " but cannot determine if this is the case\n         try"
                       " '-pk kokkos gpu/aware off' if getting segmentation faults");

Did you not get a warning?

Did you not get a warning?

No, here's the full stderr if you care to see it. Maybe it was in stdout, but I don't remember seeing it.

log.lammps.stderr.txt

Now my LAMMPS is properly built with CUDA-aware UCX/OpenMPI, so I can't try it again, sorry.

But thanks again anyway; I always appreciate your efforts.