NERSC/buildtest-nersc

[Bug]: Trilinos test unable to link to GTL library

CDASH Build

https://my.cdash.org/test/66436454

Link to buildspec file

https://github.com/buildtesters/buildtest-nersc/blob/devel/buildspecs/e4s/E4S-Testsuite/perlmutter/22.05/trilinos.yml

Please describe the issue?

@wspear Multiple users have reported this error, and it is partly documented at https://docs.nersc.gov/development/compilers/wrappers/#set-the-accelerator-target-to-gpus-for-cuda-aware-mpi-on-perlmutter

The gpu module is loaded by default. It loads the craype-accel-nvidia80 modulefile and sets the environment variable MPICH_GPU_SUPPORT_ENABLED:

 ~/ ml show gpu
---------------------------------------------------------------------------------------------------------------------------------------------------------
   /global/common/software/nersc/pm-2022.08.4/extra_modulefiles/gpu/1.0.lua:
---------------------------------------------------------------------------------------------------------------------------------------------------------
family("hardware")
load("cudatoolkit")
load("craype-accel-nvidia80")
setenv("MPICH_GPU_SUPPORT_ENABLED","1")

I don't know if it's worth trying to load the cpu modulefile instead, considering this test builds against trilinos without cuda support. If we want to test trilinos with cuda, I guess we should run the https://github.com/E4S-Project/testsuite/tree/master/validation_tests/trilinos-cuda test?
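One quick way to confirm this kind of mismatch is to compare the runtime setting with what the binary is actually linked against. The following is a minimal sketch, not a verified diagnostic: it assumes the Cray MPICH GTL library shows up in `ldd` output under a name containing `libmpi_gtl` (for example `libmpi_gtl_cuda`), and the function name `gtl_status` is hypothetical.

```shell
#!/bin/sh
# gtl_status: read `ldd` output on stdin and report whether the runtime
# setting and the link line agree.
# Assumption: the GTL library appears in ldd output as "libmpi_gtl...".
gtl_status() {
    if grep -q 'libmpi_gtl'; then
        linked=yes
    else
        linked=no
    fi
    # GPU support requested at runtime, but no GTL library on the link line:
    # this is the combination that produces the MPIDI_CRAY_init abort.
    if [ "${MPICH_GPU_SUPPORT_ENABLED:-0}" = "1" ] && [ "$linked" = "no" ]; then
        echo "mismatch: GPU support requested but GTL not linked"
        return 1
    fi
    echo "ok: gpu_support=${MPICH_GPU_SUPPORT_ENABLED:-0} gtl_linked=$linked"
}

# Usage on a compute node (the Zoltan binary is the one built in the log):
#   ldd ./build/Zoltan | gtl_status
```

A non-zero return status means the environment asks for GPU-aware MPI while the executable was linked without it, which matches a trilinos~cuda build run under the default gpu module.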

Relevant log output

trilinos~cuda %gcc: 2mphikm
Cleaning /global/cfs/cdirs/m3503/buildtest/runs/perlmutter_scheduled_test/2022-11-07/perlmutter.slurm.regular/trilinos/trilinos_e4s_testsuite_22.05/b8041801/stage/testsuite/validation_tests/trilinos
---CLEANUP LOG---
Compiling /global/cfs/cdirs/m3503/buildtest/runs/perlmutter_scheduled_test/2022-11-07/perlmutter.slurm.regular/trilinos/trilinos_e4s_testsuite_22.05/b8041801/stage/testsuite/validation_tests/trilinos
---COMPILE LOG---
Skipping load: Environment already setup
+ mkdir -p build
+ cd build
+ CUDADEF=
+ '[' 0 -eq 1 ']'
+ cmake ..

Found Trilinos!  Here are the details: 
   Trilinos_DIR = /global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/trilinos-13.0.1-2mphikmel6uw6sjphv46xurbmswyvmzx/lib/cmake/Trilinos
   Trilinos_VERSION = 13.0
   Trilinos_PACKAGE_LIST = TrilinosCouplings;Piro;ROL;Stokhos;Tempus;Rythmos;ShyLU;ShyLU_DD;ShyLU_DDFROSch;ShyLU_DDBDDC;Zoltan2;MueLu;NOX;Phalanx;STK;STKExprEval;STKTransfer;STKSearchUtil;STKSearch;STKMesh;STKTopology;STKSimd;STKUtil;STKMath;Intrepid2;Intrepid;Teko;Stratimikos;Ifpack2;Anasazi;Amesos2;ShyLU_Node;Belos;ML;Ifpack;Zoltan2Core;Amesos;Galeri;AztecOO;Isorropia;Xpetra;Thyra;ThyraTpetraAdapters;ThyraEpetraExtAdapters;ThyraEpetraAdapters;ThyraCore;TrilinosSS;Tpetra;TpetraCore;TpetraTSQR;TpetraClassic;EpetraExt;Triutils;Shards;Zoltan;Epetra;MiniTensor;Sacado;RTOp;KokkosKernels;Teuchos;TeuchosKokkosComm;TeuchosKokkosCompat;TeuchosRemainder;TeuchosNumerics;TeuchosComm;TeuchosParameterList;TeuchosParser;TeuchosCore;Kokkos;KokkosAlgorithms;KokkosContainers;KokkosCore
   Trilinos_LIBRARIES = ml;ifpack;amesos;isorropia;trilinosss;zoltan;galeri-xpetra;galeri-epetra;xpetra-sup;xpetra;thyratpetra;thyraepetraext;thyraepetra;thyracore;tpetraext;tpetrainout;tpetra;kokkostsqr;tpetraclassiclinalg;tpetraclassicnodeapi;tpetraclassic;epetraext;triutils;rtop;kokkoskernels;kokkosalgorithms;kokkoscontainers;aztecoo;epetra;teuchoskokkoscomm;teuchoskokkoscompat;teuchosremainder;teuchosnumerics;teuchoscomm;teuchosparameterlist;teuchosparser;teuchoscore;kokkoscore
   Trilinos_INCLUDE_DIRS = /global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/trilinos-13.0.1-2mphikmel6uw6sjphv46xurbmswyvmzx/include
   Trilinos_LIBRARY_DIRS = /global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/trilinos-13.0.1-2mphikmel6uw6sjphv46xurbmswyvmzx/lib
   Trilinos_TPL_LIST = DLlib;SuperLUDist;Zlib;ParMETIS;METIS;Boost;LAPACK;BLAS;MPI;HWLOC
   Trilinos_TPL_INCLUDE_DIRS = /global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/superlu-dist-7.2.0-mozwjf33ihxdsqn5zv2d5llovrg2zotp/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/zlib-1.2.12-ozmcyfjfv7i5gjjgklfsh43h67vzsuc5/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/parmetis-4.0.3-r7ltmqs2igjzfmv7fhtg67x7vflmd47o/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/metis-5.1.0-iawwq32vzsvijnkdegvxs6fcinz6s5pp/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/boost-1.79.0-ywavkcteainr2nmzk3g7w7negn2alpbm/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/include;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/include
   Trilinos_TPL_LIBRARIES = /usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/superlu-dist-7.2.0-mozwjf33ihxdsqn5zv2d5llovrg2zotp/lib/libsuperlu_dist.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/zlib-1.2.12-ozmcyfjfv7i5gjjgklfsh43h67vzsuc5/lib/libz.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/parmetis-4.0.3-r7ltmqs2igjzfmv7fhtg67x7vflmd47o/lib/libparmetis.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/metis-5.1.0-iawwq32vzsvijnkdegvxs6fcinz6s5pp/lib/libmetis.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/metis-5.1.0-iawwq32vzsvijnkdegvxs6fcinz6s5pp/lib/libmetis.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so;/usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so;/usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05
/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so;/usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so;/usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5kxw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so;/usr/lib64/libdl.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/openblas-0.3.20-kpue5bnywwglz4wnssagsb7wko2mpamg/lib/libopenblas.so;/global/common/software/spackecp/perlmutter/e4s-22.05/73973/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/hwloc-2.7.1-qkmm5k
xw4bu7hmirkzc566atvkfehkfb/lib/libhwloc.so
   Trilinos_TPL_LIBRARY_DIRS = 
   Trilinos_BUILD_SHARED_LIBS = ON
End of Trilinos details

-- The C compiler identification is GNU 11.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/mpich/8.1.15/ofi/gnu/9.1/bin/mpicc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- The CXX compiler identification is GNU 11.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/mpich/8.1.15/ofi/gnu/9.1/bin/mpicxx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The Fortran compiler identification is GNU 11.2.0
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/cray/pe/mpich/8.1.15/ofi/gnu/9.1/bin/mpif90 - skipped
-- Configuring done
-- Generating done
-- Build files have been written to: /global/cfs/cdirs/m3503/buildtest/runs/perlmutter_scheduled_test/2022-11-07/perlmutter.slurm.regular/trilinos/trilinos_e4s_testsuite_22.05/b8041801/stage/testsuite/validation_tests/trilinos/build
+ make
[ 50%] Building CXX object CMakeFiles/Zoltan.dir/app.cpp.o
[100%] Linking CXX executable Zoltan
[100%] Built target Zoltan
+ cd -
/global/cfs/cdirs/m3503/buildtest/runs/perlmutter_scheduled_test/2022-11-07/perlmutter.slurm.regular/trilinos/trilinos_e4s_testsuite_22.05/b8041801/stage/testsuite/validation_tests/trilinos
Running /global/cfs/cdirs/m3503/buildtest/runs/perlmutter_scheduled_test/2022-11-07/perlmutter.slurm.regular/trilinos/trilinos_e4s_testsuite_22.05/b8041801/stage/testsuite/validation_tests/trilinos
Skipping load: Environment already setup
+ cd ./build
+ export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
+ CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
+ export OMP_NUM_THREADS=4
+ OMP_NUM_THREADS=4
+ srun -n 8 ./Zoltan
MPICH ERROR [Rank 0] [job id 3609894.0] [Mon Nov  7 18:29:54 2022] [nid003429] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
 (Other MPI error)

aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

srun: error: nid003429: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=3609894.0
srun: error: nid003429: tasks 1-7: Segmentation fault
Run failed

This error doesn't occur when I run the test manually on a compute node. Should we try to figure out what the difference in environment is before trying to adjust the buildspec?
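A rough sketch of how the two environments could be compared: snapshot `env` in each context and list the variables that appear in only one of them. The function name `env_only_in` and the file names are hypothetical, and multi-line variable values will show up as separate lines.

```shell
#!/bin/sh
# env_only_in: print the NAME=value lines that are present in the first
# snapshot but absent (or set to a different value) in the second.
env_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -23 keeps only the lines unique to the first sorted file
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage sketch:
#   env > manual.env       # in the interactive session where the test passes
#   env > buildtest.env    # added temporarily to the test's run section
#   env_only_in buildtest.env manual.env | grep -E 'MPICH|CRAY|LOADEDMODULES'
```

Filtering for `MPICH`, `CRAY`, and `LOADEDMODULES` is only a starting point; the full output is worth a look if those show no difference.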

Yes, I think that would be the best way to troubleshoot this issue. It would be better to run the test manually as the e4s user, to rule out any environment differences between your account and the e4s user.

To log in as the e4s user, run sshproxy -c e4s from your laptop and then connect to perlmutter with that username. I have a handy alias to help with login.

 ~/ alias perlmutter-e4s
perlmutter-e4s='ssh -i ~/.ssh/e4s e4s@perlmutter-p1.nersc.gov'

@wspear It looks like we fixed this issue in https://software.nersc.gov/NERSC/buildtest-nersc/-/commit/d1522e37536562541b56937fc4fe6832e854f660, but we have a failure in E4S 22.11 (https://my.cdash.org/test/81709544) with a similar error. Note that the change you made was for the 22.05 stack. The trilinos buildspec for 22.11, https://software.nersc.gov/NERSC/buildtest-nersc/-/blob/devel/buildspecs/e4s/E4S-Testsuite/perlmutter/22.11/trilinos.yml, doesn't have the module load cpu line; adding it should perhaps fix the issue.
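The fix discussed in this thread is adding module load cpu to the buildspec. Where changing the loaded modules is undesirable, a possible alternative is to clear the variable only for the launch command. This is an assumption on my part, not something confirmed in this thread, and the wrapper name `run_without_gpu_mpi` is hypothetical:

```shell
#!/bin/sh
# run_without_gpu_mpi: run the given command with MPICH_GPU_SUPPORT_ENABLED
# removed from its environment. The subshell keeps the caller's own
# environment untouched.
# Assumption: clearing this variable is enough for a binary that was built
# without CUDA support and so is not linked against the GTL library.
run_without_gpu_mpi() {
    (
        unset MPICH_GPU_SUPPORT_ENABLED
        "$@"
    )
}

# Usage sketch, mirroring the launch line from the log:
#   run_without_gpu_mpi srun -n 8 ./Zoltan
```

Because the change is scoped to a single command, other tests in the same session that do need GPU-aware MPI keep the default setting.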