Alpine-DAV/ascent

build_ascent.sh failures


On the latest develop branch (e0100bf5), running env enable_mpi=ON install_dir=/path/to/install build_jobs=10 ./scripts/build_ascent/build_ascent.sh fails during the Ascent configure step with the following error:

**** Creating Ascent host-config (ascent-config.cmake)
**** Configuring Ascent
loading initial cache file /lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/ascent-config.cmake
-- The C compiler identification is Clang 18.1.6
-- The CXX compiler identification is Clang 18.1.6
-- Cray Programming Environment 2.7.32 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.32/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.32 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.32/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at cmake/SetupBLT.cmake:43 (include):
  include could not find requested file:

    blt/SetupBLT.cmake
Call Stack (most recent call first):
  CMakeLists.txt:119 (include)


CMake Error at cmake/SetupBLT.cmake:69 (message):
  Cannot use CMake imported targets for MPI.(ENABLE_MPI == ON, but
  MPI::MPI_CXX CMake target is missing.)
Call Stack (most recent call first):
  CMakeLists.txt:119 (include)


-- Configuring incomplete, errors occurred!
See also "/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/ascent-checkout/CMakeFiles/CMakeOutput.log".

@mlohry for the Cray compiler wrappers, we have to tell CMake that MPI is already provided by the wrappers (it "magically works") and that it shouldn't go looking for MPI itself.
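
For reference, a rough sketch of what that means in practice, using the wrapper paths from the log above. ENABLE_FIND_MPI is a BLT option, but the exact cache entries the build script writes into its host-config may differ, so treat this as illustrative:

# hedged sketch: point CMake at the Cray wrappers and skip FindMPI,
# since cc/CC already compile and link MPI programs on their own
cmake \
  -C ascent-config.cmake \
  -DCMAKE_C_COMPILER=/opt/cray/pe/craype/2.7.32/bin/cc \
  -DCMAKE_CXX_COMPILER=/opt/cray/pe/craype/2.7.32/bin/CC \
  -DENABLE_MPI=ON \
  -DENABLE_FIND_MPI=OFF \
  /path/to/ascent/src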

I am working on a path that uses the Cray compiler wrappers as well as the MPICH MPI compiler wrappers for Frontier with the new modules.

Can you confirm what modules would be ideal for your case?

In the short term, really any CPU build with a working replay_mpi on Frontier would be useful for looking at some large Blueprint saves.

The above error was from the default Frontier modules. The modules used for our solver case on Frontier are these:

module load craype-x86-trento perftools-base/24.07.0 libfabric/1.20.1 cpe/24.07 craype-network-ofi rocm/6.2.0 xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta gcc-native/13.2 Core/24.00 craype/2.7.32 tmux/3.2a cray-dsmml/0.3.0 hsi/default cray-mpich/8.1.30 lfs-wrapper/0.0.1 cray-libsci/24.07.0 DefApps PrgEnv-gnu/8.5.0 cray-pmi/6.1.15.21 craype-accel-amd-gfx90a

So ideally those (GCC 13), if that prevents issues. Running with that set:

[mlohry@login07.frontier ascent]$ module list

Currently Loaded Modules:
  1) craype-x86-trento        5) craype-network-ofi                     9) Core/24.00         13) DefApps            17) cray-libsci/23.12.5      21) cmake/3.23.2
  2) perftools-base/24.07.0   6) rocm/6.2.0                            10) tmux/3.2a          14) craype/2.7.31.11   18) PrgEnv-gnu/8.5.0
  3) cpe/24.07                7) xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta  11) hsi/default        15) cray-dsmml/0.2.2   19) cray-pmi/6.1.15.21
  4) libfabric/1.20.1         8) gcc-native/13.2                       12) lfs-wrapper/0.0.1  16) cray-mpich/8.1.28  20) craype-accel-amd-gfx90a

Inactive Modules:
  1) darshan-runtime

HDF5 then fails during the build because its generated helper programs can't load a shared library:

/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5make_libsettings: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-static.dir/build.make:85: src/gen_SRCS.stamp2] Error 127
gmake[2]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5detect: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-static.dir/build.make:80: src/gen_SRCS.stamp1] Error 127
gmake[1]: *** [CMakeFiles/Makefile2:2042: src/CMakeFiles/gen_hdf5-static.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5make_libsettings: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-shared.dir/build.make:97: src/gen_SRCS.stamp2] Error 127
gmake[2]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5detect: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-shared.dir/build.make:92: src/gen_SRCS.stamp1] Error 127
gmake[1]: *** [CMakeFiles/Makefile2:2095: src/CMakeFiles/gen_hdf5-shared.dir/all] Error 2

The (recently updated) rocm/6.2.0 module provides libamdhip64.so.6, not .5. Loading rocm/6.1.3 fixes the HDF5 problem, but then later I hit the original missing-SetupBLT.cmake error. I also noticed that at least some of the builds (e.g. RAJA) seem to be picking up GCC 7.5.0 from /usr/bin/c++ rather than GCC 13 as expected, even though the compiler wrapper /opt/cray/pe/craype/2.7.32/bin/cc does point to gcc-13.2.1.
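
A couple of quick checks that may help pin this down. ROCM_PATH being set by the rocm module and the build script honoring CC/CXX are assumptions here; verify both against the loaded modules and the top of build_ascent.sh:

# which hip runtime soname the loaded rocm module actually provides
ls ${ROCM_PATH}/lib/libamdhip64.so*
# what the Cray wrapper and the bare system compiler report
cc --version            # should show gcc 13.x with PrgEnv-gnu + gcc-native/13.2
/usr/bin/c++ --version  # the gcc 7.5.0 that RAJA appears to pick up
# exporting CC/CXX usually steers CMake (and the script) toward the wrappers
env CC=$(which cc) CXX=$(which CC) enable_mpi=ON ./scripts/build_ascent/build_ascent.sh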

Trying to minimize the modules a bit:

module load Core/24.00 # makes cmake available
module load cmake # loads cmake/3.23.2
module load PrgEnv-gnu

[mlohry@login07.frontier ascent]$ module list

Currently Loaded Modules:
  1) Core/24.00     3) gcc-native/12.3    5) cray-dsmml/0.2.2   7) craype-network-ofi   9) cray-libsci/23.12.5
  2) cmake/3.23.2   4) craype/2.7.31.11   6) libfabric/1.20.1   8) cray-mpich/8.1.28   10) PrgEnv-gnu/8.5.0

I still hit the original build error of BLT_SOURCE_DIR not being defined correctly (where is this normally picked up from? It's not being exported to ascent-config.cmake).

@mlohry

Thanks for the details - try this branch:

https://github.com/Alpine-DAV/ascent/tree/task/2024_09_frontier

run:

https://github.com/Alpine-DAV/ascent/blob/task/2024_09_frontier/scripts/build_ascent/build_ascent_hip_frontier.sh

These aren't the same modules you need for the integrated case, but I was able to run the Ascent MPI tests successfully (I tried two ranks).

Here are the modules that need to be loaded to run (from the top of the Frontier build script):

module load cmake #3.23.2
module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.7.1
module load cray-mpich/8.1.28
module load cce/17.0.0
module load cray-python/3.11.5
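
Putting it together, a rough end-to-end sequence for trying that branch with the modules above loaded (a fresh recursive clone; any extra environment variables the script expects are left at their defaults):

git clone --recursive https://github.com/Alpine-DAV/ascent.git
cd ascent
git checkout task/2024_09_frontier
./scripts/build_ascent/build_ascent_hip_frontier.sh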

If you see BLT_SOURCE_DIR missing, that means you missed --recursive on the git clone.

You can fix that with:

git submodule init
git submodule update
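
To confirm the submodules are actually populated before re-running the build, git submodule status lists each one; an uninitialized submodule is prefixed with a '-':

git submodule status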

Looking again -- I think the MPI issue confused me; a missing BLT_SOURCE_DIR would also cause that MPI error. The submodule update should fix that.

@cyrush thanks, that built, but back to the original issue I hit: when I execute ascent_replay_mpi it fails in the actions-file check:

terminate called after throwing an instance of 'conduit::Error'
  what():
file: /lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/src/utilities/replay/replay.cpp
line: 214
message:
Actions file not found: ascent_actions_relay_no_boundary.yaml

srun: error: frontier06127: task 1: Aborted

That file exists and rank 0 sees it, but the MPI broadcast of the bool seems to leave the other ranks seeing false. It looks like that code is fairly recent:
e504a28
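
For context, the failing invocation looks roughly like this; the flags follow the replay utility's --root/--actions style, but treat the exact command line and the root file name as illustrative:

srun -N 2 -n 2 ./ascent_replay_mpi \
    --root=my_blueprint_save.cycle_000100.root \
    --actions=ascent_actions_relay_no_boundary.yaml
# rank 0 finds the actions file; ranks > 0 abort with "Actions file not found"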

Are you able to successfully run ascent_replay_mpi?

OK - sounds like a new bug, and the system MPI is fine. We did have a change there recently; looking into it.

@mlohry On the task/2024_09_frontier branch I changed the actions-checking logic in relay to match another implementation we have. Can you see if this resolves your issue?

Sitting in queues, will let you know.

What is MPI_BOOL in that code?

The other code uses MPI_INT instead of MPI_BOOL.

I was wondering where MPI_BOOL was actually being defined, since that's not a standard MPI datatype and I'm not finding it when grepping the dependencies.

The latest branch looks like it might have worked, but the post-processing run I expected to take 10 minutes on 471 nodes timed out after 60 minutes without producing any images, so I can't tell whether it was hanging. I'll try it again on a smaller dataset.

That is a great question. The ifdef was wrong, so the code using MPI_BOOL was never actually being compiled -- that makes a bit more sense.
I pushed another fix to the frontier branch.

The call

relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);

is missing the conduit:: namespace:

@@ -232,7 +232,7 @@ void load_actions(const std::string &file_name, int mpi_comm_id, conduit::Node &
                      << "\n" << emsg);
     }
 #ifdef ASCENT_REPLAY_MPI
-    relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);
+    conduit::relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);
 #endif
 }

Pushed the fix -- checked that it compiles and it worked.

Aside -- trying to build a past working version, commit a5f51b:

git clone --recursive https://github.com/Alpine-DAV/ascent.git
cd ascent
git checkout a5f51b
env enable_mpi=ON ./scripts/build_ascent/build_ascent.sh

this check in the script:

if [ -d ${ascent_checkout_dir} ]; then

fails, so it does a fresh clone of ascent develop and ends up building that rather than the checked-out commit.

@mlohry Looked into this: the logic to use an existing checkout was added after a5f51b (#1324); it was part of #1339.
So I think that explains that specific issue.

Overall, develop (or the frontier branch) should be the best option -- sorry for the bumps in the road with the recent replay bugs. We are planning to add extensive replay testing. Since the ifdef typos have now happened twice, we will need to think of a good way to protect against those errors, since the compiler doesn't help us with them :-)

@mlohry I think we have addressed the Frontier build issues and the replay woes; please let me know if anything else is blocking your builds.