ECP-WarpX/WarpX

Running WarpX on DGX A100 GPUs

Opened this issue · 5 comments

Hi @ax3l, I noticed that the previous PR (#4836) has been merged, so I'm continuing the discussion in this issue.

I have successfully installed WarpX. You recommended against combining HDF5 with ADIOS2; however, I did not follow that advice. I adapted my profile script from the HPC3 (UCI) example; it is:

#!/bin/bash
export proj="ljz_gpu"


export MY_PROFILE=$(cd $(dirname $BASH_SOURCE) && pwd)/$(basename $BASH_SOURCE)


module load gcc/11.3.0-gcc-9.4.0
module load cmake/3.25.2-gcc-4.8.5  
module load cuda/11.8.0-gcc-4.8.5  
#module load openmpi/4.1.5-gcc-9.4.0  
module load intel-oneapi-mpi/2021.8.0-gcc-4.8.5
module load intel-oneapi-compilers/2021.4.0-gcc-4.8.5
module load intel-oneapi-mkl/2021.4.0-gcc-4.8.5
#module load nvhpc/22.11-gcc-4.8.5

module load boost/1.80.0-gcc-9.4.0 

# optional: for openPMD and PSATD+RZ support
module load openblas/0.3.21-gcc-9.4.0
#module load  hdf5/1.14.0-gcc-9.4.0  
export PATH=/ShareData1/App/abinit-dependence/hdf5-1.10.6/bin:$PATH

export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/c-blosc-1.21.1:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/lapackpp-master:$CMAKE_PREFIX_PATH

export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/c-blosc-1.21.1/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/lapackpp-master/lib64:$LD_LIBRARY_PATH

export PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3/bin:${PATH}


module load python/3.10.6-gcc-4.8.5  


if [ -d "${HOME}/sw/hpc3/gpu/venvs/warpx-gpu" ]
then
  source ${HOME}/sw/hpc3/gpu/venvs/warpx-gpu/bin/activate
fi

# an alias to request an interactive batch node for 30 min
#   for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -N 1 -t 0:30:00 --gres=gpu:A100:1 -p free-gpu"
# an alias to run a command on a batch node for up to 30min
#   usage: runNode <command>
alias runNode="srun -N 1 -t 0:30:00 --gres=gpu:A100:1 -p free-gpu"


export AMREX_CUDA_ARCH=8.0

# compiler environment hints
export CXX=$(which g++)
export CC=$(which gcc)
export FC=$(which gfortran)
export CUDACXX=$(which nvcc)
export CUDAHOSTCXX=${CXX}
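
For completeness, after sourcing this profile I configured and built WarpX following the HPC3 documentation. Roughly, the steps were as follows (a sketch assuming the standard WarpX CMake options; the profile filename is a placeholder and the exact flags may have differed):

source $HOME/warpx_gpu.profile    # placeholder name for the profile shown above
cd $HOME/src/warpx

# executable build (produces build/bin/warpx.2d)
rm -rf build
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_DIMS=2
cmake --build build -j 16

# Python bindings, needed to run PICMI scripts from the virtual environment
rm -rf build_py
cmake -S . -B build_py -DWarpX_COMPUTE=CUDA -DWarpX_DIMS=2 -DWarpX_PYTHON=ON
cmake --build build_py --target pip_install -j 16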

The compilation finished without errors. Afterwards, I ran the full dependency-installation script from the HPC3 documentation and also installed the Python module. However, when I tried to run the Ohm Solver: Magnetic Reconnection example, I ran into problems such as insufficient memory. My job submission script is:

#!/bin/bash -l

# Copyright 2023 The WarpX Community
#
# This file is part of WarpX.
#
# Authors: Axel Huebl, Victor Flores
# License: BSD-3-Clause-LBNL

#SBATCH --time=08:00:00
#SBATCH --nodes=1
##SBATCH --nodelist=gpu010
#SBATCH -J WarpX
##SBATCH -A <proj>
#SBATCH -p gpup1
# use all four GPUs per node
##SBATCH --ntasks-per-node=8
##SBATCH --gres=gpu:A100:1
##SBATCH --cpus-per-task=10
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j
ulimit -m unlimited
ulimit -d unlimited
ulimit -s unlimited
#ulimit -p unlimited
cd /public/home/ljz_gpu/warpx_sim
# executable & inputs file or python interpreter & PICMI script here
EXE=/public/home/ljz_gpu/src/warpx/build/bin/warpx.2d
INPUTS=PICMI_inputs.py

# OpenMP threads
#export OMP_NUM_THREADS=16

# run
#srun --ntasks=4 bash -c "
#mpirun --oversubscribe -np 28  bash -c "
#    export CUDA_VISIBLE_DEVICES=\${SLURM_LOCALID};
#    ${EXE} ${INPUTS}" \
#  > output.txt
mpirun --oversubscribe -np 28  /public/home/ljz_gpu/src/warpx/build/bin/warpx.2d PICMI_inputs.py > output.txt

The error file is:
error.txt

My cluster consists of 9 NVIDIA DGX A100 servers. Each server is equipped with dual AMD EPYC 7742 (Rome) processors (64 cores / 128 threads each), 1 TB of DDR4 memory, 8 NVIDIA A100 40 GB SXM4 accelerator cards, 8 single-port 200 Gb HDR high-speed network interfaces, 1 dual-port 100 Gb EDR high-speed network interface, and 19 TB of all-SSD storage. In total the platform has 1152 CPU cores, 72 GPUs, theoretical FP32 and FP64 performance exceeding 1404 TFLOPS and 702 TFLOPS, respectively, and more than 170 TB of storage. It runs Slurm and provides a Jupyter service.

Hi @ax3l! After discussing my issue with the HPC administrator, we tried recompiling from the environment setup onwards twice with their assistance, yet the error still occurred. Therefore, I would like to ask whether there are any cluster-specific installation steps not mentioned in the official documentation. I am willing to install all dependencies myself. Are there recommended compiler and CUDA versions, or any other related recommendations? Once I succeed, I would be very eager to share my installation process, as I believe it could help others who are new to the installation.

After further work with the HPC administrator today, we have successfully installed WarpX, but we encountered new problems:

1. Most input files run successfully, but the example input file for "Laser-Ion Acceleration with a Planar Target" errors out and fails to run.
2. We are unable to run Python scripts. After building with the Python interface, no executable file is generated, and running a Python script as the input file results in errors.

The command I used is:

mpirun -np 4 /public/home/ljz_gpu/sw/hpc3/gpu/venvs/warpx-gpu/bin/python3 /public/home/ljz_gpu/warpx/Examples/Tests/ohm_solver_magnetic_reconnection/PICMI_inputs.py > output.txt

The error file is below:
WarpX.e103203.txt
Backtrace.0.txt
Backtrace.1.txt
Backtrace.2.txt
Backtrace.3.txt

Looking forward to your reply!

ax3l commented

What I would start with: note that WarpX uses 1 MPI rank per GPU. So for your job script above, where you use 1 node with 8 GPUs, this should read:

mpirun -np 8  /public/home/ljz_gpu/src/warpx/build/bin/warpx.2d PICMI_inputs.py > output.txt

Do not oversubscribe; we do not support that.

If this still segfaults, then please repeat with a single MPI rank and also post the backtrace files.
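
For reference, a minimal sketch of a matching Slurm request and launch for one DGX node, based on the commented-out pieces of your script above (the gres name, partition, and cpus-per-task may need adjusting for your cluster):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:A100:8
#SBATCH --cpus-per-task=16

# 1 MPI rank per GPU; each rank only sees its local GPU
srun bash -c "
    export CUDA_VISIBLE_DEVICES=\${SLURM_LOCALID};
    ${EXE} ${INPUTS}" > output.txt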

ax3l commented

Another point to check: in your backtrace, I see mpi4py issues. This might be from the oversubscription or another issue. Double check that the compilers and MPI used to build WarpX are the same as used to build mpi4py.
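
One quick way to compare them (a sketch; the python path is that of your virtual environment above, and the pip rebuild line is only needed if the versions differ):

# MPI library that mpi4py was linked against
/public/home/ljz_gpu/sw/hpc3/gpu/venvs/warpx-gpu/bin/python3 -c "from mpi4py import MPI; print(MPI.Get_library_version())"

# MPI compiler wrappers / runtime used to build and run WarpX
which mpicc mpirun
mpirun --version

# if they do not match, rebuild mpi4py from source against the loaded MPI
python3 -m pip install --force-reinstall --no-cache-dir --no-binary mpi4py mpi4py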

Hi @ax3l,
Based on your advice, I am now able to successfully run the Python script for the "Parallel propagating waves" case of the Ohm solver: Electromagnetic modes example. However, when attempting to run the Python script for "Perpendicular propagating waves", I encounter an error. I used the submission command:

mpirun -np 8  /public/home/ljz_gpu/sw/hpc3/gpu/venvs/warpx-gpu/bin/python3 /public/home/ljz_gpu/src/warpx/Examples/Tests/ohm_solver_EM_modes/PICMI_inputs.py -d 3 --bdir x > output.txt

The error file reads:
output.txt
Warpx.e105046.txt
Backtrace.0.txt

Similar errors occur when running other input files or Python scripts. I checked, and the compilers are consistent, which confuses me.