Building with Spack with CUDA and OpenMP
Opened this issue · 6 comments
I'm trying to build Palace with GPU (CUDA) and OpenMP support using Spack.
The package file is the same as palace/spack/local/packages/palace/package.py on the main branch of awslabs/palace.
My installation command is spack install palace +cuda cuda_arch=86 +openmp
spack spec result: palace-spec.txt
Problem with OpenMP
After changing the command from palace -np 64 2DQv9_eb4_3d_resonator_eigen.json -launcher-args "--use-hwthread-cpus" to palace -nt 64 2DQv9_eb4_2d_resonator_eigen.json, the following error occurs:
...
Git changeset ID: d03e1d9
Running with 1 MPI process, 64 OpenMP threads
Detected 1 CUDA device
Device configuration: omp,cpu
Memory configuration: host-std
libCEED backend: /cpu/self/xsmm/blocked
...
Configuring SLEPc eigenvalue solver:
Scaling γ = 6.087e+02, δ = 7.724e-06
Configuring divergence-free projection
Using random starting vector
Verification failed: (!err_flag) is false:
--> Error during setup! Error code: 1
... in function: virtual void mfem::HypreSolver::Setup(const mfem::HypreParVector&, mfem::HypreParVector&) const
... in file: /tmp/lesnow/spack-stage/spack-stage-palace-develop-pkce5vp2bxzmswrs324vma4hf56do3ip/spack-build-pkce5vp/extern/mfem/linalg/hypre.cpp:4038
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[WARNING] yaksa: 9 leaked handle pool objects
The same configuration file works fine under Palace@0.13.0 with the default setup (no OpenMP or CUDA).
Problem with GPU
When setting ["Solver"]["Device"] = "GPU", the following error occurs:
spack-build-pkce5vp/extern/libCEED/backends/ceed-backend-weak.c:15 in CeedInit_Weak(): Backend not currently compiled: /gpu/cuda/magma
Consult the installation instructions to compile this backend
LIBXSMM_VERSION: feature_int4_gemms_scf_zpt_MxK-1.17-3727 (25693839)
LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz]
Registry and code: 13 MB
Command: /home/lesnow/spack/opt/spack/linux-ubuntu20.04-cascadelake/gcc-9.4.0/palace-develop-pkce5vp2bxzmswrs324vma4hf56do3ip/bin/palace-x86_64.bin 2DQv9_eb4_2d_resonator_eigen_gpu.json
Uptime: 1.496896 s
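For context, the device is selected in the configuration file roughly as follows (a minimal sketch; only the "Device" value is taken from my actual file, everything else is omitted):

{
  "Solver":
  {
    "Device": "GPU"
  }
}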
Environment
Linux amax 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:3B:00.0 Off | N/A |
| 30% 34C P8 21W / 220W | 382MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1929 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 3658 G /usr/lib/xorg/Xorg 167MiB |
| 0 N/A N/A 3885 G /usr/bin/gnome-shell 62MiB |
| 0 N/A N/A 4195 G ...bexec/gnome-initial-setup 3MiB |
| 0 N/A N/A 4224 G ...2gtk-4.0/WebKitWebProcess 20MiB |
+-----------------------------------------------------------------------------+
Compiler: palace-spec.txt
Is this an issue with the Spack package file or with my local environment? Could you please suggest a solution? Thanks!
Hi @LeSnow-Ye,
I'm sorry to hear that you're having issues with the build. It would be helpful to narrow the failure down to a case I can reproduce, so I can try to figure it out.
For the OpenMP issue: can you try running one of the cases from the examples folder, preferably the same problem type as the one in your config file? Then can you also check (a) whether the OpenMP build runs with -nt 1, and (b) the minimum number N for which -nt N fails. The issue appears to be in hypre, but unfortunately that error code is generic and, from the docs, not very helpful.
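Something along these lines would do it (only a rough sketch; the config path is a placeholder for whichever example you pick):

# Rough sketch: find the smallest thread count at which the OpenMP build fails.
# Replace path/to/example.json with the example matching your problem type.
for n in 1 2 4 8 16 32 64; do
    palace -nt "$n" path/to/example.json || { echo "first failure at -nt $n"; break; }
done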
For the GPU build, I am not sure, as I have not been able to build that myself. Building with GPUs is currently fairly arcane, and there are likely issues with that Spack script (hence it not being uploaded to the main Spack repo yet), as we have not finished testing it. From that error message, it would seem that the magma build is not being triggered. We currently don't have the resources to do further testing for GPUs, but since you are working on them, we would very much appreciate any fixes to that Spack file you might suggest.
Hi @hughcars,
Thanks for your reply.
The OpenMP issue seems to be narrowed down to Eigenmode problems. The cavity example failed as well, starting from -nt 1, but the other examples seem fine.
For the GPU build, I'd love to help. But currently I'm not so familiar with the build process. If I make any progress or find any potential fixes, I will try to submit a Pull Request with the updates. Your assistance in this area would be greatly appreciated.
Yesterday I made a manual OpenMP build to check that, and though I uncovered some issues (#279) while testing on an Apple Mac M1, the cases did run without your error message, and all examples ran perfectly with -nt 1. When I get some more bandwidth I will try debugging the Spack build, as my initial attempts have failed.
For your GPU question, have you tried +magma?
Hi, @hughcars,
Thanks for your reply.
Somehow, when I use +cuda cuda_arch=<xx> together with +magma, conflicts are detected:
==> Error: concretization failed for the following reasons:
1. magma: conflicts with 'cuda_arch=86'
2. magma: conflicts with 'cuda_arch=86'
required because conflict constraint
required because palace depends on magma+cuda cuda_arch=86 when +cuda+magma cuda_arch=86
required because palace+cuda+magma+openmp cuda_arch=86 requested explicitly
required because palace depends on magma+shared when +magma+shared
required because conflict is triggered when cuda_arch=86
required because palace depends on magma+cuda cuda_arch=86 when +cuda+magma cuda_arch=86
required because palace+cuda+magma+openmp cuda_arch=86 requested explicitly
required because palace depends on magma+shared when +magma+shared
I think it might be unnecessary to add +magma when we already have +cuda, according to the script palace/package.py:
...
with when("+magma"):
    depends_on("magma")
    depends_on("magma+shared", when="+shared")
    depends_on("magma~shared", when="~shared")
...
with when("+cuda"):
    for arch in CudaPackage.cuda_arch_values:
        cuda_variant = f"+cuda cuda_arch={arch}"
        depends_on(f"hypre{cuda_variant}", when=f"{cuda_variant}")
        depends_on(f"superlu-dist{cuda_variant}", when=f"+superlu-dist{cuda_variant}")
        depends_on(f"strumpack{cuda_variant}", when=f"+strumpack{cuda_variant}")
        depends_on(f"slepc{cuda_variant} ^petsc{cuda_variant}", when=f"+slepc{cuda_variant}")
        depends_on(f"magma{cuda_variant}", when=f"+magma{cuda_variant}")
Maybe I should try building manually later.
Hi, @hughcars,
The CUDA problem is because MAGMA in Spack currently has poor support for many specific cuda_arch values:
# Many cuda_arch values are not yet recognized by MAGMA's CMakeLists.txt
for target in [10, 11, 12, 13, 21, 32, 52, 53, 61, 62, 72, 86]:
    conflicts("cuda_arch={}".format(target))
In the recently updated master branch of MAGMA, more architectures are accepted. See icl/magma/CMakeLists.txt on Bitbucket.
So, by manually removing the needed cuda_arch value from the list above and switching to magma@master, my CUDA problem was solved.
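For anyone hitting the same thing, the workaround amounts to roughly the following (a sketch of the steps I took; the exact spec may differ on your system):

# Open the Spack recipe for magma and delete 86 (or whichever arch you need) from
# the conflicts("cuda_arch=...") list quoted above; a local edit, not an upstream fix.
spack edit magma

# Then build against MAGMA's master branch, which accepts more architectures.
spack install palace +cuda cuda_arch=86 +openmp +magma ^magma@master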
Ooof, building with GPUs is very fiddly in our experience so far. I'm very glad you managed to get to the root cause!