LAMMPS-GPU Benchmark-Cuda driver error 4 in call at file ‘geryon/nvd_device.h

Question

DaveiV opened this issue 5 months ago · 4 comments

DaveiV commented 5 months ago

Summary
I get a CUDA driver error when running the following command:

mpirun --allow-run-as-root -n 32 lmp -sf gpu -pk gpu 2 -restart2data lmp.restart remap lmp_final.data

LAMMPS Version and Platform

LAMMPS version 20230802.1 and 20230802.2

Details

Notes:

I get this error when running LAMMPS versions 20230802.1 and 20230802.2 built with CUDA.
When I run LAMMPS versions 20230802.1 and 20230802.2 built with OpenCL, no error appears. The program runs smoothly, but performance decreases by about 20%.
When I run LAMMPS version 20220623.4 built with CUDA, the error does not appear.

Comparing the two versions, the file lib/gpu/geryon/nvd_device.h in LAMMPS version 20230802 differs from the one in version 20220623. Some code was added to this file:

-Version 20230802


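void UCL_Device::clear() {
  if (_device > -1) {
    // Drop every command queue except the default one
    for (int i = 1; i < num_queues(); i++)
      pop_command_queue();

#if GERYON_NVD_PRIMARY_CONTEXT
    // Restore the previous context and release the device's primary context
    CU_SAFE_CALL_NS(cuCtxSetCurrent(_old_context));
    CU_SAFE_CALL_NS(cuDevicePrimaryCtxRelease(_cu_device));
#else
    cuCtxDestroy(_context);
#endif
  }  // closing brace of the if block, missing in the pasted excerpt
  _device = -1;
}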

-Version 20220623

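void UCL_Device::clear() {
  if (_device > -1) {
    // Drop every command queue except the default one
    for (int i = 1; i < num_queues(); i++)
      pop_command_queue();

    cuCtxDestroy(_context);
  }  // closing brace of the if block, missing in the pasted excerpt
  _device = -1;
}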

Currently I am using Spack to build LAMMPS:

spack graph lammps@20230802.1%aocc@4.1.0+cuda cuda_arch=90 fftw_precision=single target=zen4 +extra-dump +granular +kspace +manybody +meam +molecule +opt +replica +rigid +openmp +openmp-package ^amdfftw %aocc@4.1.0 ^ucx@1.15.0 %aocc@4.1.0 +xpmem+verbs+ud+rc+mlx5_dv+cuda cuda_arch=80 ^openmpi@4.1.5 %aocc@4.1.0 +cuda cuda_arch=80 fabrics=ucx

Answer 1 · 2024-01-16T05:22:25.000Z

mpirun --allow-run-as-root -n 32 lmp -sf gpu -pk gpu 2 -restart2data lmp.restart remap lmp_final.data

There is no GPU package acceleration used for this command. So the only difference you are measuring in terms of time is the difference in the one-time initialization of the GPU.

In fact, there is little benefit from using MPI parallelization at all here (and in general there is little benefit from using 16 MPI processes per GPU either, unless you compile and run with CUDA Multi-Process Service (MPS) support via -DCUDA_MPS_SUPPORT).

So please test with the in.lj or in.rhodo or in.eam examples in the bench folder and use only 2-8 MPI processes.
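As a rough sketch of what such a test run could look like: the snippet below only assembles and prints the command line. NP, INPUT and BENCH_CMD are hypothetical helper variables chosen for illustration, the executable name lmp and the "-pk gpu 2" setting are taken from the original command, and the input file is assumed to sit in the current directory (the bench folder).

```shell
#!/bin/sh
# Sketch only: NP, INPUT and BENCH_CMD are illustrative helper variables,
# not LAMMPS settings.
NP=4            # a small MPI process count within the suggested 2-8 range
INPUT=in.lj     # or in.rhodo / in.eam from the bench folder

# Same GPU switches as in the original command, but with a real input script
BENCH_CMD="mpirun -n $NP lmp -sf gpu -pk gpu 2 -in $INPUT"
echo "$BENCH_CMD"

# To actually launch the run from inside the bench folder, uncomment:
# $BENCH_CMD
```

Printing the command first makes it easy to check the process count and input file before anything touches the GPUs.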

Answer 2 · 2024-01-16T08:48:03.000Z

Dear @akohlmey,
About the command
mpirun --allow-run-as-root -n 32 lmp -sf gpu -pk gpu 2 -restart2data lmp.restart remap lmp_final.data
May I confirm that you mean this command does not support GPU package acceleration? It only works with CPU, right?
And why is there no error when I run this command with LAMMPS version 20220623, while I encounter this error with version 20230802?

As for LAMMPS with MPI parallelization, I will try -DCUDA_MPS_SUPPORT after fixing the error above.

I have tested and successfully run the in.lj, in.rhodo, and in.eam files.
Thank you

Answer 3 · 2024-01-16T14:40:16.000Z

May I confirm that you mean this command does not support GPU package acceleration?

It does not use GPU acceleration. The -restart2data command line flag as used in your example is equivalent to an input file with:

read_restart lmp.restart remap
write_data lmp_final.data noinit

It only works with CPU, right?

It sets up a calculation with GPU acceleration, but then does not properly initialize everything and never uses the GPU. The operations in use are heavily I/O bound and thus not parallelizable. Enabling the GPU for this leads to undefined behavior. Now, LAMMPS might be changed to handle this case more gracefully, but first and foremost, this is a user error of trying to enable functionality that (very obviously) has no meaning in this context.
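For comparison, a minimal sketch of the same conversion without the GPU switches and without mpirun, assuming the lmp executable and the file names from the original command. CONVERT_CMD is a hypothetical helper variable, and the snippet only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: CONVERT_CMD is an illustrative helper variable.
# The conversion is I/O bound, so a single process with no -sf gpu / -pk gpu
# switches is assumed to be sufficient here.
CONVERT_CMD="lmp -restart2data lmp.restart remap lmp_final.data"
echo "$CONVERT_CMD"

# To actually perform the conversion, uncomment:
# $CONVERT_CMD
```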

One more comment: your use of the flag --allow-run-as-root suggests that you are running as root. This is a very, very, VERY bad idea. Running MPI as root makes no sense since parallel applications are not a system management thing and in combination with LAMMPS it is particularly dangerous since LAMMPS has facilities that may delete or modify files: one typo and you may destroy your entire installation and render the server unusable to the point of requiring an installation from scratch.
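One possible safeguard, sketched here as an assumption rather than anything LAMMPS provides: a small wrapper function that refuses to launch when the user id is 0. The name refuse_if_root is hypothetical.

```shell
#!/bin/sh
# Sketch only: refuse_if_root is a hypothetical helper, not part of LAMMPS or MPI.
refuse_if_root() {
  # $1: numeric user id to check (defaults to the id of the current user)
  uid="${1:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    echo "refusing to launch LAMMPS as root; use an unprivileged account" >&2
    return 1
  fi
  return 0
}

# Example use (left commented out so the snippet itself launches nothing):
# refuse_if_root && mpirun -n 4 lmp -in in.lj
```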

Answer 4 · 2024-01-17T03:06:24.000Z

Dear @akohlmey,
I will use the -restart2data command line flag with the CPU only.
Thank you for explaining --allow-run-as-root to me.