UCL-RITS/rcps-buildscripts

Install Request: LAMMPS 15th June 2023 release but doing 2nd August 2023 [IN06098486]


The 15th June 2023 release is the first to support outputting vector-style variables during a simulation run, which this research group needs.

It looks like the latest version in Spack is 8 Feb 2023.

https://www.lammps.org/download.html

Ticket now IN:06149543.

LAMMPS 2nd August 2023 is now the latest release, so that is the version to install.

A build of LAMMPS 2nd August 2023 using GNU compilers and FFTW is now on Myriad and Young using our build scripts method:

module -f unload compilers mpi gcc-libs
module load beta-modules
./lammps-2Aug2023-basic-fftw-gnu_install 2>&1 | tee ~/Software/LAMMPS/lammps-2Aug2023-basic-fftw-gnu_install.log-1
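One thing to watch with the `2>&1 | tee` pattern is that the shell reports the exit status of `tee`, not of the build script, so a failed build can look like a success. A minimal sketch of one way to keep both the log and the real exit code, assuming a POSIX shell; `run_logged` is a hypothetical helper and not part of the build scripts:

```shell
# Hypothetical helper: run a command, tee its combined output to a log file,
# and return the command's own exit status instead of tee's.
run_logged() {
    log=$1
    shift
    rc_file=$(mktemp) || return 1
    # Record the command's exit status from inside the left-hand side of the pipe.
    { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" > "$rc_file"; } | tee "$log"
    rc=$(cat "$rc_file")
    rm -f "$rc_file"
    return "$rc"
}

# Example, with the paths used above:
# run_logged ~/Software/LAMMPS/lammps-2Aug2023-basic-fftw-gnu_install.log-1 \
#     ./lammps-2Aug2023-basic-fftw-gnu_install
```

The status is passed through a temporary file because the left-hand side of a pipeline runs in a subshell, so a plain variable set there would be lost.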

Need to produce a module file and run some tests next.

I now have the module file on Myriad and Young and have submitted test jobs on both clusters.

The test jobs on both Myriad and Young worked. The modules needed for the basic GNU + FFTW version are:

Myriad

module -f unload compilers mpi gcc-libs
module load beta-modules
module load gcc-libs/10.2.0
module load compilers/gnu/10.2.0
module load numactl/2.0.12
module load binutils/2.36.1/gnu-10.2.0
module load ucx/1.9.0/gnu-10.2.0
module load mpi/openmpi/4.0.5/gnu-10.2.0
module load python3/3.9-gnu-10.2.0
module load fftw/3.3.9/gnu-10.2.0
module load lammps/2aug23/basic-fftw/gnu-10.2.0

Young

module -f unload compilers mpi gcc-libs
module load beta-modules
module load gcc-libs/10.2.0
module load compilers/gnu/10.2.0
module load mpi/openmpi/4.0.5/gnu-10.2.0
module load python3/3.9-gnu-10.2.0
module load fftw/3.3.9/gnu-10.2.0
module load lammps/2aug23/basic-fftw/gnu-10.2.0
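
The two lists differ only in that Myriad also needs the numactl, binutils and ucx modules. A sketch of how a job script might share one list between the clusters, assuming a POSIX shell; the function name and the cluster-name argument are made up for illustration, and the `module` command itself only exists on the clusters:

```shell
# Hypothetical helper: print the modules, in load order, for the basic
# GNU + FFTW build. The argument is a cluster name chosen for this sketch.
lammps_gnu_modules() {
    echo beta-modules
    echo gcc-libs/10.2.0
    echo compilers/gnu/10.2.0
    if [ "$1" = myriad ]; then
        # Only the Myriad list above includes these three.
        echo numactl/2.0.12
        echo binutils/2.36.1/gnu-10.2.0
        echo ucx/1.9.0/gnu-10.2.0
    fi
    echo mpi/openmpi/4.0.5/gnu-10.2.0
    echo python3/3.9-gnu-10.2.0
    echo fftw/3.3.9/gnu-10.2.0
    echo lammps/2aug23/basic-fftw/gnu-10.2.0
}

# In a job script on the cluster:
# module -f unload compilers mpi gcc-libs
# for m in $(lammps_gnu_modules myriad); do module load "$m"; done
```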

Now working on the GNU + GPU build.

Build script updated and pulled to Young. Needs to be built on a GPU node so job submitted to build LAMMPS 2nd August 2023 GNU+GPU on Young. Build script:

lammps-2Aug2023-gpu-gnu_install

Build job for LAMMPS 2nd August 2023 GNU+GPU submitted on Myriad as well.

Both jobs are running.

CPU build done on Kathleen and test job submitted.

I've only had time today to check the output from the test job on Kathleen. It looks like it has worked ok.

See:

/home/ccspapp/Software/LAMMPS/tmp.2P0AWpwdjR/lammps-2Aug2023/cmake/presets/most.cmake

for the list of LAMMPS packages in our default CPU builds.

I had to redo the GPU builds on Myriad and Young as I had missed out the FFTW module.

The Myriad build has completed and a job running the GPU unit tests has been submitted.

Young build job is still waiting.

Test jobs for the GPU build have been submitted on Myriad and Young.

I've also been trying a build of the basic Intel version but this is failing during compilation:

/dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build/_deps/googletest-src/googletest/include/gtest/gtest-matchers.h(434): error: namespace "std" has no member "is_trivially_copy_constructible"
             std::is_trivially_copy_constructible<M>::value &&
                  ^
          detected during:
            processing of template argument list for "testing::internal::MatcherBase<T>::ValuePolicy [with T=const std::string &]" based on template argument <MM> at line 483
            instantiation of "void testing::internal::MatcherBase<T>::Init(M &&) [with T=const std::string &, M=const testing::MatcherInterface<const std::string &> *&]" at line 312
            instantiation of "testing::internal::MatcherBase<T>::MatcherBase(const testing::MatcherInterface<U> *) [with T=const std::string &, U=const std::string &]" at line 536

/dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build/_deps/googletest-src/googletest/include/gtest/gtest-matchers.h(434): error: type name is not allowed
             std::is_trivially_copy_constructible<M>::value &&
                                                  ^
          detected during:
            processing of template argument list for "testing::internal::MatcherBase<T>::ValuePolicy [with T=const std::string &]" based on template argument <MM> at line 483
            instantiation of "void testing::internal::MatcherBase<T>::Init(M &&) [with T=const std::string &, M=const testing::MatcherInterface<const std::string &> *&]" at line 312
            instantiation of "testing::internal::MatcherBase<T>::MatcherBase(const testing::MatcherInterface<U> *) [with T=const std::string &, U=const std::string &]" at line 536

/dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build/_deps/googletest-src/googletest/include/gtest/gtest-matchers.h(434): error: the global scope has no "value"
             std::is_trivially_copy_constructible<M>::value &&
                                                      ^
          detected during:
            processing of template argument list for "testing::internal::MatcherBase<T>::ValuePolicy [with T=const std::string &]" based on template argument <MM> at line 483
            instantiation of "void testing::internal::MatcherBase<T>::Init(M &&) [with T=const std::string &, M=const testing::MatcherInterface<const std::string &> *&]" at line 312
            instantiation of "testing::internal::MatcherBase<T>::MatcherBase(const testing::MatcherInterface<U> *) [with T=const std::string &, U=const std::string &]" at line 536

compilation aborted for /dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build/_deps/googletest-src/googletest/src/gtest-all.cc (code 2)
make[2]: *** [_deps/googletest-build/googletest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 2
make[2]: Leaving directory `/dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build'
make[1]: *** [_deps/googletest-build/googletest/CMakeFiles/gtest.dir/all] Error 2
make[1]: Leaving directory `/dev/shm/ccspapp/lammps/tmp.v9Hhxi8WAq/lammps-stable_2Aug2023/build'
make: *** [all] Error 2

Using Intel 2020 compilers.

I would use compilers/intel/2022.2 and not 2020 for anything (because of newer gcc underneath).
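
Some background on the "gcc underneath" point: the Intel classic compiler uses the libstdc++ headers of whichever GCC it finds, and as far as I know `std::is_trivially_copy_constructible` only appeared in libstdc++ with GCC 5, so an older GCC underneath would explain the googletest errors above. A small sketch of a pre-build check, assuming a POSIX shell; `version_major_at_least` is a hypothetical helper and the threshold of 5 is my assumption rather than something confirmed in this thread:

```shell
# Hypothetical helper: succeed if a dotted version string has a major
# version greater than or equal to the given minimum.
version_major_at_least() {
    major=${1%%.*}
    [ "$major" -ge "$2" ]
}

# Example: check the GCC that is first in PATH after loading the modules.
# if version_major_at_least "$(gcc -dumpversion)" 5; then
#     echo "GCC is new enough for the unit tests"
# else
#     echo "GCC is too old; load gcc-libs/10.2.0 first" >&2
# fi
```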

Updated Intel build to use gcc-libs/10.2.0 and Intel 2022.2:

module -f unload compilers mpi gcc-libs
module load beta-modules
BUILD_UNIT_TESTS=yes ./lammps-2Aug2023-basic_install 2>&1 | tee ~/Software/LAMMPS/lammps-2Aug2023-basic_install.log-2

The test jobs for the GNU + GPU version have run successfully on Myriad and Young.

The basic Intel build on Myriad completed without errors using Intel 2022.2 compilers. It will need testing now.

I've submitted a test job for the basic Intel version on Myriad.

The LAMMPS 2nd August 2023 basic Intel version test job runs on Myriad. I'm now going to build this version on Kathleen and Young.

The builds on Kathleen and Young have finished. Will now need to check for errors and run a 2 node or bigger test job.

Two node test job for the basic Intel version submitted on Kathleen.

Two node test job for the basic Intel version submitted on Young.

The Kathleen job has been running for 6 hours (set for about 12). The Young one is still queueing.

Both jobs finished overnight and look ok. The Kathleen one was a bigger job and did 20,000 steps in about 8 hours, and the smaller Young one did 2,000 steps in 48 minutes. I'll upload a module file for the basic Intel version.

Module file updated and loaded onto Kathleen, Myriad and Young.

To use LAMMPS 2nd August 2023 version basic Intel build you need the following modules:

module -f unload compilers mpi gcc-libs
module load beta-modules
module load gcc-libs/10.2.0
module load compilers/intel/2022.2
module load mpi/intel/2019/update6/intel
module load python/3.9.10
module load lammps/2aug23/basic/intel-2022.2

Doing the build with the INTEL package next. On Kathleen first:

module -f unload compilers mpi gcc-libs
module load beta-modules
./lammps-2Aug2023-INTEL_install 2>&1 | tee ~/Software/LAMMPS/lammps-2Aug2023-INTEL_install.log

The INTEL build on Kathleen has completed without errors.

I have a test job submitted for the INTEL build on Kathleen.

It has started to run:

----------------------------------------------------------
Using INTEL Package without Coprocessor.
Compiler: Intel Classic C++ 20.21.6 / Intel(R) C++ g++ 10.2 mode
SIMD compiler directives: Enabled
Precision: mixed

Waiting to see how it runs overnight: this is a long test run with 20,000 steps.

The job ran to completion and the speed-up is quite good: 3 hours 15 minutes for the INTEL package version compared with about 8 hours for the basic Intel build.

Now to build the INTEL package variant on Young.

The build on Young finished without errors. Test job submitted.

Test job is still queuing so I will check results tomorrow.

The job failed because I made a mistake in my job script. I've corrected it and re-submitted the job.

I'm getting the build script for the Intel GPU variant ready to submit as a job on Young from ccspapp.

Build job for the Intel GPU variant submitted. Job script is:

/home/ccspapp/Software/LAMMPS/build-intel-gpu-2Aug2023.sh

Test job of the INTEL package variant worked this time. Took about 20 minutes to run as opposed to 48 minutes for the basic Intel variant.
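
For reference, the speed-up factors implied by the timings quoted in this thread can be worked out as below. This is only a sketch: the timings are the approximate wall-clock figures given above, and `speedup` is a hypothetical helper name.

```shell
# Hypothetical helper: ratio of baseline time to faster time, both in minutes.
speedup() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", base / fast }'
}

speedup 480 195   # Kathleen: about 8 hours vs 3 hours 15 minutes -> 2.46
speedup 48 20     # Young: 48 minutes vs about 20 minutes -> 2.40
```

So the INTEL package variant ran roughly 2.4 to 2.5 times faster than the basic Intel build on both test cases.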

The module file for the 2nd August 2023 version INTEL package variant has been uploaded to Kathleen and Young. To use the INTEL package variant the following module commands are needed:

module -f unload compilers mpi gcc-libs
module load beta-modules
module load gcc-libs/10.2.0
module load compilers/intel/2022.2
module load mpi/intel/2019/update6/intel
module load python/3.9.10
module load lammps/2aug23/userintel/intel-2022.2

The Intel GPU build job ran overnight but failed with:

      Options:       -xHost;-fp-model;fast=2;-no-prec-div;-qoverride-limits;-diag-disable=10441;-diag-disable=2196
In file included from /shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/include/cuda_runtime.h(83),
                 from /home/ccspapp/Scratch/lammps/2Aug2023/gpumixed/tmp.ZQuVwdmCED/lammps-stable_2Aug2023/lib/gpu/lal_zbl.cu(0):
/shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/include/crt/host_config.h(110): error: #error directive: -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, ICC 17.0, ICC 18.0 and ICC 19.x on Linux x86_64 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
  #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, ICC 17.0, ICC 18.0 and ICC 19.x on Linux x86_64 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
   ^

CMake Error at cuda_compile_fatbin_1_generated_lal_zbl.cu.fatbin.RelWithDebInfo.cmake:212 (message):
  Error generating
  /home/ccspapp/Scratch/lammps/2Aug2023/gpumixed/tmp.ZQuVwdmCED/lammps-stable_2Aug2023/build/cuda_compile_fatbin_1_generated_lal_zbl.cu.fatbin


make[2]: *** [cuda_compile_fatbin_1_generated_lal_zbl.cu.fatbin] Error 1
make[1]: *** [CMakeFiles/gpu.dir/all] Error 2
make: *** [all] Error 2

Will need to investigate tomorrow.

Switched to using CUDA 11.8.0 instead of 11.3.1. I had to install this version first as it wasn't on Young. The build has finished without errors so I'm running a test job next.
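
The 11.3.1 failure comes from the host compiler version check in CUDA's `crt/host_config.h`, which rejected the Intel 2022.2 compiler. A quick way to see which ICC versions a given CUDA install claims to accept is to print that message from its header. This is a sketch only, assuming a POSIX shell and that the header uses the same "unsupported ICC configuration" wording as the error above; `cuda_icc_support` is a hypothetical helper:

```shell
# Hypothetical helper: print the ICC support message from a CUDA install's
# host_config.h. The argument is the CUDA installation root directory.
cuda_icc_support() {
    header="$1/include/crt/host_config.h"
    [ -r "$header" ] || { echo "cannot read $header" >&2; return 1; }
    # Only the first matching line is needed; it lists the accepted versions.
    grep 'unsupported ICC configuration' "$header" | head -n 1
}

# Example, using the CUDA 11.3.1 path from the build log:
# cuda_icc_support /shared/ucl/apps/cuda/11.3.1/gnu-10.2.0
```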

Intel GPU variant test job submitted.

My test job failed because I hadn't got the module loads correct. I've now re-submitted it.

The Intel GPU variant test job has failed with MPI errors:

GERun: GErun command being run:
GERun:  mpirun --rsh=ssh -machinefile /tmpdir/job/1211349.undefined/machines.unique -np 16 -rr lmp_gpu -sf gpu -pk gpu 1 -in in.lj
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x2b286b6e31d4]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2b286ae6b031]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(+0x44c505) [0x2b286b1ac505]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(+0x7e9b0c) [0x2b286b549b0c]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(+0x64cd70) [0x2b286b3acd70]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(+0x1fe5fa) [0x2b286af5e5fa]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(+0x4664b4) [0x2b286b1c64b4]
/shared/ucl/apps/intel/2020/impi/2019.6.166/intel64/lib/release/libmpi.so.12(MPI_Init+0x11b) [0x2b286b1c1c7b]
lmp_gpu() [0x402622]

I'm beginning to build the non-GPU variants on Michael now:

  • basic Intel;
  • INTEL package variant;
  • GNU + FFTW variant.

All that's left to do now is add the missing variants on Myriad when the cluster is restored to service.