- thread-MPI (enabled by default in GROMACS): uses native hardware threads on a single node, more efficiently than external MPI.
- From the user's perspective, real MPI and thread-MPI look almost the same.
- GROMACS uses "MPI ranks" to mean either kind (MPI and thread-MPI).
- External MPI runs more slowly than thread-MPI.
- Diagnostics: at runtime, mdrun writes to the log/stdout/stderr to inform the user about their choices and the consequences.
- `OMP_NUM_THREADS`: number of OpenMP threads; we control this.
- `mpirun` before `./gmx`: will use external MPI.
- `-nt`: number of threads (whether thread-MPI ranks or OpenMP threads within ranks depends on other settings).
- `-ntmpi`: number of thread-MPI ranks to use. Default is one rank per core.
- `-ntomp`: number of OpenMP threads per rank (honors `OMP_NUM_THREADS`), max 64.
- `-npme`: number of ranks to dedicate to the long-ranged component of PME. Keep to 2?
- `-ntomp_pme`: number of OpenMP threads per separate PME rank; defaults to the `-ntomp` value.
- `-pin`: attempt to set affinity of threads to cores. Keep to "on".
- `-nb`: if no GPU, set to "cpu".
- MPI might be more performant than OpenMP due to less memory contention.
- Separate PME (particle-mesh Ewald) ranks seem to give better performance.
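A hypothetical `mdrun` invocation combining these flags (the input file `topol.tpr` and the specific rank/thread counts are assumptions for illustration, not from the notes):

```
# Sketch: 8 thread-MPI ranks x 4 OpenMP threads, 2 ranks dedicated to PME,
# thread pinning on, nonbonded work forced to the CPU.
gmx mdrun -s topol.tpr -ntmpi 8 -ntomp 4 -npme 2 -pin on -nb cpu
```

Check the log output afterwards: mdrun reports the rank/thread layout it actually used.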
tar xfz gromacs-2023.3.tgz
cd gromacs-2023.3
mkdir build-gromacs # or whatever directory, but should be a subdirectory
cd build-gromacs
cmake ..
- Minimum GNU gcc 9.
- Pass multiple options at once.
- Use `ccmake` after `cmake ..` returns to see all the settings that were chosen.
- Most options have `CMAKE_` and `GMX_` prefixes.
- Help GROMACS find the right libraries and binaries external to it:
    - `-DCMAKE_INCLUDE_PATH` for header files
    - `-DCMAKE_LIBRARY_PATH` for libraries
    - `-DCMAKE_PREFIX_PATH` for headers, libraries, and binaries (e.g. `/usr/local`). For example, run `which hwloc` and stick that in.
- `-DCMAKE_INSTALL_PREFIX`: path where GROMACS installs, and places headers, binaries, and libraries. This is the root of the GROMACS installation.
- `-DGMX_DOUBLE=off`: turn off double precision, as it's slower.
- MPI: `-DGMX_MPI=ON` (this binary is `gmx_mpi`).
- FFT library: `-DGMX_BUILD_OWN_FFTW=ON` to let GROMACS build FFTW from source (this is good enough), or `-DGMX_FFT_LIBRARY=<your library, e.g. fftw3> -DFFTWF_LIBRARY=<path to library>`.
- If `hwloc` is installed: `-DGMX_HWLOC=ON` (to improve runtime detection of hardware capabilities).
- SIMD: if in doubt, choose the lowest value you think might work and see what mdrun says (the highest value leads to performance loss on processors like Skylake and 1st-gen Zen).
    - Run `lscpu` and look for "Flags".
    - Set with `-DGMX_SIMD=<value>`.
- BLAS: `-DGMX_BLAS_USER=<path to your BLAS>`
- LAPACK: `-DGMX_LAPACK_USER=<path to your LAPACK>`
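Putting the options above together, a configure line might look like this (the install prefix and the SIMD value are placeholders, and building FFTW from source is assumed):

```
cmake .. \
  -DCMAKE_INSTALL_PREFIX=/your/installation/prefix/here \
  -DGMX_DOUBLE=off \
  -DGMX_MPI=ON \
  -DGMX_BUILD_OWN_FFTW=ON \
  -DGMX_HWLOC=ON \
  -DGMX_SIMD=AVX2_256
```

Run `ccmake ..` afterwards to inspect what was actually chosen.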
With multiple cores, build with:
make -j <number of processors>
make check
make install
Source their GMXRC script before running GROMACS! (Stick this in your SLURM script.)
source /your/installation/prefix/here/bin/GMXRC
- Get the latest version of your C and C++ compilers.
- Check that you have CMake version 3.18.4 or later.
- Get and unpack the latest version of the GROMACS tarball.
- Make a separate build directory and change to it.
- Run cmake with the path to the source as an argument.
- Run make, make check, and make install.
- Source GMXRC to get access to GROMACS.
Lustre filesystem components:
- MDS – Manages filenames and directories, file stripe locations, locking, ACLs, etc.
- MDT – Block device used by MDS to store metadata information
- OSS – Handles I/O requests for file data
- OST – Block device used by OSS to store file data. Each OSS usually serves multiple OSTs.
- MGS – Management server. Stores configuration information for one or more Lustre file systems.
- MGT – Block device used by MGS for data storage
- LNET – (Lustre Networking) Provides the underlying communication infrastructure
Types of benchmarking
- IOZONE: i/o speed
- choose big number to read/write
- read > write
- care about the time it takes to read/write.
- HPL: FLOPS; see the HPL section below
- IPM:
- STREAM: memory
- same as IOZONE but uses memory instead
- bandwidth vs. size affected by cache (L1, L2, L3, RAM)
- L1 is the best, so we want as much data in L1 as possible. L1 has split instruction and data caches.
- L2, L3, and RAM are shared, which can become a bottleneck.
HPL:
Accessible here: https://www.netlib.org/benchmark/hpl
Download the tar file onto the machine and uncompress it:
wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar -xf hpl-2.3.tar.gz
Builds easily. Load a compiler, an MPI library, and a BLAS library.
- Use `module avail` to see available modules.
- Use `module load` to load one.
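For example (the module names below are hypothetical; check what `module avail` shows on your system):

```
module load gcc openmpi openblas
```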
Make a `build` directory somewhere; it could be inside the uncompressed `hpl` directory.
mkdir build
pwd # gives the absolute path
./configure -prefix /absolute/path/to/build
make
make install
Your executable will be in the `bin` folder of the `build` dir. It will be called `xhpl`.
You need to add an `HPL.dat` file if it doesn't already exist.
touch HPL.dat
Regardless of whether it does, use this tool to generate one: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
Inside `HPL.dat`:
- `N` should be > 100,000
- `NB` should be 192
- General rule for `P` and `Q`:
    - `P * Q` = number of ranks
    - Make sure they're as square as possible
    - `Q > P`
- `#N` declares the number of `N` values. Set it to 1.
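A sketch of the relevant lines of `HPL.dat` under these rules (the values assume 192 ranks and an `N` padded to a multiple of `NB`; a generated file contains more lines than shown):

```
1            # of problems sizes (N)
110592       Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
12           Ps
16           Qs
```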
Things to consider:
To check whether it's working, set `N` to something low (10,000 or less); the run should be really fast (but probably not give a good result).
Turn this into a SLURM script and increase `N` to what the website suggests (and maybe even a little bit higher!).
You can also try changing out compiler/mpi/blas library. This will require you to reconfigure and rebuild.
Play around with different numbers of nodes/cores. Remember to change the `mpirun` command as well as `P` and `Q`.
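One way to recompute `P` and `Q` when the core count changes is a small helper that picks the squarest factor pair with `Q >= P`, per the rules above (a sketch; the function name is made up):

```shell
# Find the "squarest" P x Q factorization of a rank count, with Q >= P.
squarest_pq() {
  local n=$1 p=1 i
  # The largest divisor not exceeding sqrt(n) gives the squarest split.
  for ((i = 1; i * i <= n; i++)); do
    (( n % i == 0 )) && p=$i
  done
  echo "$p $((n / p))"
}

squarest_pq 192   # 3 nodes x 64 cores -> "12 16"
```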
To run using MPI, you'll need to do something that looks like the following:
mpirun -np <CORES> ./xhpl # CORES = TOTAL cores used, so 3 nodes of 64 is 192
Adding OpenMP to this:
- The environment variable `OMP_NUM_THREADS` sets the number of OMP threads per rank.
Altogether:
OMP_NUM_THREADS=<THREADS PER RANK> mpirun -np <RANKS> ./xhpl # ranks * threads per rank = total cores
With multiple physical CPUs (sockets) per node, add these additional flags:
--bind-to socket --map-by socket
If in doubt, use `OMP_NUM_THREADS=1`:
- 1 core per rank, and set `P` and `Q` so that `P * Q` equals the total number of cores.
Find the peak FLOPS:
- This is given by `NUMBER OF NODES * NUMBER OF CORES PER NODE * CLOCK SPEED (GHz) * IPC`.
- This gives a value in GFLOPS.
- Good FLOPS: roughly 65% of peak FLOPS.
To get information about the CPU, use `lscpu`:
- Use the base clock speed, not the boost clock speed.
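For example (output is machine-specific; the base clock usually appears in the model name):

```
lscpu | grep -E 'Model name|MHz'
```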
To diagnose, use `top` to check load and CPU usage:
- To sort: hit `f`, select the desired sort field with the arrow keys, then hit `s` and `<esc>`.
Filesystem:
- Should be NFS (shared)
Scheduler:
- use
slurm
. See below.
MPI libraries:
- OpenMPI
- IntelMPI
- MPICH
Compilers:
- gcc
- Clang
- LLVM
- Intel (good with IntelMPI)
Networking:
- Needs to have low latency and high bandwidth
- Ethernet (1-10 Gb)
- InfiniBand (better than Ethernet for performance)
CPU vs. GPU:
- CPU cheaper and easier to optimize.
- GPU harder to configure.
The result you're looking for is the value towards the bottom that says "GFLOPs". That's pretty much the only value we care about.
Be aware of `nodes` vs. `ntasks`.
Documentation: https://slurm.schedmd.com/sbatch.html
# submit.sbat or
#!/bin/bash
#SBATCH --time=0-0:01:00 # d-hh:mm:ss
#SBATCH --nodes=1 # NUMBER OF NODES
#SBATCH --ntasks=4 # TOTAL CORES
#SBATCH --account=BLAH # probably given
#SBATCH --chdir=BLAH # probably share a parent dir with everything in there
#SBATCH --job-name=BLAH # hpl_bristol something there
#SBATCH --output=BLAH # both stdout and stderr in here
# Rest of script here
CLI things to note:
squeue -u <USERNAME>
Submit scripts:
sbatch /path/to/script