GAMESS RI-MP2 mini-app

General Atomic and Molecular Electronic Structure System (GAMESS) is a popular quantum chemistry software package that has been in use since the 1980s. It can calculate a wide variety of molecular properties using electronic structure methods. One of the methods implemented in GAMESS is resolution-of-identity Møller-Plesset second-order perturbation theory (RI-MP2). RI-MP2 is an electron correlation method. Such methods include instantaneous electron-electron interactions and are required for accurate energy and property calculations on certain classes of molecular systems. Among the electron correlation methods, RI-MP2 is one of the less computationally expensive, but its formal computational complexity is still O(N^5), where N is a measure of system size.

The GAMESS RI-MP2 mini-app computes the correlation energy, taking the Hartree-Fock energy and wave function as inputs. The inputs were generated with GAMESS.
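
As a rough illustration of the quantity being computed, the sketch below evaluates the standard textbook closed-shell RI-MP2 correlation energy from a 3-index integral array B(X,a,i) and the orbital energies. The function name, the nested-list layout, and the toy numbers are assumptions made for this example and are not taken from the mini-app source, which uses BLAS calls on much larger arrays. The four orbital loops combined with the sum over the auxiliary index are where the O(N^5) cost comes from.

```python
def rimp2_energy(B, e_occ, e_vir):
    """Closed-shell RI-MP2 correlation energy (textbook expression).

    B[x][a][i] : 3-index integrals (auxiliary x, virtual a, occupied i)
    e_occ      : energies of the correlated occupied orbitals
    e_vir      : energies of the virtual orbitals
    """
    nx, nv, no = len(B), len(e_vir), len(e_occ)

    def eri(i, a, j, b):
        # RI approximation: (ia|jb) ~= sum_X B(X,a,i) * B(X,b,j)
        return sum(B[x][a][i] * B[x][b][j] for x in range(nx))

    energy = 0.0
    for i in range(no):
        for j in range(no):
            for a in range(nv):
                for b in range(nv):
                    iajb = eri(i, a, j, b)
                    ibja = eri(i, b, j, a)
                    denom = e_occ[i] + e_occ[j] - e_vir[a] - e_vir[b]
                    energy += iajb * (2.0 * iajb - ibja) / denom
    return energy


# Toy input (hypothetical values): 1 auxiliary function, 1 virtual, 1 occupied.
B = [[[0.1]]]
print(rimp2_energy(B, e_occ=[-0.5], e_vir=[0.5]))
```

Because occupied orbital energies lie below virtual ones, the denominator is negative and the correlation energy comes out negative, as expected for MP2.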

Inputs for GAMESS RI-MP2 mini-app

Input data sets from GAMESS

Input data sets for the GAMESS RI-MP2 mini-app include several fundamental parameters (the number of atomic orbital (N) and auxiliary (X) basis functions, and the number of correlated occupied (O) and virtual (V) molecular orbitals), the molecular orbital coefficients, the molecular orbital energies, the 3-index integral matrix B(X,V,O), and the calculated MP2 correlation energy for validation. The following input data sets were generated from GAMESS:

  • benz.kern for Benzene
  • cor.kern for Coronene
  • c60.kern for Fullerene
  • w30.kern for a cluster of 30 water molecules
  • w60.kern for a cluster of 60 water molecules

This git repository contains only one input file, benz.kern, which is the smallest input. You can find bigger inputs at the following links:

cor.kern c60.kern w60.kern

On NVIDIA V100 GPUs, we recommend using the c60.kern, w30.kern, or w60.kern inputs to see meaningful speedups.

Arbitrary data sets with the same data structures as the above inputs

The above data sets require significant I/O time before the correlation energy is computed. This I/O is not part of actual GAMESS workloads. To avoid the unnecessary I/O time, the following arbitrary data sets can be generated during initialization:

  • benz.rand: arbitrary data with the same data structure as benz.kern
  • cor.rand: arbitrary data with the same data structure as cor.kern
  • c60.rand: arbitrary data with the same data structure as c60.kern
  • w30.rand: arbitrary data with the same data structure as w30.kern
  • w60.rand: arbitrary data with the same data structure as w60.kern

Running GAMESS RI-MP2 mini-app

SUMMIT at OLCF

Build the executables

rimp2-cublas:   rimp2 with OpenMP offloading + cublas on GPU,
rimp2-cublasxt: rimp2 with OpenMP offloading + cublasxt on GPU,
rimp2-nvblas:   rimp2 with OpenMP offloading + nvblas on GPU,
rimp2-essl:     rimp2 with OpenMP threading  + ESSL on CPU, and
rimp2-serial:   rimp2 with a single thread   + ESSL on CPU.

$ source source_me_OLCF
$ make clean
$ make all

Run via an interactive job:

$ bsub -P <your project code> -nnodes 1 -W 120 -Is /bin/bash
$ source source_me_OLCF
$ NMPI=x INPUT=xxx EXEC='rimp2-xxx rimp2-yyy' ./             
$ NMPI=x NTHREAD=x INPUT=xxx EXEC='rimp2-zzz' ./
    # NMPI is the number of MPI ranks. If unset, NMPI defaults to 1.
    # NTHREAD is the number of OpenMP threads per MPI rank. If unset, NTHREAD defaults to min(42, 42*NNODES/NMPI).
    # INPUT is the input name. If unset, INPUT defaults to benz.kern.
    # EXEC is the executable name(s). If unset, EXEC defaults to 'rimp2-cublas rimp2-cublasxt rimp2-nvblas' for, and 'rimp2-essl rimp2-serial' for
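
The NTHREAD default above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the use of integer division is an assumption, since the job script's actual rounding is not shown here. The per-node core counts (42 on SUMMIT, 56 on the JLSE Skylake nodes) come from the formulas quoted in this README.

```python
def default_nthread(nnodes, nmpi, cores_per_node=42):
    """Default OpenMP threads per MPI rank: min(C, C*NNODES/NMPI).

    Integer division is an assumption made for this sketch.
    """
    return min(cores_per_node, cores_per_node * nnodes // nmpi)


print(default_nthread(nnodes=1, nmpi=1))                      # one rank gets the whole node
print(default_nthread(nnodes=1, nmpi=2))                      # two ranks share one node
print(default_nthread(nnodes=4, nmpi=24))                     # 6 ranks per node on 4 nodes
print(default_nthread(nnodes=1, nmpi=4, cores_per_node=56))   # JLSE Skylake node
```

With 1 node and 2 ranks this gives 21 threads per rank, and with 4 nodes and 24 ranks it gives 7, which agrees with the thread counts quoted elsewhere in this README.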

Run via a batch job:

$ bsub            
    # This example runs and (only with rimp2-essl) with two inputs (cor.kern and c60.kern)
    #     on 4 SUMMIT nodes with 1, 2, 4, 6, 12, and 24 MPI ranks (1 GPU/MPI, 7 CPU threads/MPI).
    # You may modify this example script for your own tests.

JLSE Skylake nodes at ALCF

Build the executables

rimp2-mkl:      rimp2 with OpenMP threading + MKL on CPU

$ source source_me_JLSE_Intel
$ make clean
$ make all

Run via an interactive job:

$ qsub -I -n 1 -t 120 -q skylake_8180
$ source source_me_JLSE_Intel
$ NMPI=x NTHREAD=x INPUT=xxx EXEC='rimp2-zzz' ./
    # NMPI is the number of MPI ranks. If unset, NMPI defaults to 1.
    # NTHREAD is the number of OpenMP threads per MPI rank. If unset, NTHREAD defaults to min(56, 56*NNODES/NMPI).
    # INPUT is the input name. If unset, INPUT defaults to benz.kern.
    # EXEC is the executable name(s). If unset, EXEC defaults to 'rimp2-mkl' for

Run via a batch job:

$ qsub ./            
    # This example runs with two inputs (cor.kern and c60.kern)
    #     on 1 Skylake 8180 node with 1, 2, and 4 MPI ranks (56 threads in total).
    # You may modify this example script for your own tests.

Figure-of-Merit (FOM)

The Figure-of-Merit (FOM) is the time-to-solution of the input. The mini-app reports three wall times taken over the MPI ranks: minimum, mean, and maximum. The time-to-solution (TTS) is the maximum wall time. For the baseline benchmark, use w60.kern. The reference data on SUMMIT are as follows:

Time-to-solution in seconds (maximum wall time, which is the FOM)
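
The relationship between the three reported wall times and the FOM can be sketched as follows. The function name is hypothetical, and how the mini-app actually gathers per-rank timings (for example, through MPI reductions) is not shown here. The two sample values are the minimum and maximum wall times from the 2-rank w30.kern run reproduced later in this README.

```python
def wall_time_summary(rank_times):
    """Return (minimum, mean, maximum) over per-rank wall times in seconds."""
    return min(rank_times), sum(rank_times) / len(rank_times), max(rank_times)


# Per-rank wall times for a 2-rank run (seconds).
t_min, t_mean, t_max = wall_time_summary([1.182, 1.328])
fom = t_max  # the FOM is the slowest rank's wall time
print(f"min={t_min:.3f}  mean={t_mean:.3f}  max={t_max:.3f}  FOM={fom:.3f}")
```

Using the maximum rather than the mean means that load imbalance between ranks counts against the FOM.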

  • 1 GPU/MPI for rimp2-cublas & rimp2-cublasxt
  • 7 threads/MPI for rimp2-essl
             --------------- w30.kern ---------------  --------------- w60.kern ---------------
Nodes  NMPI  rimp2-cublas  rimp2-cublasxt  rimp2-essl  rimp2-cublas  rimp2-cublasxt  rimp2-essl
    1     1         2.899           2.314      86.324        87.301          72.903    2727.419
    1     2         1.582           1.287      43.848        44.646          37.512    1386.807
    1     4         0.899           0.759      26.768        23.181           19.67     792.305
    1     6         0.664           0.643      19.333         16.08          14.074     563.707
    2    12         0.447           0.397      12.626         8.845           7.999     317.892
    4    24         0.402           0.379       9.358         5.383           4.921     212.748
    8    48         0.337           0.308       9.347         3.722           3.687     154.441
   16    96         0.358           0.332       9.349         2.923           3.169     150.704
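
One way to read the reference data is in terms of speedup and parallel efficiency relative to the single-rank run. The sketch below does this for the w60.kern, rimp2-cublas column. The times are copied from the table above, while the function name and the definition of efficiency as speedup divided by the increase in rank count are choices made for this example.

```python
# Times (seconds) from the w60.kern / rimp2-cublas column of the reference table.
nmpi = [1, 2, 4, 6, 12, 24, 48, 96]
times = [87.301, 44.646, 23.181, 16.08, 8.845, 5.383, 3.722, 2.923]


def scaling(nmpi, times):
    """Return (ranks, speedup, efficiency) tuples relative to the first entry."""
    base_n, base_t = nmpi[0], times[0]
    rows = []
    for n, t in zip(nmpi, times):
        speedup = base_t / t
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


for n, s, e in scaling(nmpi, times):
    print(f"NMPI={n:3d}  speedup={s:6.2f}  efficiency={e:5.2f}")
```

The same function can be applied to any other column by swapping in its times.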

Examples of runs

A quick test for build and validation

In order to build all executables and test them with benz.kern, you may run ./ The following is an example on SUMMIT:

bash-4.2$ . source_me_OLCF 
bash-4.2$ ./
rm -rf *.o *.mod rimp2-nvblas rimp2-cublas rimp2-cublasxt rimp2-essl rimp2-serial
mpifort -qsmp=omp -qoffload -qsuffix=cpp=f90 -DNVBLAS -g rimp2_energy_whole_KERN.f90 -o rimp2-nvblas -lnvblas -L/sw/summit/essl/6.1.0-2/essl/6.1/lib64 -lessl
** rimp2_shared   === End of Compilation 1 ===
** rimp2_input   === End of Compilation 2 ===
** mp2correng   === End of Compilation 3 ===
** rimp2_trape_dec   === End of Compilation 4 ===
** rimp2_energy_whole   === End of Compilation 5 ===
** rimp2_energyij   === End of Compilation 6 ===
** initialization   === End of Compilation 7 ===
** read_input_file   === End of Compilation 8 ===
1501-510  Compilation successful for file rimp2_energy_whole_KERN.f90.
rm -rf *.o *.mod
mpifort -qsmp=omp -qoffload -qsuffix=cpp=f90 -DCUBLAS -g -c cublasf.f90
** cublasf   === End of Compilation 1 ===
1501-510  Compilation successful for file cublasf.f90.
mpifort -qsmp=omp -qoffload -qsuffix=cpp=f90 -DCUBLAS -g rimp2_energy_whole_KERN.f90 -o rimp2-cublas -lcublas cublasf.o
** rimp2_shared   === End of Compilation 1 ===
** rimp2_input   === End of Compilation 2 ===
** mp2correng   === End of Compilation 3 ===
** rimp2_trape_dec   === End of Compilation 4 ===
** rimp2_energy_whole   === End of Compilation 5 ===
** rimp2_energyij   === End of Compilation 6 ===
** initialization   === End of Compilation 7 ===
** read_input_file   === End of Compilation 8 ===
1501-510  Compilation successful for file rimp2_energy_whole_KERN.f90.
rm -rf *.o *.mod
mpifort -qsmp=omp -qoffload -qsuffix=cpp=f90 -DCUBLASXT -g -c cublasf.f90
** cublasf   === End of Compilation 1 ===
1501-510  Compilation successful for file cublasf.f90.
mpifort -qsmp=omp -qoffload -qsuffix=cpp=f90 -DCUBLASXT -g rimp2_energy_whole_KERN.f90 -o rimp2-cublasxt -lcublas cublasf.o
** rimp2_shared   === End of Compilation 1 ===
** rimp2_input   === End of Compilation 2 ===
** mp2correng   === End of Compilation 3 ===
** rimp2_trape_dec   === End of Compilation 4 ===
** rimp2_energy_whole   === End of Compilation 5 ===
** rimp2_energyij   === End of Compilation 6 ===
** initialization   === End of Compilation 7 ===
** read_input_file   === End of Compilation 8 ===
1501-510  Compilation successful for file rimp2_energy_whole_KERN.f90.
rm -rf *.o *.mod
mpifort -qsmp=omp -qsuffix=cpp=f90 -DCPU -g  rimp2_energy_whole_KERN.f90 -o rimp2-essl -L/sw/summit/essl/6.1.0-2/essl/6.1/lib64 -lessl
** rimp2_shared   === End of Compilation 1 ===
** rimp2_input   === End of Compilation 2 ===
** mp2correng   === End of Compilation 3 ===
** rimp2_trape_dec   === End of Compilation 4 ===
** rimp2_energy_whole   === End of Compilation 5 ===
** rimp2_energyij   === End of Compilation 6 ===
** initialization   === End of Compilation 7 ===
** read_input_file   === End of Compilation 8 ===
1501-510  Compilation successful for file rimp2_energy_whole_KERN.f90.
rm -rf *.o *.mod
mpifort -qsmp=omp -qsuffix=cpp=f90 -g rimp2_energy_whole_KERN.f90 -o rimp2-serial -L/sw/summit/essl/6.1.0-2/essl/6.1/lib64 -lessl
** rimp2_shared   === End of Compilation 1 ===
** rimp2_input   === End of Compilation 2 ===
** mp2correng   === End of Compilation 3 ===
** rimp2_trape_dec   === End of Compilation 4 ===
** rimp2_energy_whole   === End of Compilation 5 ===
** rimp2_energyij   === End of Compilation 6 ===
** initialization   === End of Compilation 7 ===
** read_input_file   === End of Compilation 8 ===
1501-510  Compilation successful for file rimp2_energy_whole_KERN.f90.
rm -rf *.o *.mod
Running this script with 1 node(s) with up to 6 GPUs:
   NMPI is set to 2.
   INPUT is set to benz.kern. For another INPUT, use INPUT=xxxx before this job script.
   EXEC is set to rimp2-cublasxt rimp2-cublas rimp2-nvblas. For another EXEC, use EXEC='x y' before this job script.

[[[Running rimp2-cublasxt with 2 MPI rank(s)...]]]
 You are running the code with cublasxt on GPU
     Reading data from benz.kern                                                                       
     NAUXBASD,NCOR,NACT,NVIR,NBF =    420     6    15    93   120
     NQVV =        15
     Memory Footprint:
          B32(   39060,     15) =      4.6872 MB
          eij(      15,     15) =      0.0018 MB
          eab(      93,     93) =      0.0692 MB
          QVV(   93,  15,   93) =      1.0379 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =     1
             Rel. error of computed MP2 corr. energy =  0.28444E-15
                               Wall time (minimum)   =    0.002 sec
                               Wall time (mean)      =    0.003 sec
                               Wall time (maximum)   =    0.004 sec
          Passed :-) 

[[[Running rimp2-cublas with 2 MPI rank(s)...]]]
 You are running the code with cublas on GPU
     Reading data from benz.kern                                                                       
     NAUXBASD,NCOR,NACT,NVIR,NBF =    420     6    15    93   120
     NQVV =        15
     Memory Footprint:
          B32(   39060,     15) =      4.6872 MB
          eij(      15,     15) =      0.0018 MB
          eab(      93,     93) =      0.0692 MB
          QVV(   93,  15,   93) =      1.0379 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =     1
             Rel. error of computed MP2 corr. energy =  0.28444E-15
                               Wall time (minimum)   =    0.001 sec
                               Wall time (mean)      =    0.002 sec
                               Wall time (maximum)   =    0.002 sec
          Passed :-) 

[[[Running rimp2-nvblas with 2 MPI rank(s)...]]]
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf'
[NVBLAS] Cannot Log File 'nvblas.log'
[NVBLAS] Using devices :0 
[NVBLAS] Config parsed
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf'
[NVBLAS] Cannot Log File 'nvblas.log'
[NVBLAS] Using devices :0 
[NVBLAS] Config parsed
 You are running the code with nvblas on GPU
     Reading data from benz.kern                                                                       
     NAUXBASD,NCOR,NACT,NVIR,NBF =    420     6    15    93   120
     NQVV =        15
     Memory Footprint:
          B32(   39060,     15) =      4.6872 MB
          eij(      15,     15) =      0.0018 MB
          eab(      93,     93) =      0.0692 MB
          QVV(   93,  15,   93) =      1.0379 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =     1
             Rel. error of computed MP2 corr. energy =  0.28444E-15
                               Wall time (minimum)   =    0.002 sec
                               Wall time (mean)      =    0.003 sec
                               Wall time (maximum)   =    0.004 sec
          Passed :-) 

Running this script with 1 node(s) with up to 42 CPU threads in total:
   NMPI is set to 2.
   NTHREAD is set to 21. For another NTHREAD, use NTHREAD=x before this job script.
   INPUT is set to benz.kern. For another INPUT, use INPUT=xxxx before this job script.
   EXEC is set to rimp2-essl rimp2-serial. For another EXEC, use EXEC='x y' before this job script.

[[[Running rimp2-essl with 2 MPI rank(s) and 21 threads/MPI ...]]]
 You are running the code with CPU OpenMP
     Reading data from benz.kern                                                                       
     NAUXBASD,NCOR,NACT,NVIR,NBF =    420     6    15    93   120
     NQVV =        15
     Memory Footprint:
          B32(   39060,     15) =      4.6872 MB
          eij(      15,     15) =      0.0018 MB
          eab(      93,     93) =      0.0692 MB
          QVV(   93,  15,   93) =      1.0379 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =    21
             Rel. error of computed MP2 corr. energy =  0.00000E+00
                               Wall time (minimum)   =    0.004 sec
                               Wall time (mean)      =    0.005 sec
                               Wall time (maximum)   =    0.006 sec
          Passed :-) 

[[[Running rimp2-serial with 2 MPI rank(s) and 21 threads/MPI ...]]]
 You are running the code serially
     Reading data from benz.kern                                                                       
     NAUXBASD,NCOR,NACT,NVIR,NBF =    420     6    15    93   120
     NQVV =        15
     Memory Footprint:
          B32(   39060,     15) =      4.6872 MB
          eij(      15,     15) =      0.0018 MB
          eab(      93,     93) =      0.0692 MB
          QVV(   93,  15,   93) =      1.0379 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =    21
             Rel. error of computed MP2 corr. energy =  0.00000E+00
                               Wall time (minimum)   =    0.021 sec
                               Wall time (mean)      =    0.023 sec
                               Wall time (maximum)   =    0.025 sec
          Passed :-) 
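
The "Memory Footprint" lines in the output above can be reproduced from the printed dimensions. The sketch below assumes 8-byte (double precision) elements and 1 MB = 10^6 bytes. Both are assumptions inferred from the printed numbers rather than read from the source code, and the function name is hypothetical.

```python
def footprint_mb(*dims, bytes_per_element=8):
    """Size in MB (10^6 bytes) of an array with the given dimensions."""
    count = 1
    for d in dims:
        count *= d
    return count * bytes_per_element / 1.0e6


# benz.kern dimensions from the log: NAUXBASD=420, NACT=15, NVIR=93, NQVV=15
naux, nact, nvir, nqvv = 420, 15, 93, 15
print(f"B32 = {footprint_mb(naux * nvir, nact):.4f} MB")
print(f"eij = {footprint_mb(nact, nact):.4f} MB")
print(f"eab = {footprint_mb(nvir, nvir):.4f} MB")
print(f"QVV = {footprint_mb(nvir, nqvv, nvir):.4f} MB")
```

Under these assumptions the first dimension of B32 is NAUXBASD*NVIR (420*93 = 39060), and the same formula applied to the w30 dimensions (2520, 120, 570) gives the larger footprints shown in the w30 logs.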


Runtime comparison of w30.kern and w30.rand

Mini-app runtimes with real data sets (ending with .kern) include significant file I/O time, while runtimes with generated data sets (ending with .rand) have minimal I/O time. The following examples show similar wall times for computing the correlation energy, but the total runtimes measured by the time command are quite different.

Input      Total runtime   FOM (max wall time)
w30.kern   63.727s         1.328s
w30.rand   5.069s          1.310s
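
The difference between the two columns is the time spent outside the timed correlation-energy kernel. The sketch below works that out from the numbers in the table. The function name is hypothetical, and attributing the whole difference to "overhead" is a simplification: it lumps together file I/O, job launch, and initialization.

```python
def overhead(total_runtime, fom):
    """Seconds spent outside the timed kernel, and their share of the total."""
    extra = total_runtime - fom
    return extra, extra / total_runtime


# (total runtime, FOM) in seconds, copied from the table above.
runs = {"w30.kern": (63.727, 1.328), "w30.rand": (5.069, 1.310)}
for name, (total, fom) in runs.items():
    extra, share = overhead(total, fom)
    print(f"{name}: {extra:.3f} s outside the kernel ({100 * share:.1f}% of total)")
```

For w30.kern nearly all of the total runtime falls outside the kernel, which is why the .rand inputs are convenient for repeated timing runs.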

w30.kern with NMPI=2 on SUMMIT

bash-4.2$ time NMPI=2 INPUT=w30.kern EXEC=rimp2-cublasxt ./
Running this script with 1 node(s) with up to 6 GPUs:
   NMPI is set to 2.
   INPUT is set to w30.kern.
   EXEC is set to rimp2-cublasxt.

[[[Running rimp2-cublasxt with 2 MPI rank(s)...]]]
 You are running the code with cublasxt on GPU
     Reading data from w30.kern                                                                        
     NAUXBASD,NCOR,NACT,NVIR,NBF =   2520    30   120   570   750
     NQVV =       120
     Memory Footprint:
          B32( 1436400,    120) =   1378.9440 MB
          eij(     120,    120) =      0.1152 MB
          eab(     570,    570) =      2.5992 MB
          QVV(  570, 120,  570) =    311.9040 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =     1
             Rel. error of computed MP2 corr. energy =  0.00000E+00
                               Wall time (minimum)   =    1.182 sec
                               Wall time (mean)      =    1.255 sec
                               Wall time (maximum)   =    1.328 sec
          Passed :-) 

real    1m3.727s
user    0m1.104s
sys     0m0.250s

w30.rand with NMPI=2 on SUMMIT

bash-4.2$ time NMPI=2 INPUT=w30.rand EXEC=rimp2-cublasxt ./
Running this script with 1 node(s) with up to 6 GPUs:
   NMPI is set to 2.
   INPUT is set to w30.rand.
   EXEC is set to rimp2-cublasxt.

[[[Running rimp2-cublasxt with 2 MPI rank(s)...]]]
 You are running the code with cublasxt on GPU
     Generating arbitrary input data with the structure of w30.kern
     NAUXBASD,NCOR,NACT,NVIR,NBF =   2520    30   120   570   750
     NQVV =       120
     Memory Footprint:
          B32( 1436400,    120) =   1378.9440 MB
          eij(     120,    120) =      0.1152 MB
          eab(     570,    570) =      2.5992 MB
          QVV(  570, 120,  570) =    311.9040 MB

                               Number of MPI ranks   =     2
                               Number of OMP threads =     1
             Rel. error of computed MP2 corr. energy =  0.15987E-14
                               Wall time (minimum)   =    1.136 sec
                               Wall time (mean)      =    1.223 sec
                               Wall time (maximum)   =    1.310 sec
          Passed :-) 

real    0m5.069s
user    0m1.128s
sys     0m0.225s