/CloverLeaf

CloverLeaf, in many programming models

Primary LanguageC++

CloverLeaf

WIP

CloverLeaf implementation in a wide range of parallel programming models. This implementation has support for building with and without MPI. When MPI is enabled, all models will adjust accordingly for asynchronous MPI send/recv.

This is a consolidation of the following independent ports with a shared driver and working MPI paths:

Programming Models

CloverLeaf is currently implemented in the following parallel programming models, listed in no particular order:

  • CUDA
  • HIP
  • OpenMP 3 and 4.5
  • C++ Parallel STL (StdPar)
  • Kokkos >= 4
  • SYCL and SYCL 2020

Planned:

  • OpenACC
  • RAJA
  • TBB
  • Thrust (via CUDA or HIP)

Building

Drivers, compiler and software applicable to whichever implementation you would like to build against is required.

CMake

The project supports building with CMake >= 3.13.0, which can be installed without root via the official script.

Each implementation (programming model) is built as follows:

$ cd CloverLeaf

# configure the build, build type defaults to Release
# The -DMODEL flag is required
$ cmake -Bbuild -H. -DMODEL=<model> -DENABLE_MPI=ON <model specific flags prefixed with -D...>

# compile
$ cmake --build build

# run executables in ./build
$ ./build/<model>-cloverleaf

The MODEL option selects one implementation of CloverLeaf to build. The source for each model's implementations are located in ./src/<model>.

Running

CloverLeaf supports the following options:

Usage: --help [OPTIONS]

Options:
  -h  --help                             Print this message
      --list                             List available devices with index and exit
      --device           <INDEX|NAME>    Use device at INDEX from output of --list or substring match iff INDEX is not an id
      --file,--in              <FILE>    Custom clover.in file FILE (defaults to clover.in if unspecified)
      --out                    <FILE>    Custom clover.out file FILE (defaults to clover.out if unspecified)
      --dump                    <DIR>    Dumps all field data in ASCII to ./DIR for debugging, DIR is created if missing
      --profile                          Enables kernel profiling, this takes precedence over the profiler_on in clover.in
      --staging-buffer <true|false|auto> If true, use a host staging buffer for device-host MPI halo exchange.
                                         If false, use device pointers directly for MPI halo exchange.
                                         Defaults to auto which elides the buffer if a device-aware (i.e CUDA-aware) is used.
                                         This option is no-op for CPU-only models.
                                         Setting this to false on an MPI that is not device-aware may cause a segfault.


For example

The output on stdout is machine-readable in YAML format where the Output key contains CloverLeaf 1.3's output format. For example, here's the output of mpirun -np 3 kokkos_cloverleaf --device 0 --file InputDecks/clover_bm_short.in --profile true:

---
Devices:
  0: N6Kokkos4CudaE
CloverLeaf:
  - Ver.: 2.000
  - Deck: InputDecks/clover_bm_short.in
  - Out: clover.out
  - Profiler: true
MPI:
  - Enabled: true
  - Total ranks: 3
  - Header device-awareness (CUDA-awareness): true
  - Runtime device-awareness (CUDA-awareness): true
  - Host-Device halo exchange staging buffer: false
Model:
  - Name: Kokkos 4.0.1
  - Execution: Offload (device)
  - Backend space: N6Kokkos4CudaE
  - Backend host space: N6Kokkos6SerialE
# ---- 
Output: |+1
 Output file clover.out opened. All output will go there.
 Args: --device 0 --file InputDecks/clover_bm_short.in --profile true
 Using input: `InputDecks/clover_bm_short.in`
 Problem initialised and generated
 Launching hydro
 Step 1 time 0 control sound timestep  0.00616258 1,1 x 0 y 0
 Wall clock 0.0259612
 ...... 
 Step 86 time 0.491277 control sound timestep  0.00584781 1,1 x 0 y 0
 Wall clock 1.42524
 Average time per cell 1.79824e-08
  Step time per cell    1.69889e-08
 Step 87 time 0.497124 control sound timestep  0.005848 1,1 x 0 y 0
 Test problem 2 is within 1.17018e-11% of the expected solution
 This test is considered PASSED
 Wall clock 1.44286
 First step overhead 0

 Profiler Output        Time     Percentage
 Timestep              :0.110086 7.629754
 Ideal Gas             :0.000370 0.025662
 Viscosity             :0.001094 0.075812
 PdV                   :0.058765 4.072801
 Revert                :0.000815 0.056463
 Acceleration          :0.001175 0.081414
 Fluxes                :0.001452 0.100665
 Cell Advection        :0.001999 0.138538
 Momentum Advection    :0.003294 0.228296
 Reset                 :0.002566 0.177848
 Summary               :0.014976 1.037959
 Visit                 :0.000000 0.000000
 Tile Halo Exchange    :0.000016 0.001107
 Self Halo Exchange    :0.009350 0.648008
 MPI Halo Exchange     :1.236754 85.715627
 Total                 :1.442712 99.989953
 The Rest              :0.000145 0.010047

Result:
  - Problem: 2
  - Outcome: PASSED