Accelerate with CUDA, OpenACC and OpenMP - Unified and Performance Portable

Hybrid Fortran - A Framework for GPU Acceleration

Join the chat at https://gitter.im/muellermichel/Hybrid-Fortran

What & Why

Problems With Accelerator Programming

The Idea Behind Hybrid Fortran

Quickstart

  1. Clone this repo and point the HF_DIR environment variable to its path.
  2. Type cd $HF_DIR && make example. This will create an example project directory below $HF_DIR. Note: You can move this wherever you want in order to use it as a template for your projects.
  3. Have a look at example/source/example.h90, which will guide you through and show you how to use Hybrid Fortran. ( Psst: It's better to open it in your editor of choice with Fortran syntax highlighting, but if you want to skip steps 1-2 you may also click the link. ;-) )

A Few More Words

Hybrid Fortran is ..

  • .. a directive based extension for the Fortran language.
  • .. a way for you to keep writing your Fortran code like you're used to - only now with GPGPU support.
  • .. a preprocessor for your code - it takes Fortran files (with Hybrid Fortran extensions) as input and outputs CUDA Fortran or OpenMP Fortran code (or whatever else you'd like to have as a backend).
  • .. a build system that handles building two separate versions (CPU / GPU) of your codebase automatically, including all the preprocessing.
  • .. a test system that handles verification of your outputs automatically after setup.
  • .. a framework for you to build your own parallel code implementations (OpenCL, ARM, FPGA, Hamster Wheel.. as long as it has some parallel Fortran support you're good) while keeping the same source files.

Hybrid Fortran has been successfully used for porting the Physical Core of Japan's national next generation weather prediction model to GPGPU. We're currently planning to port the internationally used Open Source weather model WRF to Hybrid Fortran as well.

Hybrid Fortran has been developed since 2012 by Michel Müller, MSc ETH Zurich, as a guest at Prof. Aoki's Gordon Bell award winning laboratory at the Tokyo Institute of Technology, as well as during a temporary stay with Prof. Maruyama at the RIKEN Advanced Institute for Computational Science in Kobe, Japan.

Even More Words

The following blog entry explains why Hybrid Fortran was created and how it can help you:

Accelerators in HPC – Having the Cake and Eating It Too

License

Hybrid Fortran is available under the GNU Lesser General Public License (LGPL).

Sample Results

| Name | Performance Results | Speedup HF on 6 Core vs. 1 Core [1] | Speedup HF on GPU vs. 6 Core [1] | Speedup HF on GPU vs. 1 Core [1] |
|---|---|---|---|---|
| Japanese Physical Weather Prediction Core (121 Kernels) | Slides Only, Slidecast | 4.47x | 3.63x | 16.22x |
| 3D Diffusion | Link | 1.06x (Compare Performance) | 10.94x (Compare Performance, Compare Speedup) | 11.66x |
| Particle Push | Link | 9.08x (Compare Performance) | 21.72x (Compare Performance, Compare Speedup) | 152.79x |
| Poisson on FEM Solver with Jacobi Approximation | Link | 1.41x | 5.13x | 7.28x |
| MIDACO Ant Colony Solver with MINLP Example | Link | 5.26x | 10.07x | 52.99x |

CPU Sample Result (Diffusion)

GPU Sample Result (Diffusion)

Code Example

The following sample code shows a wrapper subroutine and an add subroutine. Please note that storage order inefficiencies are ignored in this example: passing the strided slices a(x,y,:), b(x,y,:) and c(x,y,:) will typically make the compiler create implicit copies of the z-dimension of arrays a, b and c.

module example
contains
subroutine wrapper(a, b, c)
    real, intent(in), dimension(NX, NY, NZ) :: a, b
    real, intent(out), dimension(NX, NY, NZ) :: c
    integer(4) :: x, y
    do y=1,NY
      do x=1,NX
        call add(a(x,y,:), b(x,y,:), c(x,y,:))
      end do
    end do
end subroutine

subroutine add(a, b, c)
    real, intent(in), dimension(NZ) :: a, b
    real, intent(out), dimension(NZ) :: c
    integer :: z
    do z=1,NZ
        c(z) = a(z) + b(z)
    end do
end subroutine
end module example

Here's what this code would look like in Hybrid Fortran, parallelizing the x and y dimensions on both CPU and GPU.

module example
contains
subroutine wrapper(a, b, c)
  real, dimension(NZ), intent(in) :: a, b
  real, dimension(NZ), intent(out) :: c
  @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
  a, b, c
  @end domainDependant
  @parallelRegion{appliesTo(CPU), domName(x,y), domSize(NX, NY)}
  call add(a, b, c)
  call mult(a, b, c)
  @end parallelRegion
end subroutine

subroutine add(a, b, c)
  real, dimension(NZ), intent(in) :: a, b
  real, dimension(NZ), intent(out) :: c
  integer :: z
  @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
  a, b, c
  @end domainDependant
  @parallelRegion{appliesTo(GPU), domName(x,y), domSize(NX, NY)}
  do z=1,NZ
      c(z) = a(z) + b(z)
  end do
  @end parallelRegion
end subroutine

subroutine mult(a, b, c)
  !...
end subroutine

end module example

Please note the following:

  • The x and y dimensions have been abstracted away, even in the wrapper. We don't need to privatize the add subroutine in x and y as we would need to in CUDA or OpenACC. The actual computational code in the add subroutine has been left untouched.

  • We now have two parallelizations: For the CPU the program is parallelized at the wrapper level using OpenMP. For GPU the program is parallelized using CUDA Fortran at the callgraph leaf level. In between the two can be an arbitrarily deep callgraph, containing arbitrarily many parallel regions (with some restrictions, see below). The data copying from and to the device as well as the privatization in 3D is all handled by the Hybrid Fortran preprocessor framework.
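
To make the CPU side of this more concrete, here is a hand-written sketch of plain OpenMP Fortran that the Hybrid Fortran version above is conceptually equivalent to when built for CPU. This is not actual preprocessor output. The module name example_cpu, the demo program, the array sizes and the z-first storage order are all assumptions made for illustration; only the placement of the parallel loops at the wrapper level is taken from the description above.

```fortran
! Hand-written illustration, NOT generated by Hybrid Fortran.
! Sizes and the z-first data layout are assumptions for this sketch.
module example_cpu
  implicit none
  integer, parameter :: NX = 4, NY = 4, NZ = 8   ! hypothetical sizes
contains
  subroutine wrapper(a, b, c)
    real, intent(in),  dimension(NZ, NX, NY) :: a, b
    real, intent(out), dimension(NZ, NX, NY) :: c
    integer :: x, y
    ! CPU parallel region at the wrapper level: the x and y loops that
    ! were abstracted away reappear here as one OpenMP loop nest.
    !$omp parallel do collapse(2)
    do y = 1, NY
      do x = 1, NX
        ! With z as the first dimension these slices are contiguous,
        ! so no temporary copies are needed for the call.
        call add(a(:, x, y), b(:, x, y), c(:, x, y))
      end do
    end do
    !$omp end parallel do
  end subroutine

  ! The computational code is the same as in the original add subroutine.
  subroutine add(a, b, c)
    real, intent(in),  dimension(NZ) :: a, b
    real, intent(out), dimension(NZ) :: c
    integer :: z
    do z = 1, NZ
      c(z) = a(z) + b(z)
    end do
  end subroutine
end module example_cpu

program demo
  use example_cpu
  implicit none
  real, dimension(NZ, NX, NY) :: a, b, c
  a = 1.0
  b = 2.0
  call wrapper(a, b, c)
  ! Minimal self-check: every element should be 1.0 + 2.0.
  if (any(abs(c - 3.0) > 1.0e-6)) error stop "unexpected result"
  print *, "all elements of c equal 3.0"
end program demo
```

With gfortran, something like `gfortran -fopenmp sketch.f90 && ./a.out` should build and run it (the file name is arbitrary); without `-fopenmp` the directives are treated as comments and the code runs serially. For the GPU build, the parallel region would instead sit inside add, which is what the appliesTo(GPU) region in the Hybrid Fortran version expresses.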

Documentation

Published Materials