FLASH

The purpose of this thin wrapper is to accelerate host code programming for heterogeneous accelerators


FLASH Introduction

As accelerator platforms become increasingly domain-specific and widely available, HPC developers face the problem of perpetually refactoring application codes in order to keep ground-breaking science practical. Code refactoring often requires a large amount of time and expertise on the part of HPC developers to understand the new accelerator hardware, its runtime interfaces, and the application architecture. As a result, frameworks for improving code portability have taken center stage as a means to reduce the refactoring effort needed to adopt new hardware accelerators in HPC. Moreover, with the performance of general-purpose computing quickly plateauing, HPC software solutions must look toward more application- and domain-specific accelerators (ASAs/DSAs) to reach the next notable milestones in performance. This will undoubtedly increase the cadence of code refactoring to an impractical level, even with existing portability solutions, and more so without them.

FLASH 1.0 is a software framework for rapid parallel deployment and enhanced host code portability in heterogeneous computing. FLASH 1.0 is a C++-based framework that critically serves as a clear way-point separating hardware-agnostic from hardware-specific logic to facilitate code portability. FLASH 1.0 uses variadic templates and mixin-idiom-based interfaces as the primary vehicle to enable simple, extensible, hardware-agnostic interfaces that are easily supportable by legacy and future accelerators of various architectures. FLASH 1.0 consists of four major components: the frontend (host and kernel) and backend interfaces, and the frontend and backend runtimes, which together enable extensibility and portability.
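FLASH's actual interfaces are not shown here; the following minimal sketch only illustrates the variadic-template and mixin idiom the framework builds on, with hypothetical names (RuntimeInterface, CpuBackend, CudaBackend):

  #include <cstdio>

  // Each mixin supplies a backend-specific overload set.
  struct CpuBackend  { void submit(int id)  { std::printf("cpu kernel %d\n",  id); } };
  struct CudaBackend { void submit(long id) { std::printf("cuda kernel %ld\n", id); } };

  // The variadic template composes any number of backend mixins
  // behind one hardware-agnostic surface.
  template<typename... Mixins>
  struct RuntimeInterface : Mixins...
  {
    using Mixins::submit...;  // expose every mixin's submit overloads (C++17)
  };

  int main()
  {
    RuntimeInterface<CpuBackend, CudaBackend> rt;
    rt.submit(1);   // resolves to CpuBackend::submit
    rt.submit(2L);  // resolves to CudaBackend::submit
    return 0;
  }

New backends can be supported by adding a mixin rather than touching the hardware-agnostic host code.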

Building the shared object (.so) with the default backends

1. make OR make all                                                              #enables CPU, CUDA*, FPGA** backends by DEFAULT

*  Dependent on having OpenCL for Intel FPGA installed
** Dependent on having CUDA Toolkit installed

The output .so will be placed in ./build/lib64, and the headers will be in ./build/include

Building shared object with specific backends

Valid backends are: cpu_runtime, cuda_runtime, opencl_runtime (Intel FPGA only)

1. make FLASH_VARIANT=[backend[,...]]            

ex. make FLASH_VARIANT=cpu_runtime,cuda_runtime                                  #enables CPU and CUDA backends 

Test build notes

Building applications requires the use of C++20 features; however, NVCC does not yet support C++20. The object file(s) that contain FLASH logic must be compiled with a C++20-enabled compiler, while the kernels must be compiled with NVCC.

Ex. Building main with C++20

g++ -c cuda_main.cc -o cuda_main.o -I./build/include -L./build/lib64 -lflash_wrapper -lcuda -std=c++2a

Building CUDA kernels

nvcc -arch=sm_50 -c cuda_kernels.cu -o cuda_kernels.o

Linking main with the CUDA object files:

nvcc -arch=sm_50 cuda_kernels.o cuda_main.o -o host.bin -I./build/include -L./build/lib64 -lflash_wrapper -lcuda

Building CPU unit test

CPU kernels in the form of free functions or member functions must be built with -rdynamic, -ldl, and -lpthread.

Member functions must be attributed with [[gnu::used]] if their only invocation is via the FLASH runtime; otherwise, the attribute can be disregarded. Free functions do not require the attribute (see the sketch after the struct example below).

Ex. of member function attribute

  struct TEST 
  {
    [[gnu::used]]
    void hello_world(){}    
    int i=0;
  };
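
A free-function kernel, by contrast, compiles without any attribute; -rdynamic alone keeps the symbol visible to the runtime's dynamic lookup (the kernel name below is hypothetical):

  // No [[gnu::used]] needed for free functions.
  void hello_world_free(){}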

The CPU runtime engine uses a single method for indicating which work item is currently being executed: an N-dimensional indexing system driven by the "defer" or "exec" interface, as enumerated below.

Ex. defer(dim1, dim2, dim3, ..., dimN) or exec(dim1, dim2, dim3, ..., dimN)    #each dim is of type size_t

defer( 3, 3, 3) with N=3 creates 27 total work items
1. {0, 0, 0}  10. {0, 0, 1}  19. {0, 0, 2}
2. {1, 0, 0}  11. {1, 0, 1}  20. {1, 0, 2}
3. {2, 0, 0}  12. {2, 0, 1}  21. {2, 0, 2}
4. {0, 1, 0}  13. {0, 1, 1}  22. {0, 1, 2}
5. {1, 1, 0}  14. {1, 1, 1}  23. {1, 1, 2}
6. {2, 1, 0}  15. {2, 1, 1}  24. {2, 1, 2}
7. {0, 2, 0}  16. {0, 2, 1}  25. {0, 2, 2}
8. {1, 2, 0}  17. {1, 2, 1}  26. {1, 2, 2}
9. {2, 2, 0}  18. {2, 2, 1}  27. {2, 2, 2}
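
For reference, the enumeration above (dimension 0 varying fastest) can be reproduced with plain nested loops; this sketch mirrors only the ordering, not FLASH's actual scheduler:

  #include <cstdio>
  #include <cstddef>

  int main()
  {
    std::size_t dims[3] = {3, 3, 3};  // defer(3, 3, 3)
    std::size_t n = 0;
    for (std::size_t d2 = 0; d2 < dims[2]; ++d2)      // dimension 2 varies slowest
      for (std::size_t d1 = 0; d1 < dims[1]; ++d1)
        for (std::size_t d0 = 0; d0 < dims[0]; ++d0)  // dimension 0 varies fastest
          std::printf("%zu. {%zu, %zu, %zu}\n", ++n, d0, d1, d2);
    return 0;
  }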

The kernels can make a call to size_t get_indices(int dim) to retrieve the current work-item index for a given dimension.

Ex. 
  void elementwise_matrix_multiplication( float * a, float * b, float * c)
  {
    auto x = get_indices(0); // get the current work-item index in dimension 0
    c[x] = a[x] * b[x];
    return;
  }

Example compilation

Ex. g++ main.cc -o host.bin -I./build/include -L./build/lib64 -lflash_wrapper -std=c++2a -ldl -lpthread -rdynamic

*Remember to point LD_LIBRARY_PATH to [dir]/build/lib64
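
For example (keeping the [dir] placeholder for your checkout directory):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[dir]/build/lib64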

Example unit test

#include <iostream>
#include <ranges>
// The FLASH headers under ./build/include provide KernelDefinition, kernel_t,
// RuntimeObj, flash_rt, and aligned_vector.

using MATMULT   = KernelDefinition<2, "elmatmult_generic", kernel_t::INT_BIN, float *, float * >;
using MATDIV    = KernelDefinition<2, "elmatdiv_generic",  kernel_t::INT_BIN, float*, float*>;
using MATDIV_T  = KernelDefinition<2, "TEST::elmatdiv_generic", kernel_t::INT_BIN, TEST*, float*, float*>;

int main(int argc, const char * argv[])                                                     
{                                                                                           
    //Design Patterns                                                                       
    // Lazy execution                                                                       
    // Builder                                                                              
    // Lookup                                                                               
    // Reflection                                                                           
    // Dynamic dispatching                                                                  
    // Self-registry factory                                                                
    size_t sz = 512;                                                                        
    TEST t1(33);                                                                            
                                                                                            
    auto chunk = aligned_vector<float>(6*sz, 2);                                            
    float * A = chunk.data(), *B = A + sz, *C = B + sz;                                     
    float * E = C + sz, *F = E + sz, *G = F + sz;                                           
                                                                                            
    RuntimeObj ocrt(flash_rt::get_runtime("ALL_CPU") , MATMULT{ argv[0] },                  
                    MATDIV{argv[0]} );                                                      
    //submit: defer(...) lazily queues the first kernel; exec(...) triggers execution of the chain
    ocrt.submit(MATMULT{}, A, B, C).sizes(sz,sz,sz).defer(32,1,1)
        .submit(MATDIV{},  C, F, G).sizes(sz,sz,sz).exec(32,1,1);
                                                                                                                                                                                   
    std::cout << "C = ";                                                                    
    for(auto i : std::views::iota(0,9) )                                                    
    {                                                                                       
      std::cout << C[i] << ",";                                                             
    }                                                                                       
    std::cout << C[9] << std::endl;
                                                                                            
    std::cout << "G = ";                                                                    
    for(auto i : std::views::iota(0, 9) )                                                   
    {                                                                                       
      std::cout << G[i] << ",";                                                             
    }                                                                                       
    std::cout << G[9] << std::endl;
                                                                                            
                                                                                            
    return 0;                                                                               
} 

The nbody and particle diffusion kernel unit tests can be compiled with the following command lines:

(CPU) nvcc cpu_main.cc cpu_kernels.cc -o particle-diffusion.cpu -std=c++11
      nvcc cpu_main.cc cpu_kernels.cc -o nbody.cpu -std=c++11

(GPU) nvcc cuda_main.cu cuda_kernels.cu -o particle-diffusion.gpu -std=c++11
      nvcc cuda_main.cu cuda_kernels.cu -o nbody.gpu -std=c++11