/kifmm--hybrid--double-only

CPU-GPU hybrid implementation of KIFMM in double-precision only

Primary LanguageC++BSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

KIFMM benchmark

This repository contains CPU+GPU benchmark implementations of the kernel-independent fast multipole method (KIFMM), which was used most recently in the following papers:

  • Jee Choi, Aparna Chandramowlishwaran, Kamesh Madduri, and Richard Vuduc. "A CPU-GPU hybrid implementation and model-driven scheduling of the fast multipole method." In Proceedings of the 7th Workshop on General-Purpose Processing using GPUs (GPGPU-7), Salt Lake City, UT, USA, March 2014. doi:10.1145/2576779.2576787

  • Aparna Chandramowlishwaran, Jee Choi, Kamesh Madduri, and Richard Vuduc. "Brief announcement: Towards a communication optimal fast multipole method and its implications at exascale." In Proceedings of the 24th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'12), Pittsburgh, PA, USA, June 25-27, 2012. doi:10.1145/2312005.2312039.

See Other notes, below, for more information.

Feel free to post issues or pull requests on GitHub: https://github.com/jeewhanchoi/kifmm--hybrid--double-only

Compiling

From the root directory, run:

    make clean; make

The build will generate several instances of the benchmark as separate executables (see Description below).

Executing

    fmmd--{naive,omp,omp_sse,omp_sse_block,cuda,hybrid1,hybrid2,hybrid3} \
	    <# pts> {uniform,ellipseUniformAngles} <# pts-per-box>

Description

This benchmark includes CPU, GPU, and CPU+GPU hybrid implementations for the fast multipole method (FMM) in double-precision.

The benchmark generates the follwing executables,

  • fmmd--naive: Baseline sequential code
  • fmmd--omp: OpenMP parallelized code
  • fmmd--omp_sse: OpenMP parallelized + SIMD vectorized code
  • fmmd--omp_sse_block: OpenMP parallelized + SIMD vectorized + Blocking (translation vector) + Blocking (up)
  • fmmd--cuda: CUDA code
  • fmmd--hybrid1: U-list on GPU; up, V-list, down on CPU
  • fmmd--hybrid2: U-list on CPU; up, V-list, down on GPU
  • fmmd--hybrid3: A optimal schedule for non-uniform distributions on hybrid CPU-GPU systems

Environment Variables

  • NUMA-aware memory allocation can be set/unset using the environment variable, NUMA. Default = yes.

Note: When doing NUMA-aware memory allocation, threads must be pinned appropriately by, for instance, using the appropriate environment variable, e.g., KMP_AFFINITY (icc) or GOMP_ AFFINITY (gcc). Example:

    export KMP_AFFINITY=granularity=fine,compact,1,0,verbose # Without hyperthreading; or:
    export KMP_AFFINITY=granularity=fine,compact,verbose # with hyperthreading
  • Number of threads can be varied by changing the environment variable, OMP_NUM_THREADS. Default = Max # of threads.

  • When blocking is enabled, translation block size can be set by the environment variable, BS. Default = 49.

  • When blocking is enabled, up block size can be set by the environment variable, UPBS. Default = 1000.

  • Accuracy can be varied by changing the environment variable, NP. Default = 6.

On the GPU, only 3 precision are supported (NP=3, NP=4, and NP=6). You must also set #define NP_(X) in cuda.cu to 1 (all others should be set to 0) and re-compiled in order for the GPU version to work. Lastly, env NP=<X> must also be set as the GPU code uses CPU's tree construction code.

  • The error is computed by taking a random sample. This paramter can be varied by changing the environment variable, NV. Default = 1000.

Example usage

	# with NP_6 set to 1, GPU FMM with uniform 
    # distribution and 6 digits of precision:
    ./fmmd--cuda 4194304 uniform 512
	
	# with NP_4 set to 1, GPU FMM with uniform
	# distribution and 4 digits of precision:
    env NP=4 ./fmmd--cuda 1048576 uniform 512
	
	# Most optimized CPU FMM with 3 digits of
	# precision:
    env NP=3 ./fmmd--omp_sse_block 1048576 uniform 128

Other notes

  • By default, for the GPU version, all data structures are allocated in the memory prior to execution. However, due to CUFFT taking too much memory, for large number of points and/or higher precision, there is support to allocate data as they are needed. This can be automatically turned on and off setting the MIN_DATA definition found in partial.h to 1 or 0.

  • Our code derives its main algorithmic ideas from KIFMM3d, which was developed by others: http://cs.nyu.edu/~harper/kifmm3d/documentation/index.html . The focus of our implementation is single-node performance optimization.

Contributors