Pangolin

A header-only C++/CUDA library for GPU graph operations. C++ API Documentation.

Getting Started

Install CUDA

Most of pangolin must be compiled with nvcc, so a CUDA installation is required.

Install NUMA (optional)

sudo apt install libnuma-dev

Include the pangolin headers in your code and, somewhere before using pangolin, call pangolin::init():

#include "pangolin.hpp"
#include "pangolin.cuh"

int main(int argc, char **argv) {
    pangolin::init();
    // your code here.
}

Testing

mkdir -p build && cd build
cmake ..
make
make test

Library Features

GPU Graph Algorithms

  • Triangle Counting (a sketch of the edge-oriented binary-search approach appears after this list)
    • Anjur-Iyer: triangle counting algorithm created by Anjur and Iyer in ECE 508
    • Bisson-Fatica 2019: triangle counting algorithm from Bisson & Fatica
    • Edge-Binary: edge-oriented count with binary-search intersection
    • Edge-Dyn: edge-oriented count with warp-granularity dynamic load balancing and cost-based binary/sequential search selection
    • Edge-Dynpar: edge-oriented binary-search count with dynamic parallelism
    • Edge-Dysel: edge-oriented count with empirical performance measurement and algorithm selection
    • Edge-Linear: edge-oriented count with linear-search intersection
    • Edge-Merge: edge-oriented count with merge-path search
    • Task-GPU: 2D-decomposition count
    • Vertex-Bitvector: vertex-oriented bitvector search in shared memory
    • Vertex-Block-Binary: vertex-oriented binary intersection, one threadblock per row
    • Vertex-Blocks-Binary: vertex-oriented binary intersection, multiple threadblocks per row
      • Vertex-Blocks-Cache-Block-Binary:
    • Vertex-CPU: CPU counters
    • Vertex-Dyn: like Edge-Dyn, but vertex-oriented
    • Vertex-Warp-Bitvector: vertex-oriented count with warp-collaborative bit vectors
  • K-truss (note: all benchmark files that use these algorithms are found in the graph challenge project)
    • K-truss Incremental <ktruss_serial.cuh>: single GPU, starts from k=3 and continues until the max k is found. Physically removes edges from the graph.
    • K-truss Binary <ktruss_binary_HSC.cuh>: single GPU, searches for the max k using binary steps. Physically removes edges from the graph. HSC stands for hard stream compaction.
    • K-truss Incremental Multi GPU <ktruss_incremental_multiGPU>: multi GPU; contains only support functions, the real implementation is in the graph challenge project.
  • broadcast: include/pangolin/algorithm/broadcast.cuh
    • shared memory and shuffle instructions
    • warp-collaborative and block-collaborative
    • non-primitive types
  • elementwise: include/pangolin/algorithm/elementwise.cuh
    • Element-wise CSR matrix multiplication
  • Grid-collaborative Fill: include/pangolin/algorithm/fill.cuh
  • Load-Balance:
    • for a set of objects creating work items, determine which object created each work item
  • Merge-Path:
    • single-thread computation of the location where the merge path crosses a diagonal
  • Reduction: include/pangolin/algorithm/search.cuh
    • warp-collective sum/min/max with shuffle instructions (see the shuffle sketch after this list)
  • Search: include/pangolin/algorithm/search.cuh
    • single-thread binary search __device__ function: serial_sorted_search_binary
      • upper and lower bound
    • single-thread pointer-chasing linear search __device__ function: serial_sorted_search_linear

Get started at include/pangolin/algorithm.
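
To make the flavor of these kernels concrete, here is a minimal, self-contained CUDA sketch of edge-oriented triangle counting with binary-search intersection (the approach behind the Edge-Binary counter above). It is illustrative only, not pangolin's implementation; every name in it (edge_binary_tc, binary_contains, the CSR arrays) is hypothetical.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: count triangles by intersecting the adjacency lists of
// each edge's endpoints. Assumes a CSR adjacency with sorted neighbor lists
// (rowPtr, colInd) plus a matching edge list (edgeSrc, edgeDst); one thread
// handles one edge.
__device__ bool binary_contains(const int *arr, int lo, int hi, int key) {
  while (lo < hi) {
    int mid = (lo + hi) / 2;
    if (arr[mid] < key)
      lo = mid + 1;
    else if (arr[mid] > key)
      hi = mid;
    else
      return true;
  }
  return false;
}

__global__ void edge_binary_tc(unsigned long long *count, const int *rowPtr,
                               const int *colInd, const int *edgeSrc,
                               const int *edgeDst, int numEdges) {
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e >= numEdges)
    return;
  int u = edgeSrc[e], v = edgeDst[e];
  unsigned long long tris = 0;
  // For each neighbor w of u, binary-search for w in v's neighbor list.
  for (int i = rowPtr[u]; i < rowPtr[u + 1]; ++i)
    if (binary_contains(colInd, rowPtr[v], rowPtr[v + 1], colInd[i]))
      ++tris;
  atomicAdd(count, tris);
}

int main() {
  // Toy graph: one triangle 0-1-2, stored as a DAG (each edge u->v with u < v)
  // so every triangle is counted exactly once.
  int hRowPtr[] = {0, 2, 3, 3}, hColInd[] = {1, 2, 2};
  int hSrc[] = {0, 0, 1}, hDst[] = {1, 2, 2};
  int *rowPtr, *colInd, *src, *dst;
  unsigned long long *count, hCount = 0;
  cudaMalloc(&rowPtr, sizeof(hRowPtr));
  cudaMalloc(&colInd, sizeof(hColInd));
  cudaMalloc(&src, sizeof(hSrc));
  cudaMalloc(&dst, sizeof(hDst));
  cudaMalloc(&count, sizeof(hCount));
  cudaMemcpy(rowPtr, hRowPtr, sizeof(hRowPtr), cudaMemcpyHostToDevice);
  cudaMemcpy(colInd, hColInd, sizeof(hColInd), cudaMemcpyHostToDevice);
  cudaMemcpy(src, hSrc, sizeof(hSrc), cudaMemcpyHostToDevice);
  cudaMemcpy(dst, hDst, sizeof(hDst), cudaMemcpyHostToDevice);
  cudaMemcpy(count, &hCount, sizeof(hCount), cudaMemcpyHostToDevice);
  edge_binary_tc<<<1, 32>>>(count, rowPtr, colInd, src, dst, 3);
  cudaMemcpy(&hCount, count, sizeof(hCount), cudaMemcpyDeviceToHost);
  printf("triangles: %llu\n", hCount); // expect 1
  return 0;
}

The warp-collective shuffle reduction listed above boils down to a short loop; again a sketch of the technique, not pangolin's code:

#include <cuda_runtime.h>

// Illustrative warp-collective sum with shuffle instructions (CUDA 9+).
// After the loop, lane 0 of the warp holds the sum of all 32 lanes' values.
__device__ int warp_sum(int v) {
  for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffff, v, offset);
  return v;
}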

RcStream

A reference-counted cudaStream_t. Automatically create, share, and destroy a single cudaStream_t, analogous to a std::shared_ptr. Get started at include/pangolin/cuda_cxx/rc_stream.hpp.
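
Since RcStream behaves like std::shared_ptr for a stream, the underlying idea can be sketched with a shared_ptr and a custom deleter. This is the concept only, not pangolin's actual class or API:

#include <memory>
#include <cuda_runtime.h>

// Conceptual sketch only (not pangolin's RcStream): a cudaStream_t whose
// lifetime is managed by reference counting, like std::shared_ptr.
std::shared_ptr<CUstream_st> make_rc_stream() {
  cudaStream_t raw;
  cudaStreamCreate(&raw);
  return std::shared_ptr<CUstream_st>(raw, [](cudaStream_t s) {
    cudaStreamDestroy(s); // runs when the last owner releases the stream
  });
}

int main() {
  auto a = make_rc_stream();
  auto b = a; // share the same underlying cudaStream_t
  cudaStreamSynchronize(a.get());
  return 0;   // stream destroyed exactly once, when both a and b are gone
}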

Allocators

C++ stdlib allocators for CUDA device memory, CUDA host memory, and CUDA managed memory. Get started at include/pangolin/allocator.
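
The idea is that standard containers can own CUDA memory directly. As a sketch of the concept (the names here are illustrative, not pangolin's allocator classes), a minimal managed-memory allocator might look like:

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Illustrative only (not pangolin's API): a C++ Allocator that hands out
// CUDA managed memory, usable with standard containers.
template <typename T> struct ManagedAllocator {
  using value_type = T;
  ManagedAllocator() = default;
  template <typename U> ManagedAllocator(const ManagedAllocator<U> &) {}
  T *allocate(std::size_t n) {
    void *p = nullptr;
    cudaMallocManaged(&p, n * sizeof(T));
    return static_cast<T *>(p);
  }
  void deallocate(T *p, std::size_t) { cudaFree(p); }
};
template <typename T, typename U>
bool operator==(const ManagedAllocator<T> &, const ManagedAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const ManagedAllocator<T> &, const ManagedAllocator<U> &) { return false; }

int main() {
  // Elements are reachable from both host code and device code.
  std::vector<float, ManagedAllocator<float>> v(1024, 1.0f);
  return 0;
}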

Dense containers

Vectors and Buffers backed by a C++ allocator. Get started at include/pangolin/dense.

Sparse containers

CSR and CSR+COO sparse matrices backed by pangolin::Vector. Get started at include/pangolin/sparse.
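
For reference, this is the generic CSR layout these containers implement (a sketch of the format, not pangolin's exact classes):

#include <vector>

// Generic CSR layout (illustrative; pangolin's classes live in
// include/pangolin/sparse). For row i, the column indices of its nonzeros
// are colInd[rowPtr[i]] .. colInd[rowPtr[i+1]-1]. A CSR+COO hybrid also
// stores rowInd (length numNonZeros) so nonzeros can be enumerated directly.
struct Csr {
  std::vector<int> rowPtr; // length numRows + 1
  std::vector<int> colInd; // length numNonZeros
};

// Example: adjacency of a triangle 0-1-2 (full, undirected):
// row 0 -> {1,2}, row 1 -> {0,2}, row 2 -> {0,1}
// rowPtr = {0, 2, 4, 6}, colInd = {1, 2, 0, 2, 0, 1}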

Buffers and Queues

  • Double Buffer: double_buffer.hpp, a fixed-size single-producer/single-consumer double buffer
  • Bounded Buffer: bounded_buffer.hpp, a fixed-size blocking multi-producer/multi-consumer ring buffer (a sketch of the idea follows this list)
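
As a sketch of what the bounded buffer provides (illustrative only, not pangolin's bounded_buffer.hpp), a fixed-size blocking multi-producer/multi-consumer queue:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Illustrative only: producers block when the buffer is full, consumers
// block when it is empty.
template <typename T> class BoundedBuffer {
  std::queue<T> q_;
  std::size_t cap_;
  std::mutex m_;
  std::condition_variable notFull_, notEmpty_;

public:
  explicit BoundedBuffer(std::size_t cap) : cap_(cap) {}
  void push(T v) {
    std::unique_lock<std::mutex> l(m_);
    notFull_.wait(l, [&] { return q_.size() < cap_; });
    q_.push(std::move(v));
    notEmpty_.notify_one();
  }
  T pop() {
    std::unique_lock<std::mutex> l(m_);
    notEmpty_.wait(l, [&] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    notFull_.notify_one();
    return v;
  }
};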

System Topology Exploration

Built on top of numa and nvidia-ml: query the system topology to discover which GPUs, CPUs, and NUMA regions are associated with one another. Get started at include/pangolin/topology.
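
For context, here is how GPU/CPU affinity can be queried with raw NVML (plain nvidia-ml calls; pangolin's topology utilities wrap this kind of query):

#include <cstdio>
#include <nvml.h>

// Illustrative only: ask NVML which CPUs are near each GPU.
int main() {
  nvmlInit();
  unsigned int n;
  nvmlDeviceGetCount(&n);
  for (unsigned int i = 0; i < n; ++i) {
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(i, &dev);
    unsigned long cpuSet[16] = {0}; // bitmask of CPUs near this GPU
    nvmlDeviceGetCpuAffinity(dev, 16, cpuSet);
    printf("GPU %u: first CPU-affinity word 0x%lx\n", i, cpuSet[0]);
  }
  nvmlShutdown();
  return 0;
}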

Controlling Logging

pangolin::logger::set_level(pangolin::logger::Level::ERR);

Allowed values are TRACE, DEBUG, INFO, WARN, ERR, CRITICAL.

API Documentation

API documentation is available at pangolin-docs.netlify.com.

Building Pangolin from Source

Prerequisites

| Dockerfile | CPU | CUDA | C++ | CMake | Builds |
|-|-|-|-|-|-|
| test_cuda92-ubuntu1804.Dockerfile | amd64 | 9.2 | g++ 7.3.0 | 3.11.0 | |
| test_cuda100-ubuntu1804.Dockerfile | amd64 | 10.0 | g++ 7.3.0 | 3.11.0 | |
| - | amd64 (Ubuntu 16.04) | 10.0.130 | g++ 5.4.0 | 3.14.3 | |
| - | POWER9 | 9.2.148 | clang 5.0.0 | 3.12.0 | |
| - | POWER9 | 9.2.148 | g++ ??? | 3.12.0 | ??? |
| test_cuda80-ubuntu1404.Dockerfile | amd64 | 8.0.61 | g++ 4.8.4 | 3.11.0 | |
| test_cuda80-ubuntu1404-clang38.Dockerfile | amd64 | 8.0.61 | clang 3.8.0 | 3.11.0 | (needs check) x: problem parsing Vector |
| test_cuda80-ubuntu1604.Dockerfile | amd64 | 8.0.61 | g++ 5.4.0 | 3.11.0 | (needs check) x: problem parsing Vector |
| test_cuda92_ubuntu1604-clang5.Dockerfile | amd64 | 9.2.148 | clang 5.0.0 | 3.11.0 | x: problem with simd intrinsics |
| - | amd64 | 9.2.148 | g++ 5.4.1 | 3.13.3 | x: problem with std::to_string in catch2 |
  1. Install CUDA

Instructions for installing CUDA on supported systems may be obtained from Nvidia's website.

  2. Install CMake 3.13+

On x86 Linux, CMake provides prebuilt binaries via a shell script. On POWER, you will need to build CMake from source; it must be built with SSL support. You can check your CMake version with cmake --version.

  3. (Optional) Install Doxygen/Graphviz for API documentation

Install doxygen and graphviz

sudo apt install doxygen graphviz

If doxygen is installed, building pangolin will also create API documentation.

Building on Ubuntu/Debian

Pangolin is a header-only library, but you can still build and run the tests:

mkdir -p build
cd build
cmake ..
make
make test

Using Pangolin in another CMake project

See Pangolin_Example for an example.

Pangolin may be used with CMake add_subdirectory(), or installed and used with CMake find_package(pangolin CONFIG). Pangolin exports the CMake pangolin::pangolin target.

As a Subdirectory

  1. Add Pangolin as a git submodule
git submodule add https://github.com/c3sr/pangolin.git thirdparty/pangolin
cd thirdparty/pangolin
git checkout <hash, branch, etc>
  2. Put add_subdirectory(...) in your CMakeLists
# CMakeLists.txt
add_subdirectory(thirdparty/pangolin)
...
target_link_libraries(... pangolin::pangolin)

As an external Library

Pangolin is a header-only library, so "installation" is a matter of copying pangolin's headers to a desired location. Pangolin also includes a CMake config file for easy integration with other CMake projects.

  1. Clone and install pangolin to the desired location
git clone https://github.com/c3sr/pangolin.git
mkdir pangolin/build && cd pangolin/build
cmake .. -DCMAKE_INSTALL_PREFIX=<something>
make install
  2. Use the -DCMAKE_PREFIX_PATH CMake option and find_package(pangolin CONFIG REQUIRED) in your CMakeLists
# CMakeLists.txt
find_package(pangolin CONFIG REQUIRED)
...
target_link_libraries(... pangolin::pangolin)

Running tests

Tests can be built and run with

make
make test

Most tests require a GPU (those tests have the gpu label). Some tests require MPI (those tests have the mpi label).

ctest -LE "gpu" # run tests that do not require a GPU
ctest -L "mpi" # run tests that require MPI

To run individual tests, you can do something like

make
test/test_csr

Continuous Integration

We automatically build and test the following configurations.

| CI Platform | CUDA | NUMA | MPI | Build | Test |
|-|-|-|-|-|-|
| Azure Pipelines | 10.1 | Yes | Yes | Yes | Yes |
| Azure Pipelines | 10.1 | No | Yes | Yes | Yes |
| Azure Pipelines | 10.1 | Yes | No | Yes | non-mpi |
| Azure Pipelines | 10.1 | No | No | Yes | non-mpi |

Profiling

CUDA metrics guide

GPUs with CC <= 7.2

Non-interactive profiling is done by creating two different profiling files: a timeline file, and a metrics file. These are created with two separate invocations of nvprof:

nvprof -o timeline.nvvp -f ./<exe> ...
nvprof -o metrics.nvvp -f --analysis-metrics ./<exe>

These files can be opened in nvvp.

File >> Import >> Nvprof >> Single process

Timeline data file: timeline.nvvp
Event/Metric data files: metrics.nvvp

GPUs with CC > 7.2

On GPUs with CC > 7.2, some version of Nsight needs to be used. Either open the NVIDIA Nsight Compute profiler and profile interactively, or generate a report and import it:

/usr/local/cuda/NsightCompute-1.0/nv-nsight-cu-cli -o profile -f ...

or

/usr/local/cuda/NsightCompute-1.0/nv-nsight-cu&

File > Open File > profile.nsight-cuprof-report

Interactive Timeline

The Nvidia Nsight Eclipse Edition can generate a timeline

  • Open Nsight Eclipse Edition

On Ubuntu 18.04, there may be a conflict with the installed Java runtime (usually openjdk-11).

sudo apt install openjdk-8-jre

Then add the path to the java8 runtime to the top of /usr/local/cuda/libnsight/nsight.ini like so

-vm
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin
  • Run >> Profile Configurations
  • Select "C/C++ Application" in the left panel
  • Press the 'New' button near the top-left to create a new profiling configuration
  • Give it a name near the top middle if you want
  • Main tab >> Put the path to the CUDA binary in "C/C++ Application"
  • Arguments tab >> put the command line arguments to the binary here

Non-interactive Timeline

Generate a timeline with nvprof

nvprof -o timeline.nvvp -f ./mybin

  • Open Nsight Eclipse Edition
  • File >> Import
    • Select CUDA >> Nvprof
    • Put the timeline file in the timeline file box
    • Uncheck Use fixed width segments for unified memory timeline

Interactive Detailed Profile

  • Open "NVIDIA Nsight Compute"
  • "Create New Project" >> cancel
  • "Connect" in top-left
    • Put the path to the application in "Application Executable"
    • Put the arguments in "Command Line Arguments"
    • Enable NVTX Support = Yes
  • Launch
  • Profile >> auto-profile

Profiler Report Source Page

Measuring Divergence: with no divergence, "predicated-on thread instructions executed" should be 32x "instructions executed".

"instructions executed": the number of times an instruction was executed by a warp.

"predicated-on thread instructions executed": the number of times the instruction was executed by any active, predicated-on thread. For instructions that are executed unconditionally (i.e. without a predicate), this is the number of active threads in the warp multiplied by the respective "instructions executed" value.

For example, a kernel that reports 1,000 instructions executed and 24,000 predicated-on thread instructions executed averaged 24 active threads out of 32, indicating divergence.

Non-interactive Detailed Profile

Generate a profile using something like /usr/local/cuda/NsightCompute-1.0/nv-nsight-cu-cli -o profile ./my-binary

  • Open "NVIDIA Nsight Compute"
  • File >> Open >> <the profile file you generated>

Other

See references for some notes on related work.

Publications

@INPROCEEDINGS{8916285,
  author={M. {Almasri} and O. {Anjum} and C. {Pearson} and Z. {Qureshi} and V. S. {Mailthody} and R. {Nagi} and J. {Xiong} and W. {Hwu}},
  booktitle={2019 IEEE High Performance Extreme Computing Conference (HPEC)},
  title={Update on k-truss Decomposition on GPU},
  year={2019},
  pages={1-7},
}
@INPROCEEDINGS{8916547,
  author={C. {Pearson} and M. {Almasri} and O. {Anjum} and V. S. {Mailthody} and Z. {Qureshi} and R. {Nagi} and J. {Xiong} and W. {Hwu}},
  booktitle={2019 IEEE High Performance Extreme Computing Conference (HPEC)},
  title={Update on Triangle Counting on GPU},
  year={2019},
  pages={1-7},
}
@INPROCEEDINGS{8547517,
  author={V. S. {Mailthody} and K. {Date} and Z. {Qureshi} and C. {Pearson} and R. {Nagi} and J. {Xiong} and W. {Hwu}},
  booktitle={2018 IEEE High Performance Extreme Computing Conference (HPEC)},
  title={Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition},
  year={2018},
  pages={1-7},
}