CombBLAS: A C++ repository from lacrymose

This is the development repository of Combinatorial BLAS.

Copyright

Combinatorial BLAS, Copyright (c) 2020, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and University of California, Santa Barbara. All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office.

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit other to do so.

This material is based upon work supported by the National Science Foundation under Grant No. 0709385 and by the Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF) and the Department of Energy (DOE). This software is released under the following license.

Introduction

The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics.

The Combinatorial BLAS development influences the Graph BLAS standardization process.
It achieves scalability via its two dimensional distribution and coarse-grained parallelism.
For an illustrative overview, check out these slides.
CombBLAS powers HipMCL, a highly-scalable parallel implementation of the Markov Cluster Algorithm (MCL).
Operations among sparse matrices and vectors use arbitrary user defined semirings. Here is a semiring primer
Check out the Frequently asked questions about CombBLAS.

Download

Just run git clone https://github.com/PASSIONLab/CombBLAS.git for the latest code
NERSC users read this
The old CMake'd tarball (version 1.6.2, April 2018) here. An even earlier version (version 1.6.1, Jan 2018) is also available here for backwards compatibility and benchmarking. Read release notes.
Installation and testing can be done by executing these commands within the CombBLAS directory:
1. mkdir _build
2. mkdir _install
3. cd _build
4. cmake .. -DCMAKE_INSTALL_PREFIX=../_install
5. make
6. make install
7. ctest -V (you need the testinputs, see below)

If running on a Mac, we recommend using gcc compilers instead of clang (which has issues with OpenMP). For that, all you need to do is to replace step (4) above with

cmake .. -DCMAKE_INSTALL_PREFIX=../_install -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++

Test inputs are separately downloadable here. Extract them inside the _build directory you've just created with the command "tar -xzvf testdata_combblas1.6.1.tgz"

Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the ReleaseTests and Applications directories. Those sample makefiles have the following format: makefile-machine. (example: makefile-macair)
The CMake now automatically compiles for hybrid MPI+OpenMP mode because almost all expensive primitives are now multithreaded. Example makefiles are also multithreaded for many cases. You just need to make sure that your OMP_NUM_THREADS environmental variable is set to the right value for the configuration you are running and you are not oversubscribing or undersubscribing cores.
At this point, you can incorporate CombBLAS into your own code by linking against the contents of the _install/lib directory and including the header _install/include/CombBLAS/CombBLAS.h. If you need an example, SuperLU_Dist does that

Requirements: You need a recent C++ compiler (gcc version 4.8+, Intel version 15.0+ and compatible), a compliant MPI implementation, and C++11 Standard library (libstdc++ that comes with g++ has them). The recommended tarball uses the CMake build system, but only to build the documentation and unit-tests, and to automate installation. The chances are that you're not going to use any of our sample applications "as-is", so you can just modify them or imitate their structure to write your own application by just using the header files. There are very few binary libraries to link to, and no configured header files. Like many high-performance C++ libraries, the Combinatorial BLAS is mostly templated.

Documentation: This is a beta implementation of the Combinatorial BLAS Library in written in C++ with MPI and OpenMP for parallelism. It is purposefully designed for distributed memory platforms though it also runs in uniprocessor and shared-memory (such as multicores) platforms. It contains efficient implementations of novel data structures/algorithms as well as reimplementations of some previously known data structures/algorithms for convenience. More details can be found in the accompanying paper [1]. One of the distinguishing features of the Combinatorial BLAS is its decoupling of parallel logic from the sequential parts of the computation, making it possible to implement new formats and plug them in without changing the rest of the library.

For I/O purposes, the implementation supports both a tuples format very similar to the Matrix Market and the Matrix Market format itself. We recommend using the Matrix Market version and associated ParallelReadMM() functions. We encourage in-memory generators for faster benchmarking.

The main data structure is a distributed sparse matrix ( SpParMat <IT,NT,DER> ) which HAS-A sequential sparse matrix ( SpMat <IT,NT> ) that can be implemented in various ways as long as it supports the interface of the base class (currently: SpTuples, SpCCols, SpDCCols).

For example, the standard way to declare a parallel sparse matrix A that uses 32-bit integers for indices, floats for numerical values (nonzeros), SpDCCols <int,float> for the underlying sequential matrix operations is:

SpParMat<int, float, SpDCCols<int,float> > A;

Sparse and dense vectors are distributed along all processors. This is very space efficient and provides good load balance for SpMSV (sparse matrix-sparse vector multiplication).

New since version 1.6:

Connected components in distributed memory, found in Applications/CC.h [14,15], compile with "make cc" in that folder. Usage self explanatory (just try ./cc without any parameters to get usage)
Incorporation of much faster shared-memory hash SpGEMM implementation [16] from Yusuke Nagasaka
Initial CUDA support (for HipMCL initially [17]) for sparse matrix-matrix multiplication
3D process grid support for reducing communication in sparse matrix-matrix multiplication [18]

New in version 1.6:

In-node multithreading enabled for many expensive operations.
Fully parallel text-file reader for vectors (FullyDistSpVec::ParallelReadMM() and FullyDistVec::ParallelReadMM())
Fully parallel text-file writer for vectors (FullyDistSpVec::ParallelWrite () and FullyDistVec::ParallelWrite())
Reverse Cuthill-McKee (RCM) ordering implementation. Please cite [12] if you use this implementation
Novel multithreaded SpGEMM and SpMV (with sparse vectors) algorithms are integrated with the rest of CombBLAS.
- For benchmarking multithreaded SpMV with sparse vectors, go to Applications/SpMSpV-IPDPS2017 directory and use the code there.
- Please cite [13] if you use the new multithreaded SpMV with sparse vectors.
Extended CSC support
Previously deprecated SpParVec and DenseParVec (that were distributed to diagonal processors only) classes are removed.
Lots of more bug fixes

New in version 1.5:

Fully parallel matrix market format reader (SpParMat::ParallelReadMM())
Complete multithreading support, including SpGEMM (previously it was solely SpMV), enabled by -DTHREADED during compilation
Experimental 3D SpGEMM (the ability to switch processor grids from 2D to 3D will have to wait for version 1.6)
- Cite [9] if you use this implementation
- cd 3DSpGEMM/, make test_mpipspgemm, and call the executable with correct parameters
Maximal and Maximum cardinality matching algorithms on bipartite graphs
- Cite [10] for maximal cardinality and [11] for maximum cardinality matching
- cd MaximumMatching, make bpmm, and call the executable with correct parameters
Automated MPI_Op creation from simple C++ function objects (simplifies semiring descriptions and Reduce() functions)
FullyDistSpVec::Invert() to map from/to (integer) values to/from indices
Many more helper functions
Experimental CSC support for low concurrencies
Lots of bug fixes

New in version 1.4:

Direction optimizing breadth-first search in distributed memory (in collaboration with Scott Beamer and GAP). Please cite [8] if you use this code in your research or benchmarks (DirOptBFS.cpp).

The supported operations (a growing list) are:

Sparse matrix-matrix multiplication on a semiring SR: PSpGEMM()
Elementwise multiplication of sparse matrices (A .* B and A .* not(B) in Matlab): EWiseMult()
Unary operations on nonzeros: SpParMat::Apply()
Matrix-matrix and matrix-vector scaling (the latter scales each row/column with the same scalar of the vector)
Reductions along row/column: SpParMat::Reduce()
Sparse matrix-dense vector multiplication on a semiring, SpMV()
Sparse matrix-sparse vector multiplication on a semiring, SpMV()
Generalized matrix indexing: SpParMat::operator(const FullyDistVec & ri, const FullyDistVec & ci)
Generalized sparse matrix assignment: SpParMat::SpAsgn (const FullyDistVec & ri, const FullyDistVec &ci, SpParMat & B)
Numeric type conversion through conversion operators
Elementwise operations between sparse and dense matrices: SpParMat::EWiseScale() and operator+=()
BFS specific optimizations inside BFSFriends.h

All the binary operations can be performed on matrices with different numerical value representations. The type-traits mechanism will take care of the automatic type promotion, and automatic MPI data type determination.

Some features it uses:

templates (for generic types, and for metaprogramming through "curiously recurring template pattern")
operator overloading
compositors (to avoid intermediate copying)
standard library whenever possible
Reference counting using shared_ptr for IITO (implemented in terms of) relationships
As external code, it utilizes
- sequence heaps of Peter Sanders .
- a modified (memory efficient) version of the Viral Shah's PSort .
- a modified version of the R-MAT generator from Graph 500 reference implementation

Important Sequential classes:

SpTuples : uses triples format to store matrices, mostly used for input/output and intermediate tasks (such as sorting)
SpDCCols : implements Alg 1B and Alg 2 [2], holds DCSC.
SpCCols : implements CSC

Important Parallel classes:

SpParMat : distributed memory MPI implementation
Each processor locally stores its submatrix (block) as a sequential SpDCCols object
Uses a polyalgorithm for SpGEMM: For most systems this boils down to a BSP like Sparse SUMMA [3] algorithm.
SpParMat3D : sparse matrix distributed in 3D process grid
FullyDistVec : dense vector distributed to all processors
FullyDistSpVec: : sparse vector distributed to all processors

Applications implemented using Combinatorial BLAS:

BetwCent.cpp : Betweenness centrality computation on directed, unweighted graphs. Download sample input here .
TopDownBFS.cpp: A conformant implementation of the Graph 500 benchmark using the traditional top-down BFS.
DirOptBFS.cpp: A conformant implementation of the Graph 500 benchmark using the faster direction-optimizing BFS.
FilteredMIS.cpp: Filtered maximal independent set calculation on ER graphs using Luby's algorithm.
FilteredBFS.cpp: Filtered breadth-first search on a twitter-like data set.
BipartiteMatchings/BPMaximalMatching.cpp: Maximal matching algorithms on bipartite graphs [10]
BipartiteMatchings/BPMaximumMatching.cpp: Maximum matching algorithm on bipartite graphs [11]
Ordering/RCM.cpp: Reverse Cuthill-McKee ordering on distributed memory [12]
CC.cpp: Linear-algebraic connected components [14, 15]
MCL3D.cpp: HipMCL using 3D process grid

Performance results of the first two applications can be found in the design paper [1]; Graph 500 results are in a recent BFS paper [4]. The most recent sparse matrix indexing, assignment, and multiplication results can be found in [5]. Performance of filtered graph algorithms (BFS and MIS) are reported in [7]. Performance of the 3D SpGEMM algorithm can be found in [9]

A subset of test programs demonstrating how to use the library (under ReleaseTests):

TransposeTest.cpp : File I/O and parallel transpose tests
MultTiming.cpp : Parallel SpGEMM tests
IndexingTest.cpp: Various sparse matrix indexing usages
ParIOTest.cpp Parallel reading of arbitrary labeled tuples with SpParMat::ReadGeneralizedTuples()
ReadWriteMtx.cpp: Parallel matrix-market I/O
VectorIO.cpp: Parallel Vector I/O
GenWriteMatrix.cpp: Parallel generating and writing of Kronecker graphs
SpAsgnTiming.cpp: Sparse matrix assignment usage and timing.
FindSparse.cpp : Parallel find/sparse routines akin to Matlab's.
GalerkinNew.cpp : Graph contraction or restriction operator (used in Algebraic Multigrid).
Application/SpGEMM3D.cpp : Using sparse matrix-matrix multiplication using 3D process grid.

Citation: Please cite the design paper [1] if you end up using the Combinatorial BLAS in your research.

[1] Aydin Buluc and John R. Gilbert, The Combinatorial BLAS: Design, implementation, and applications . International Journal of High Performance Computing Applications (IJHPCA), 2011. Preprint , Link
[2] Aydin Buluc and John R. Gilbert, On the Representation and Multiplication of Hypersparse Matrices . The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, April 14-18, 2008
[3] Aydin Buluc and John R. Gilbert, Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication . The 37th International Conference on Parallel Processing (ICPP 2008), Portland, Oregon, USA, 2008
[4] Aydin Buluc and Kamesh Madduri, Parallel Breadth-First Search on Distributed-Memory Systems . Supercomputing (SC'11), Seattle, USA. Extended preprint PDF
[5] Aydin Buluc and John R. Gilbert. Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments . SIAM Journal of Scientific Computing, 2012. Preprint PDF
[6] Aydin Buluc. Linear Algebraic Primitives for Computation on Large Graphs . PhD thesis, University of California, Santa Barbara, 2010. PDF
[7] Aydin Buluc, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams. High-Productivity and High-Performance Analysis of Filtered Semantic Graphs , International Parallel and Distributed Processing Symposium (IPDPS), 2013. PDF
[8] Scott Beamer, Aydin Buluc, Krste Asanovic, and David Patterson. Distributed memory breadth-first search revisited: Enabling bottom-up search. In Workshop on Multithreaded Architectures and Applications (MTAAP), in conjunction with IPDPS. IEEE Computer Society, 2013. PDF
[9] Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, and Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), 38(6):C624-C651, 2016. PDF
[10] Ariful Azad and Aydin Buluc. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. In IEEE International Conference on Cluster Computing (CLUSTER), 2015. PDF
[11] Ariful Azad and Aydin Buluc. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016. PDF
[12] Ariful Azad, Mathias Jacquelin, Aydin Buluc, and Esmond G. Ng. The reverse Cuthill-McKee algorithm in distributed-memory. In Proceedings of the IPDPS, 2017. PDF
[13] Ariful Azad and Aydin Buluc. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In Proceedings of the IPDPS, 2017. PDF
[14] Yongzhe Zhang, Ariful Azad, and Aydin Buluc. "Parallel algorithms for finding connected components using linear algebra." Journal of Parallel and Distributed Computing (2020). PDF
[15] Ariful Azad and Aydin Buluc. LACC: a linear-algebraic algorithm for finding connected components in distributed memory. In Proceedings of the IPDPS, 2019. PDF
[16] Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, and Aydin Buluc. "Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors." Parallel Computing 90 (2019): 102545 . PDF
[17] Oguz Selvitopi, Md Taufique Hussain, Ariful Azad, and Aydin Buluc. Optimizing high performance Markov clustering for pre-exascale architectures. In Proceedings of the IPDPS, 2020 . PDF
[18] Md Taufique Hussain, Oguz Selvitopi, Aydin Buluc, Ariful Azad. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. . PDF

lacrymose/CombBLAS