DPSL is a distributed, partitioning-based implementation of the Parallel Shortest-distance Labelling (PSL) algorithm. DPSL can run on multiple nodes, distributing the graph and the labels among them. This yields both faster indexing times, by allowing the use of more computational resources, and lower storage requirements per node, since the resulting labels are distributed with minimal duplication.
Disclaimer 1: This work is designed specifically for social networks, web graphs and other low-diameter graphs. It will not work well on road networks, and with certain features turned on it may even give incorrect results on high-diameter graphs due to overflows. On such graphs, we suggest using `ORDER_METHOD=b_cent COMP_LVL=1 USE_BP=false`. Note that there is a limit on the maximum distance.
Disclaimer 2: This work is designed to work with connected graphs. It may still work on heavily disconnected graphs, but this is not recommended, especially with the compression & elimination methods enabled. In such cases the graph can be split into its components and each one processed separately.
- GNU Make (tested with version 4.2.1)
- g++ (tested with versions 7.0.0 and 9.3.0)
- (For DPSL only) openmpi (with the `mpic++` compiler and the `mpirun` tool) (tested with version 3.0.0)
- (For DPSL only) mtmetis (`.a` and `.h` files)
- (For GPSL only) Nvidia CUDA (with the `nvcc` compiler) (tested with version 11.2)
For DPSL, we need several partitioning libraries. Currently these are pulp, metis, mtmetis and (optionally) mt-kahypar.
On most Linux systems, you can use the `scripts/get_deps.sh` script to obtain and build them automatically. Please run the script from the project root:

```sh
bash scripts/get_deps.sh
```
Otherwise you will need to build them manually and place the `.a`, `.so` and `.h` files in the `libs` directory. We suggest building both metis and mtmetis as shared libraries; otherwise they may cause linker issues.
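As a quick sanity check before building, you can verify that the manually built artifacts actually ended up in `libs`. The file names below are only illustrative assumptions; adjust them to whatever your metis/mtmetis builds produce.

```python
# Sketch: check that manually built dependency files exist in the libs
# directory. The expected file names are assumptions for illustration,
# not the exact names this project requires.
from pathlib import Path


def missing_dep_files(libs_dir, expected):
    """Return the subset of expected files not present in libs_dir."""
    libs = Path(libs_dir)
    return [name for name in expected if not (libs / name).exists()]


# Hypothetical example file names:
expected = ["libmetis.so", "libmtmetis.so", "metis.h", "mtmetis.h"]
# missing = missing_dep_files("libs", expected)
# if missing: print("missing:", missing)
```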
Mt-kahypar is an optional dependency, as it requires Boost (their minimal Boost option did not work on our end). To enable support for it, pass `MT_KAHYPAR=DEF` or `MT_KAHYPAR=QUAL` to `make`, depending on which preset you want.

The aforementioned script can also download and build mt-kahypar automatically; just pass the `--mtkahypar` option.

Note that the mt-kahypar repo is actively updated, sometimes with breaking changes. We suggest using the following commit for this program: f21a4195f66f1a63b252482cdf819ae0e8acdcd0.
Basic syntax:

```sh
make -B <binary> <options>
```
This project currently includes 3 different binaries that can be built:

- `dpsl`: The main result of the project, the distributed PSL implementation
- `psl`: A reimplementation of the original PSL algorithm. According to our tests this version is faster than the original. It also provides several quality-of-life features, like being able to read different file formats.
- `gpsl`: An experimental GPU implementation (currently WIP)
Please see the `Makefile` for the available options. Note that options marked experimental are not guaranteed to work.
Some of the options for `ORDER_METHOD` are estimated values, calculated using a multi-threaded approach that may produce slightly different results from run to run. As a result, certain metrics like memory usage may vary when they are used. Furthermore, the graph is assumed to be connected for many of these calculations, and the program may perform poorly if it is not. For precise measurements, we suggest sticking to `degree` ordering.
- Building PSL with 16 threads and BP turned off:

```sh
make -B psl NUM_THREADS=16 USE_BP=false
```

- Building DPSL with degree ordering and 32 BP roots:

```sh
make -B dpsl ORDER_METHOD=degree N_ROOTS=32
```
The PSL binary takes a single argument: the graph file to be used. The file can be in edge-list or Matrix Market format.

```sh
./psl <graph_file>
```
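As a hedged illustration of the edge-list format, the snippet below writes a toy graph with one "u v" pair per line. Whether vertex ids are 0- or 1-based and whether both edge directions must be listed is an assumption here; check the repository's sample inputs to be sure.

```python
# Sketch: produce a small edge-list graph file, one edge per line.
# The 0-based, one-direction-per-edge convention is an assumption.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle


def write_edge_list(path, edges):
    with open(path, "w") as f:
        for u, v in edges:
            f.write(f"{u} {v}\n")


# write_edge_list("toy.graph", edges)   # then: ./psl toy.graph
```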
```sh
mpirun --bind-to none -n <node_count> ./dpsl <graph_file> <part_file>
```

- `--bind-to none`: Ensures that all the cores on each node are available to the program.
- `-n <node_count>`: The number of nodes to run the program on. It should match the number of partitions in the `<part_file>`.
- `<graph_file>`: Same as PSL.
- `<part_file>`: Each line of this file should contain a single integer: the partition id of the corresponding vertex. Outputs from `metis` or `mtmetis` should work fine here. As mentioned before, the number of partitions should match the number of nodes.
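Since a mismatch between the node count and the partition count is easy to miss, the sketch below checks a partition file before launching. It assumes metis-style output, where partition ids are consecutive integers starting from 0; that convention is an assumption, not something this document specifies.

```python
# Sketch: consistency check between a partition file (one integer
# partition id per line) and the -n value passed to mpirun.
# Assumes metis-style 0-based, consecutive partition ids.
def check_part_file(path, node_count):
    with open(path) as f:
        parts = [int(line) for line in f if line.strip()]
    ids = set(parts)
    if ids != set(range(node_count)):
        raise ValueError(
            f"expected partition ids 0..{node_count - 1}, got {sorted(ids)}")
    return parts


# parts = check_part_file("graph.part", 4)  # before: mpirun -n 4 ...
```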
Note that this might not work well with the `COMPRESS` option turned on, as the load balance will be thrown off by the removal of some vertices. When using the `COMPRESS` option, we suggest Usage 2.
```sh
mpirun --bind-to none -n <node_count> ./dpsl <graph_file> <partitioner> <partition_config>
```

- `--bind-to none`: Ensures that all the cores on each node are available to the program.
- `-n <node_count>`: The number of nodes to run the program on. It should match the number of partitions generated.
- `<graph_file>`: Same as PSL.
- `<partitioner>`: A partitioner name. Currently "pulp", "metis", "mtmetis" and "mtkahypar" are supported.
- `<partition_config>`: Config file for mtkahypar; not needed for the other partitioners. Note that certain options may be invalid.
We had some trouble with pulp and metis on some runs. The problem is likely in the code handling their integration. As a result, we suggest using mtmetis and mtkahypar whenever possible.
We used data from SuiteSparse and Network Repository for evaluation.
The `scripts/download.py` script can be used to automatically download our dataset on most Linux systems. Please run the script from the project root:

```sh
python3 scripts/download.py
```