Parallel BFS with CUDA

This projects aims to implement Breadth First Search Algorithm on CUDA which would outperform simple sequential implementation.

Nvidia paper: https://research.nvidia.com/publication/scalable-gpu-graph-traversal (Merrill, Garland, Grimshaw)

Goals:

Create Graph class and test it
Implement sequential BFS and test it
Implement (simple) parallel BFS with CUDA and test it
Implement Merill parallel BFS with CUDA and test it
Performance comparison and conclusions

Parallelization strategy

Simple, sequential Breadth First Search has O(|V| + |E|) complexity - we visit every vertex exactly once and every edge at most once. Therefore, it would be desirable for our parallel algorithm to have the same asymptotic complexity.

At first, we will consider quadratic, but sometimes quite effective algorithm. After that, we will consider simple parallelization strategy that has the above complexity. In each BFS level it simply traverses all vertexes in the queue in parallel and creates a new queue. New vertices are added using atomicAdd operation on the position in the output queue in each iteration.

Then, we will try to imlement more sophisticated algorithm. As described in Merill et. al our strategy is following:

expand adjacent neighbors in parallel
implement out-of-vore edge and vertex frontiers
use local prefix sum in place of local atomic operations for determining enqueue offsets
use a best-effort bitmask for efficient neighbor filtering

Requirements:

gtest (https://github.com/google/googletest) for testing

Running instructions:

Compile the program with ./compile
Run the program inputing data source. For example ./bfs_main < data/4elt.graph

The source code has options to read graphs provided both as AdjacencyList and list of edges and this is specified by Format argument in a constructor.

Credits