Contributors: Daryl Lim, Christy Chan
Course: COMP96037 Advanced Databases, Imperial College London
The objective of this coursework is to practice the interplay between storage and processing in the context of a not-quite relational application: counting triangles in a graph.
Triangle counting is one of the core challenges in structural graph analytics. The idea is to find the number of sets of nodes that are connected in a triangle. For this coursework, we make the task slightly more interesting by considering a directed, labeled graph (see Figure).
In this graph, we might be looking for triangles with the edge label sequence
0, 1, 2
and find the triangles 1, 5, 6
and 4, 1, 3
.
Depending on the “connectedness” (the number of edges divided by the number of vertices in the graph), this problem can be more or less challenging (we will use 32 as a connectedness factor).
While this problem could be considered a graph problem (or sparse linear algebra), it is a join problem as well (three joins, actually). Here is a representation of the problem in SQL
Create the graph table:
create table Edges(from int, to int, label int);
Insert some edges:
insert into Edges values(0,1,1);
insert into Edges values(4,1,0);
insert into Edges values(1,2,0);
insert into Edges values(1,3,1);
insert into Edges values(3,2,2);
insert into Edges values(1,5,0);
insert into Edges values(1,6,2);
insert into Edges values(5,6,1);
insert into Edges values(3,4,2);
insert into Edges values(2,7,1);
insert into Edges values(7,2,2);
insert into Edges values(3,7,1);
insert into Edges values(3,8,2);
insert into Edges values(8,4,1);
insert into Edges values(4,7,2);
The triangle query would then be
select from Edges as firstEdge, Edges as secondEdge, Edges as thirdEdge
where const firstEdge.to = secondEdnge.from and firstEdge.label = 0
and secondEdnge.to = thirdEdge.from and secondEdge.label = 1
and thirdEdge.to = firstEdge.from and thirdEdge.label = 2;
There will also be cases in which we deal with a graphs that are being modified between the counting of the triangles. Specifically, we insert the data in batches of 1/8th of the overall dataset, count the triangles, delete some edges and repeat.
The code implements the insertion, querying and deletion of the graph in C using both, a sort-merge join as well as a hashjoin.
To run the program:
git clone ${your repository URL}
cd ${your repository name}
You may want to set up two separate build directories for the code, one for debugging and one for benchmarking. Here is how you could do that:
mkdir Debug
cd Debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
cd ..
mkdir Release
cd Release
cmake -DCMAKE_BUILD_TYPE=Release ..
cd ..
You can compile each by (respectively) typing:
cmake --build Debug
or
cmake --build Release
Note that the first time you build each of these will take a long time since it also builds dependencies.
To run the tests, simply run
./Debug/tests
a successful run output should look like this (pass -? for more options)
===============================================================================
All tests passed (30 assertions in 3 test cases)
To run the benchmarks, simply run
./Release/Benchmarks
if you want to restrict the benchmarks that are being run you can use, for example
./Release/Benchmarks --benchmark_filter='GraphQueryBenchmark<HashjoinImplementation>/64/32'
(64 is the number of nodes in the graph, 32 the average number of edges)
./Benchmarks --benchmark_list_tests
gives you a name of experiments (try ./Benchmarks --help
for more options).