COMP96037 Advanced Databases 2020 Coursework

Contributors: Daryl Lim, Christy Chan

Course: COMP96037 Advanced Databases, Imperial College London

The objective of this coursework is to practice the interplay between storage and processing in the context of a not-quite relational application: counting triangles in a graph.

Triangle counting

Triangle counting is one of the core challenges in structural graph analytics. The idea is to find the number of sets of nodes that are connected in a triangle. For this coursework, we make the task slightly more interesting by considering a directed, labeled graph (see Figure).

./TrianglesVisualized.png

In this graph, we might be looking for triangles with the edge label sequence 0, 1, 2 and find the triangles 1, 5, 6 and 4, 1, 3.

Depending on the “connectedness” (the number of edges divided by the number of vertices in the graph), this problem can be more or less challenging (we will use 32 as a connectedness factor).

The relational view

While this problem could be considered a graph problem (or sparse linear algebra), it is a join problem as well (three joins, actually). Here is a representation of the problem in SQL

Create the graph table:

create table Edges(from int, to int, label int);

Insert some edges:

insert into Edges values(0,1,1);
insert into Edges values(4,1,0);
insert into Edges values(1,2,0);
insert into Edges values(1,3,1);
insert into Edges values(3,2,2);
insert into Edges values(1,5,0);
insert into Edges values(1,6,2);
insert into Edges values(5,6,1);
insert into Edges values(3,4,2);
insert into Edges values(2,7,1);
insert into Edges values(7,2,2);
insert into Edges values(3,7,1);
insert into Edges values(3,8,2);
insert into Edges values(8,4,1);
insert into Edges values(4,7,2);

The triangle query would then be

select from Edges as firstEdge, Edges as secondEdge, Edges as thirdEdge
where const firstEdge.to = secondEdnge.from and firstEdge.label = 0
and secondEdnge.to = thirdEdge.from and secondEdge.label = 1
and thirdEdge.to = firstEdge.from and thirdEdge.label = 2;

Modifications

There will also be cases in which we deal with a graphs that are being modified between the counting of the triangles. Specifically, we insert the data in batches of 1/8th of the overall dataset, count the triangles, delete some edges and repeat.

The task

The code implements the insertion, querying and deletion of the graph in C using both, a sort-merge join as well as a hashjoin.

Getting started

To run the program:

git clone ${your repository URL}
cd ${your repository name}

You may want to set up two separate build directories for the code, one for debugging and one for benchmarking. Here is how you could do that:

mkdir Debug
cd Debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
cd ..
mkdir Release
cd Release
cmake -DCMAKE_BUILD_TYPE=Release ..
cd ..

You can compile each by (respectively) typing:

cmake --build Debug

or

cmake --build Release

Note that the first time you build each of these will take a long time since it also builds dependencies.

Testing

To run the tests, simply run

./Debug/tests

a successful run output should look like this (pass -? for more options)

===============================================================================
All tests passed (30 assertions in 3 test cases)

Benchmarking

To run the benchmarks, simply run

./Release/Benchmarks

if you want to restrict the benchmarks that are being run you can use, for example

./Release/Benchmarks --benchmark_filter='GraphQueryBenchmark<HashjoinImplementation>/64/32'

(64 is the number of nodes in the graph, 32 the average number of edges)

./Benchmarks --benchmark_list_tests

gives you a name of experiments (try ./Benchmarks --help for more options).