WebGraph: A Java repository from lhelwerd

How to set up and use the WebGraph framework

In order to compile the WebGraph framework and set it up so that it can be run,
the following steps should be followed:

- Download and extract Apache Maven binaries from
http://maven.apache.org/download.cgi
- Download the dependencies tarball from http://webgraph.di.unimi.it and
extract it.
- Retrieve the WebGraph framework source code with extensions from this
repository. This version contains the copy list and copy flags compression
formats, as well as additional flags for the other compression schemes.
- Compile the JAR file of the framework using Maven by running "mvn install"
within the WebGraph root directory.
- Copy the target/webgraph-3.4.2.jar file it to the same location as the JAR
files from the dependencies.

In order to use Maven on a computer where it is not yet in the PATH, one
can run mvn.sh provided in the repository. This sets up a link to the Maven
root directory. Note that it only keeps the path setting for the current
session by default. Instructions on adding the paths to a .bashrc file, for
example, are given by the script. Otherwise, one should either run the script
in a bash with ". ./mvn.sh" and run mvn afterwards, or with "./mvn.sh install"
(i.e., in place of mvn).

The framework has different modes of operations that allow compressing,
decompressing and testing graph files in different formats. For example, the
following command recompresses a dataset with other parameters:
java -cp "*" it.unimi.dsi.webgraph.BVGraph -o -m 1 ../sets/uk-2002 ../sets/uk-fast

This command creates an offsets file and alters the maximum reference chain
length of the given dataset.

The following parameters can be used to alter the implemented compression
algorithms:
- -m determines the maximum length of a reference chain that we can make
during reference compression. -1 is essentially the same as allowing any
length of reference chains, while 0 skips reference compression, which is
used by copy list, copy blocks and copy flags compression.
- -w is the window size. 0 skips all reference compression, which is used by
copy list, copy blocks and copy flags compression.
- -i is interval minimum length. 0 skips interval compression.
- -r is residual compression. This is a boolean parameter, where the value
0 skips gaps compression, and 1 performs gaps compression, which is the
default.
- -b determines which copy compression is used. 0 is copy list, 1 is copy
blocks (default) and 2 is copy flags. This is only used if the maximum
reference length (-m) and window size (-w) are not 0.

As we can see, we can set certain parameters to 0 to make sure that their
respective compression schemes are not used. We can therefore test the algorithms separately. To be specific, we have the following parameters for the
given compression formats:

- No compression algorithm: -m 0 -w 0 -i 0 -r 0
- Gaps compression: -m 0 -w 0 -i 0
- Interval compression: -m 0 -w 0 -r 0
- Copy blocks: -i 0 -r 0 -b 1
- Copy list: -i 0 -r 0 -b 0
- Copy flags: -i 0 -r 0 -b 2

For instance, this command creates a graph file with only gaps compression:
java -cp "*" it.unimi.dsi.webgraph.BVGraph -m 0 -w 0 -i 0 ../sets/uk-2002 ../sets/uk-gaps

The compressed files are binary files, with special representations for numeric
values that improve compression compared to plain text formats. The framework
can also decompress the graphs and store them in a readable format. There are
two major plain text formats that the framework supports. The first plain text
format is the naive adjacency list format:
java -cp "*" it.unimi.dsi.webgraph.ASCIIGraph -g BVGraph ../sets/uk-2002 ../sets/uk-raw

The second format is a naive listing of pairs of edges:
java -cp "*" it.unimi.dsi.webgraph.ArcListASCIIGraph -g BVGraph ../sets/uk-2002 ../sets/uk.edges

One can also import a graph from these flat files, by swapping around the
"ASCIIGraph" or "ArcListASCIIGraph" class with the "BVGraph" class. This
requires that those files are sorted correctly and do not contain non-edge
data such as comments. For example:
java -cp "*" it.unimi.dsi.webgraph.BVGraph -g ArcListASCIIGraph ../sets/huge.txt.e ../sets/huge
Here, one can again add specific compression flags to tune which compression
and which settings to use. Note that this will almost always giv increased
compression since it is no longer stored as ASCII text but as binary codes,
skewing the comparison.

The WebGraph framework also provides a speed test module, which has been
adapted to use CPU time instead of wall-clock time. The speed test has two
different modes in which it can operate. By default, the speed test performs
sequential testing, which times how long it takes to expand the whole
compressed graph. The following command can be used to run a sequential speed
test:
java -cp "*" it.unimi.dsi.webgraph.test.SpeedTest -g BVGraph ../sets/uk-gaps

We can also perform random access testing. This mode can be started by giving
the parameter -r. This parameter requires a sample size as value, which
determines how many randomly chosen nodes we use in the test in order to
measure and average the random access times. This command can be used to run
a random access speed test:
java -cp "*" it.unimi.dsi.webgraph.test.SpeedTest -g BVGraph -r 100000 ../sets/uk-gaps

Finally, if the WebGraph framework runs out of memory, one can increase the
Java heap size using the -Xmx command line flag and the Java thread stack size
using -Xss.

lhelwerd/WebGraph