/GraphZip

Mining graph streams using dictionary-based compression

Primary LanguagePythonMIT LicenseMIT

GraphZip: Mining Graph Streams using Dictionary-based Compression

GraphZip is a scalable method for mining interesting patterns in graph streams, based on the Lempel-Ziv class of compression algorithms.

This repository contains a reference implementation of the GraphZip algorithm, written in Python.

For more detailed information regarding the algorithm, see our paper.

Install

Python 3 and python-igraph are required.

Once you have a working version of Python 3, install python-igraph with:

$ pip install python-igraph

Note: python-igraph is not to be confused with the igraph package (renamed to jgraph).

Usage

Run GraphZip directly from the command line with:

$ python graphzip.py graph_file [-n NUM_FILES] [-d] [-a ALPHA] [-t THETA] [-o OUTFILE]

Examples

# Run GraphZip on `test.graph` with a batch size of `5` and a dictionary size of `10`:
python graphzip.py test.graph -a 5 -t 10

# Run GraphZip on files `1.graph` through `100.graph` located in directory `test_graphs/`, using a batch size of 5 and the default dictionary size:
python graphzip.py test_graphs -n 100 -a 5

Options

-n, --num_files

Using -n turns on multi-file mode, and GraphZip will treat graph_file as a directory holding NUM_FILES sequential graph stream files, labelled 1.graph to [NUM_FILES].graph.

-d, --directed

Using -d will make GraphZip read the graph file(s) as directed graphs.

-a, --alpha
-t, --theta

Batch size and dictionary size, respectively (hyperparameters of the GraphZip model).

-o, --outfile

By default, the pattern dictionary is dumped to stdout; use -o to save the dictionary to a specified file.

File format

The correct format for .graph files is:

% '%' to start comments
v [vertex id (int)] [vertex label (int)]
e [source vertex id (int)] [target vertex id (int)] [edge label (int)]

For example:

% example 3-clique with vertex labels "100", "999" and edge labels "1", "2", "3"
v 1 100
v 2 999
v 3 100
e 1 2 1
e 1 3 2
e 2 3 3

Vertices must be defined prior to being referenced by an edge (or the vertex label would be unknown).

In the case of processing a graph stream over sequential .graph files, having Compressor's label_history_per_file property set to False and add_implicit_vertices set to True (the default) allows edges to reference vertices declared in previous files. For further details see the inline comments.

Experiments

Several example experiments are located in the unit tests directory under tests/test_expr.py.

Run the algorithm on several example graphs (located in data/) by navigating to the root project directory then running:

$ python -m unittest

To run a specific test use the format:

$ python -m unittest test.[test_file].[TestSuite].[test_function]

For example, to run the 4-clique test with 20% coverage located in the TestGraphZipSubgen suite:

$ python -m unittest test.test_expr.TestGraphZipSubgen.test_4PATH_20

For details on specific examples, see the files under the test/ directory.

Citations

If you find GraphZip useful in your research, please consider citing:

C. Packer and L. Holder, GraphZip: Mining Graph Streams using Dictionary-based Compression. SIGKDD Workshop on Mining and Learning in Graphs (MLG), August 2017.

@inproceedings{PacHol:mlg17,
  title     = {{GraphZip}: Mining Graph Streams using Dictionary-based Compression},
  author    = {Charles Packer and Larry Holder},
  booktitle = {Proceedings of the 13th International Workshop on Mining and Learning with Graphs (MLG)},
  year      = {2017}
}

License

MIT License - see LICENSE.txt for details.