
LDBC Social Network Benchmark DATAGEN

Primary language: Java. License: GNU General Public License v3.0 (GPL-3.0).


LDBC-SNB Data Generator


The LDBC-SNB Data Generator (Datagen) is responsible for providing the data sets used by all the LDBC benchmarks. It is designed to produce directed labeled graphs that mimic the characteristics of real-world graphs. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official LDBC SNB specification document.

ldbc_snb_datagen is part of the LDBC project. It is licensed under GPLv3; for details about this license, see the LICENSE.txt file.

Quick start

There are three main ways to run Datagen: (1) using a pseudo-distributed Hadoop installation, (2) running the same setup in a Docker image, and (3) running on a distributed Hadoop cluster.

Pseudo-distributed Hadoop node

wget http://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xf hadoop-2.6.0.tar.gz
export HADOOP_CLIENT_OPTS="-Xmx2G"
# set this to the Hadoop 2.6.0 directory
export HADOOP_HOME=
# set this to the repository's directory
export LDBC_SNB_DATAGEN_HOME=
cd $LDBC_SNB_DATAGEN_HOME
./run.sh
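
Both HADOOP_HOME and LDBC_SNB_DATAGEN_HOME are left empty in the commands above and must be filled in before run.sh is called. As a minimal sketch, a pre-flight check along the following lines could catch an unset or mistyped path early. The function name check_datagen_env is hypothetical and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): verify that the two
# variables used by the quick start are set and point to directories.
check_datagen_env() {
  status=0
  for name in HADOOP_HOME LDBC_SNB_DATAGEN_HOME; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "error: $name does not point to a directory: $value" >&2
      status=1
    fi
  done
  return "$status"
}

# Possible usage:
#   check_datagen_env && cd "$LDBC_SNB_DATAGEN_HOME" && ./run.sh
```

The check only confirms that the directories exist. It does not verify that HADOOP_HOME actually contains a Hadoop 2.6.0 installation.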

Docker image

The image can be built with the provided Dockerfile. To build it, execute the following command from the repository directory:

docker build . --tag ldbc/datagen

Configuration

To configure the amount of memory available, set the HADOOP_CLIENT_OPTS variable in the Dockerfile. The default value is -Xmx8G.

Running

To run the container, a params.ini file is required. For reference, see the params*.ini files in the repository. The file is mounted into the container with the --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" option. If required, the source can be set to a different path.

The container writes its results to the /opt/ldbc_snb_datagen/out/ directory, which contains two sub-directories, social_network/ and substitution_parameters/. To save the results of the generation, a directory must be mounted in the container from the host. The driver requires the results to be in the Datagen repository directory. To generate the data, run the following command, which also changes the owner (chown) of the Docker-mounted volumes:

docker run --rm --mount type=bind,source="$(pwd)/",target="/opt/ldbc_snb_datagen/out" --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" ldbc/datagen && \
  sudo chown -R $USER:$USER social_network/ substitution_parameters/
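
Since the run is expected to leave two sub-directories in the mounted host directory, a quick check after the container exits can confirm that both are present. The following is only a sketch. The function name check_datagen_output is hypothetical, and the check says nothing about whether the generated files inside the directories are complete.

```shell
# Hypothetical helper (not part of the repository): confirm that both
# expected output directories exist under the given host directory.
check_datagen_output() {
  out_dir="${1:-.}"   # defaults to the current directory
  for sub in social_network substitution_parameters; do
    if [ ! -d "$out_dir/$sub" ]; then
      echo "missing output directory: $out_dir/$sub" >&2
      return 1
    fi
  done
}

# Possible usage, from the directory that was mounted as the output:
#   check_datagen_output . && echo "Datagen output found"
```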

If you need to raise the memory limit, use the -e HADOOP_CLIENT_OPTS="-Xmx..." parameter to override the default value (-Xmx8G).
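
As an illustration of the override, the sketch below wraps the docker run command from this section in a small function that takes the heap size as an argument. The names run_datagen and DOCKER_BIN are hypothetical; the image tag, mount targets, and the HADOOP_CLIENT_OPTS variable are taken from the instructions in this section. Setting DOCKER_BIN=echo prints the command instead of running it, which is a convenient way to inspect it first.

```shell
# Sketch: run the Datagen container with a custom JVM heap limit.
# run_datagen and DOCKER_BIN are hypothetical names introduced here.
run_datagen() {
  heap="${1:-8G}"   # falls back to the documented default of -Xmx8G
  "${DOCKER_BIN:-docker}" run --rm \
    -e HADOOP_CLIENT_OPTS="-Xmx${heap}" \
    --mount type=bind,source="$(pwd)/",target="/opt/ldbc_snb_datagen/out" \
    --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" \
    ldbc/datagen
}

# Possible usage:
#   DOCKER_BIN=echo run_datagen 16G   # dry run: print the command only
#   run_datagen 16G                   # run with a 16 GB heap limit
```

As with the original command, this assumes it is run from a directory that contains params.ini, and the generated directories may still need the chown step afterwards.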

Hadoop cluster

Instructions are currently not provided. (TBD)

Community provided tools