The LDBC-SNB Data Generator (Datagen) is responsible for providing the data sets used by all the LDBC benchmarks. It is designed to produce directed labeled graphs that mimic the characteristics of graphs found in real data. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official LDBC SNB specification document.
ldbc_snb_datagen is part of the LDBC project.
ldbc_snb_datagen is licensed under GPLv3; see the LICENSE.txt file for details.
There are three main ways to run Datagen: (1) using a pseudo-distributed Hadoop installation, (2) running the same setup in a Docker image, (3) running on a distributed Hadoop cluster.
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xf hadoop-2.6.0.tar.gz
export HADOOP_CLIENT_OPTS="-Xmx2G"
# set this to the Hadoop 2.6.0 directory
export HADOOP_HOME=
# set this to the repository's directory
export LDBC_SNB_DATAGEN_HOME=
cd $LDBC_SNB_DATAGEN_HOME
./run.sh
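Because HADOOP_HOME and LDBC_SNB_DATAGEN_HOME are left empty in the commands above, a small pre-flight check can catch a missing value before run.sh starts. The following is a minimal sketch and not part of the repository: check_datagen_env is a hypothetical helper name, and it assumes run.sh sits at the top level of the repository directory and is executable.

```shell
# Hypothetical helper (not shipped with Datagen): verify the environment
# described above before launching the generator.
check_datagen_env() {
  # Both variables must be non-empty and point at existing directories.
  for var in HADOOP_HOME LDBC_SNB_DATAGEN_HOME; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "error: $var is not set" >&2
      return 1
    fi
    if [ ! -d "$value" ]; then
      echo "error: $var does not point to a directory: $value" >&2
      return 1
    fi
  done
  # Assumption: run.sh lives directly in the repository directory.
  if [ ! -x "$LDBC_SNB_DATAGEN_HOME/run.sh" ]; then
    echo "error: run.sh is missing or not executable in $LDBC_SNB_DATAGEN_HOME" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to chain it in front of the original commands, so that run.sh only starts when the check passes: check_datagen_env && cd "$LDBC_SNB_DATAGEN_HOME" && ./run.sh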
The image can be simply built with the provided Dockerfile. To build, execute the following command from the repository directory:
docker build . --tag ldbc/datagen
To configure the amount of memory available, set the HADOOP_CLIENT_OPTS variable in the Dockerfile. The default value is -Xmx8G.
To run the container, a params.ini file is required. For reference, see the params*.ini files in the repository. The file is mounted in the container by the --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" option. If required, the source can be set to a different path.
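Since the source of the bind mount may point anywhere on the host, it can help to confirm that the file exists and to pass an absolute path before calling Docker. The sketch below illustrates this; params_mount_arg is a hypothetical helper name, not something provided by the repository, and it assumes the path contains no spaces or commas.

```shell
# Hypothetical helper: print the --mount option for a given params file,
# or fail if the file does not exist.
params_mount_arg() {
  # Default to params.ini in the current directory, as in the text above.
  params="${1:-$(pwd)/params.ini}"
  if [ ! -f "$params" ]; then
    echo "error: params file not found: $params" >&2
    return 1
  fi
  # Turn a relative path into an absolute one for the bind mount source.
  case "$params" in
    /*) ;;
    *) params="$(pwd)/$params" ;;
  esac
  printf '%s\n' "--mount type=bind,source=$params,target=/opt/ldbc_snb_datagen/params.ini"
}
```

The printed option can then be spliced into the docker run command line, for example by capturing it with $(params_mount_arg my-params.ini), where my-params.ini is a placeholder file name.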
The container writes its results to the /opt/ldbc_snb_datagen/out/ directory, which contains two sub-directories, social_network/ and substitution_parameters/. To save the results of the generation, a directory must be mounted in the container from the host. The driver requires the results to be in the datagen repository directory. To generate the data, run the following command, which also changes the owner (chown) of the Docker-mounted volumes:
docker run --rm --mount type=bind,source="$(pwd)/",target="/opt/ldbc_snb_datagen/out" --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" ldbc/datagen && \
sudo chown -R $USER:$USER social_network/ substitution_parameters/
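After the command above finishes, it may be worth checking that both output directories were actually produced. The following sketch does only that; check_datagen_output is a hypothetical helper name, and it treats an empty directory as a failed run, which is an assumption rather than documented Datagen behaviour.

```shell
# Hypothetical helper: confirm that the two output directories named
# above exist under the given base directory and are not empty.
check_datagen_output() {
  base="${1:-.}"
  for dir in social_network substitution_parameters; do
    if [ ! -d "$base/$dir" ]; then
      echo "error: missing output directory: $base/$dir" >&2
      return 1
    fi
    # ls -A lists all entries except . and ..; empty output means no files.
    if [ -z "$(ls -A "$base/$dir")" ]; then
      echo "error: output directory is empty: $base/$dir" >&2
      return 1
    fi
  done
  return 0
}
```

Called without arguments from the repository directory, it inspects social_network/ and substitution_parameters/ in the current directory.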
If you need to raise the memory limit, use the -e HADOOP_CLIENT_OPTS="-Xmx..." parameter to override the default value (-Xmx8G).
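A mistyped heap size only surfaces once the JVM starts inside the container, so the value can be validated on the host first. The sketch below builds the option string from a size argument; heap_opts is a hypothetical helper name, and the accepted pattern (digits optionally followed by k, m or g) is a simplifying assumption of this example.

```shell
# Hypothetical helper: turn a heap size such as 16G into the -Xmx option,
# rejecting values that do not match digits plus an optional k/m/g suffix.
heap_opts() {
  size="${1:-}"
  if ! printf '%s' "$size" | grep -Eq '^[0-9]+[kKmMgG]?$'; then
    echo "error: invalid heap size: $size" >&2
    return 1
  fi
  printf '%s\n' "-Xmx$size"
}
```

The result can be passed to the container as -e HADOOP_CLIENT_OPTS="$(heap_opts 16G)", inserted before the image name in the docker run command.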
Instructions are currently not provided. (TBD)
- Apache Flink Loader: A loader of LDBC datasets for Apache Flink.