LDBC SNB Datagen (Spark-based)

The LDBC SNB Data Generator (Datagen) provides the datasets used by the LDBC Social Network Benchmark's workloads. The generator is designed to produce directed labelled graphs that mimic the characteristics of real-world graphs. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official LDBC SNB specification document.

📜 If you wish to cite the LDBC SNB, please refer to the documentation repository.

⚠️ There are two different versions of the Datagen:

The Hadoop-based Datagen generates the Interactive SF1-1000 data sets.
For the BI workload, use the Spark-based Datagen (in this repository).
For the Interactive workload's larger data sets, there is no out-of-the-box solution (see this issue).

Generated small data sets are deployed by the CI.

Quick start

Build the JAR

You can build the JAR with either Maven or SBT.

To assemble the JAR file with Maven, run:
```
./tools/build.sh
```
For faster builds during development, consider using SBT. To assemble the JAR file with SBT, run:
```
sbt assembly
```
⚠️ When using SBT, change the path of the JAR file in the instructions provided in the README (target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -> ./target/scala-2.12/ldbc_snb_datagen-assembly-${DATAGEN_VERSION}.jar).
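
Because the two build tools write the JAR to different locations, a small helper can pick whichever artifact exists. The following is a minimal sketch and not part of the repository: the variable names `MAVEN_JAR`, `SBT_JAR`, and `DATAGEN_JAR` are illustrative, and the version values are the ones used in the run instructions of this README.

```bash
# Versions as used elsewhere in this README; adjust them to match your build.
export PLATFORM_VERSION=2.12_spark3.1
export DATAGEN_VERSION=0.5.0-SNAPSHOT

# Candidate artifact locations (variable names are illustrative).
MAVEN_JAR="./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar"
SBT_JAR="./target/scala-2.12/ldbc_snb_datagen-assembly-${DATAGEN_VERSION}.jar"

# Prefer the Maven artifact and fall back to the SBT assembly.
if [ -f "$MAVEN_JAR" ]; then
  DATAGEN_JAR="$MAVEN_JAR"
elif [ -f "$SBT_JAR" ]; then
  DATAGEN_JAR="$SBT_JAR"
else
  DATAGEN_JAR=""
  echo "No Datagen JAR found; build it with Maven or SBT first." >&2
fi
echo "Using JAR: ${DATAGEN_JAR:-none}"
```

With this in place, `"$DATAGEN_JAR"` can be passed to `./tools/run.py` in place of the hard-coded path, regardless of which build tool produced the JAR.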

Install Python tools

Some of the build utilities are written in Python. To use them, you have to create a Python virtual environment and install the dependencies.

E.g. with pyenv and pyenv-virtualenv:

pyenv install 3.7.7
pyenv virtualenv 3.7.7 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
pip install -U pip 
pip install ./tools

Running locally

The ./tools/run.py script is intended for local runs. To use it, download and extract Spark as follows.

Spark 3.1.x

Spark 3.1.x is the recommended runtime to use. The rest of the instructions are provided assuming Spark 3.1.x.

To place Spark under /opt/:

curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
export SPARK_HOME=/opt/spark-3.1.2-bin-hadoop3.2
export PATH="$SPARK_HOME/bin":"$PATH"

To place Spark under ~/:

curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | tar -xz -C ~/
export SPARK_HOME=~/spark-3.1.2-bin-hadoop3.2
export PATH="$SPARK_HOME/bin":"$PATH"

Both Java 8 and Java 11 are supported.
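
To check which Java version is on the PATH before running the generator, something like the following could be used. It is a sketch under assumptions: the helper name `java_major` is hypothetical, and the parsing relies on `java -version` printing the version as the first double-quoted token, which is the usual format for OpenJDK builds.

```bash
# Extract the major version from a Java version string.
# Legacy scheme: 1.8.0_292 -> 8; current scheme: 11.0.11 -> 11.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the legacy "1." prefix
  esac
  # Keep everything before the first '-', '.' or '_'.
  printf '%s\n' "${v%%[-._]*}"
}

if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr; the version is the first quoted token.
  raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  major=$(java_major "$raw")
  case "$major" in
    8|11) echo "Java $major is supported." ;;
    *)    echo "Java ${major:-unknown} found; this README lists only 8 and 11." ;;
  esac
else
  echo "java not found on PATH."
fi
```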

Once you have Spark in place and have built the JAR file, run the generator as follows:

export PLATFORM_VERSION=2.12_spark3.1
export DATAGEN_VERSION=0.5.0-SNAPSHOT
./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>

Runtime configuration arguments

The runtime configuration arguments determine the amount of memory, the number of threads, and the degree of parallelism. For a list of arguments, see:

./tools/run.py --help

To generate a single part-*.csv file, reduce the parallelism (number of Spark partitions) to 1.

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
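
To confirm the effect of the parallelism setting, the part files can be counted per directory after a run. This is a rough sketch: the helper name `count_parts` is hypothetical, and the default of `out` for `OUT_DIR` is an assumption, so set it to the output directory you actually configured. The search is recursive because no particular directory layout is assumed.

```bash
# Print "<count> <directory>" for every directory under $1
# that contains part-*.csv files, sorted by directory name.
count_parts() {
  find "$1" -type f -name 'part-*.csv' \
    | awk '{ sub(/\/[^\/]*$/, ""); n[$0]++ } END { for (d in n) print n[d], d }' \
    | sort -k2
}

# OUT_DIR is an assumption; point it at your generator output directory.
OUT_DIR="${OUT_DIR:-out}"
if [ -d "$OUT_DIR" ]; then
  count_parts "$OUT_DIR"
else
  echo "Output directory '$OUT_DIR' not found; run the generator first."
fi
```

With a parallelism of 1, each listed directory would be expected to show a count of 1.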

Generator configuration arguments

The generator configuration arguments allow the configuration of the output directory, output format, layout, etc.

To get a complete list of the arguments, pass --help to the JAR file:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --help

Generating CsvBasic files in Interactive mode:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --explode-edges --explode-attrs --mode interactive

Generating CsvCompositeMergeForeign files in BI mode, resulting in compressed .csv.gz files:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip

Generating CSVs in raw mode:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw

For the interactive and bi modes, the --format-options argument allows passing formatting options such as timestamp/date formats, the presence/absence of headers (see the Spark formatting options for details), and whether quoting of the fields in the CSV is required:
```
./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode interactive --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y,header=false,quoteAll=true
```
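
When the list of options grows, the comma-separated value can be assembled from individual key=value pairs instead of being typed as one long string. The following is a minimal sketch; the helper name `join_opts` and the variable `FORMAT_OPTIONS` are illustrative and not part of the tooling.

```bash
# Join all arguments with commas. The function body runs in a subshell,
# so the IFS change does not leak into the calling shell.
join_opts() ( IFS=,; printf '%s\n' "$*" )

FORMAT_OPTIONS=$(join_opts \
  "timestampFormat=MM/dd/y HH:mm:ss" \
  "dateFormat=MM/dd/y" \
  "header=false" \
  "quoteAll=true")

# Show the resulting argument pair.
echo "--format-options $FORMAT_OPTIONS"
```

Passing the value as `--format-options "$FORMAT_OPTIONS"` keeps the space inside the timestamp format within a single argument, which is the same job the `\ ` escape does in the command above.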

To change the Spark configuration directory, adjust the SPARK_CONF_DIR environment variable.

A complex example:

export SPARK_CONF_DIR=./conf
./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode interactive --scale-factor 0.003

It is also possible to pass a parameter file:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --param-file params.ini

Docker image

The Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

./tools/docker-build.sh

See Build the JAR to build the library (e.g. by invoking ./tools/build.sh). Then, run the following:

./tools/docker-run.sh

Elastic MapReduce

We provide scripts to run Datagen on AWS EMR. See the README in the ./tools/emr directory for details.

Larger scale factors

Scale factors SF3k and above are currently being fine-tuned, both to optimize the generator and to tune the distributions.

Graph schema

The graph schema is as follows:

Troubleshooting

When running the tests, they might throw a java.net.UnknownHostException: your_hostname: your_hostname: Name or service not known coming from org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal. The solution is to add an entry of your machine's hostname to the /etc/hosts file: 127.0.1.1 your_hostname.
If you are using Docker and Spark runs out of space, make sure that Docker has enough space to store its containers. To move the location of the Docker containers to a larger disk, stop Docker, edit (or create) the /etc/docker/daemon.json file and add { "data-root": "/path/to/new/docker/data/dir" }, then sync the old folder if needed, and restart Docker. (See more detailed instructions).
If you are using a local Spark installation and run out of space in /tmp (java.io.IOException: No space left on device), set the SPARK_LOCAL_DIRS environment variable to point to a directory with enough free space.
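
The hostname and scratch-space items above can be checked from a shell before starting a run. This is a sketch under assumptions: `SCRATCH_DIR` is a hypothetical location that defaults to a folder in the current directory, so replace it with a path on a disk that has enough free space.

```bash
# 1) Check whether this machine's hostname has an entry in /etc/hosts.
HOST_NAME=$(uname -n)
if grep -qwF -- "$HOST_NAME" /etc/hosts 2>/dev/null; then
  echo "Hostname '$HOST_NAME' is listed in /etc/hosts."
else
  echo "Hostname '$HOST_NAME' is missing; consider adding: 127.0.1.1 $HOST_NAME"
fi

# 2) Point Spark's scratch space away from /tmp.
# SCRATCH_DIR is a hypothetical location; choose a disk with enough free space.
SCRATCH_DIR="${SCRATCH_DIR:-$PWD/spark-scratch}"
mkdir -p "$SCRATCH_DIR"
df -h "$SCRATCH_DIR"   # report free space on the filesystem holding SCRATCH_DIR
export SPARK_LOCAL_DIRS="$SCRATCH_DIR"
```

The export only affects the current shell session, so it has to be set in the same shell that launches `./tools/run.py`.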
