/trillion-graph

A scale demo of Neo4j Fabric spanning up to 1129 machines/shards running a 100TB (LDBC) dataset with 1.2tn nodes and relationships.


Demo application instructions

Overview

This repository contains the code necessary to reproduce the results for the Trillion Entity demonstration that was part of the NODES 2021 Keynote presentation. It contains the store generation code we used, the orchestration scripts for the AWS instances that are needed to run the setup, the queries we executed, and the client that performs the latency measurements. Please read this README in its entirety before proceeding, to make sure you have an understanding of the necessary steps.

More Information

Blog post with more behind-the-scenes information: "Behind the Scenes of Creating the World’s Biggest Graph Database".

The NODES 2021 Keynote recording showing the Trillion Graph Demo live:

A Twitter thread summary of the demo:

How To

What you'll need:

  1. An AWS account with sufficient capacity for the number and type of EC2 instances you'll create, including access to S3. AWS is the default provider this application uses; it should be possible to modify it to use the cloud provider of your choice.
  2. Access to Neo4j Enterprise. Fabric is a Neo4j Enterprise feature, which is distributed under a different license. It needs to be properly installed in your local Maven repository; you can find detailed instructions in the Neo4j Documentation.

The directory structure is as follows:

  1. cypher contains the individual Cypher queries that were used in the demo
  2. server contains the data generation code and the instance orchestration
  3. client contains the client for the latency measurements
  4. guide contains a Neo4j Browser guide which explains the LDBC schema and queries

Outline

Here we'll describe the basic steps you'll need to take. Detailed instructions are provided further down.

Familiarize yourself with the code.

The code provided should be straightforward to understand. You should take some time to familiarize yourself with it, since you'll need to provide information specific to your environment. The two main files to look at are the FabricDataGenerator and AmazonController, which you can find under the server directory. The first creates the stores both locally and remotely, and the second orchestrates the AWS Neo4j instances. They are structured as scripts, so you can modify them as you like. You will need to edit the code to execute the various steps and configure the setup to your requirements.

Create the stores

You should first create the Person and Template databases. The former is the full Person shard and the latter is the basis for the Forum shards. Typically, you will create these two locally, upload them to S3, and then orchestrate EC2 instances with the AmazonController to generate the Forum shards en masse. Of course, with minimal changes, you can do everything locally, in one step, and then move the databases to the Fabric shards however you prefer.

Instantiate the Shards

The AmazonController class can be used to install and configure Neo4j and the shards. You will need to modify the code to execute the appropriate commands for your setup, but the basic AWS orchestration steps will be the same as for the store generation.
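As an illustration of the shard configuration step, the following is a minimal sketch, not the repository's code, of generating the Fabric section of neo4j.conf for a list of shard hosts. It assumes Neo4j 4.x-style setting names (fabric.database.name, fabric.graph.<ID>.uri, fabric.graph.<ID>.name, fabric.graph.<ID>.database); check them against the documentation for your Neo4j version. The class name, host addresses, port, and shard naming scheme are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; AmazonController may configure the shards differently. */
final class FabricConfigSketch {

    /**
     * Builds neo4j.conf lines that register each shard host as a Fabric graph.
     * Assumption: Neo4j 4.x-style Fabric settings and the default Bolt port 7687.
     */
    static List<String> fabricConfig(String fabricDbName, List<String> shardHosts, String shardDatabase) {
        List<String> lines = new ArrayList<>();
        lines.add("fabric.database.name=" + fabricDbName);
        for (int i = 0; i < shardHosts.size(); i++) {
            String prefix = "fabric.graph." + i + ".";
            lines.add(prefix + "uri=neo4j://" + shardHosts.get(i) + ":7687");
            lines.add(prefix + "name=shard" + i);          // illustrative naming scheme
            lines.add(prefix + "database=" + shardDatabase);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Placeholder private addresses; substitute those of your EC2 instances.
        List<String> hosts = List.of("10.0.0.1", "10.0.0.2");
        fabricConfig("fabric", hosts, "neo4j").forEach(System.out::println);
    }
}
```

Generating these lines from the list of instance addresses keeps the proxy configuration consistent with whatever number of shards you start.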

Build, install and run the application

The last step is to locally build and run the UI for the demo. With that, you'll be able to take latency measurements and explore the schema you built.
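To make the idea of a latency measurement concrete, here is a minimal sketch, independent of the client in this repository, that times repeated executions of a task and reports nearest-rank percentiles. The LatencyProbe class and the stand-in workload are hypothetical; in a real run the task would execute one of the Cypher queries against the Fabric proxy.

```java
import java.util.Arrays;

/** Hypothetical latency probe; not the repository's client. */
final class LatencyProbe {

    /** Runs the task `runs` times and returns each wall-clock duration in nanoseconds. */
    static long[] measure(Runnable task, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    /** Nearest-rank percentile, p in (0, 100], computed over a sorted copy of the samples. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with a call that runs a query and consumes its result.
        Runnable fakeQuery = () -> {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                System.out.println(sum); // never true; keeps the loop from being optimized away
            }
        };
        long[] samples = measure(fakeQuery, 1_000);
        System.out.println("p50 (us): " + percentile(samples, 50) / 1_000);
        System.out.println("p99 (us): " + percentile(samples, 99) / 1_000);
    }
}
```

Reporting percentiles rather than a mean keeps occasional slow outliers, such as JVM warm-up or network hiccups, from hiding the typical query latency.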