This repo contains demo code from the 2022 GDS February Webinar - "Graph Data Science for Really Big Data". The exact pattern here may vary slightly from what you have seen in the webinar, most of the commands have been placed in notebooks for example, but the overall steps should be the same.
The purpose of this demo is to explore engineering graph features using Neo4j and the Graph Data Science (GDS) Library on a larger dataset to see if we can improve accuracy for a classification problem.
The graph used here is the MAG240M OGB Large-Scale-Challenge Graph. It is a heterogeneous academic paper graph that contains around 240 Million Nodes and 1.7 Billion Relationships.
This demo walks through multiple steps including running a reference model before using graph, formatting and importing data into Neo4j, analyzing the graph and engineering graph features with GDS, and exporting data to re-run a model with those graph features.
The demo here is ultimately split up into 8 parts, 7 of which are ipython notebooks. Hopefully the file names are descriptive as to what they cover
-
Parts 1 and 2 focus on understanding the data and running a classification model with available features before leveraging Neo4j/GDS/graph
part1-get-data-and-non-graph-modeling-prep.ipynb
part2-simple-non-graph-model-and-pca.ipynb
-
Parts 3-5 are focused on pre-formatting the data and importing into graph
part3-prepare-papers-for-import.ipynb
part4-prepare-authors-and-inst-for-import.ipynb
part5-admin-import.md
-
Part 6 and 7 focus on work in Neo4j and GDS. Part 6 is mostly inspecting the graph and demoing native projections and the WCC algorithm. Part 7 is focused on actually generating and exporting graph features (FastRP Node Embeddings)
part6-analysis-in-neo4j-gds.ipynb
part7-graph-feature-engineering-in-gds.ipynb
-
Finally Part 8 re-runs the classification model with the graph features (FastRP Node Embeddings). In this very rough exploratory first pass we get an ~9% point increase in classification accuracy.
part8-graph-feature-model.ipynb
- Neo4j = Enterprise Edition 4.4.3
- GDS = Enterprise Edition 1.8.3
- APOC = 4.4.0.3
- Python = 3.9.7
Important Note: Enterprise (as opposed to Community) Editions were used for both the Neo4j Database and GDS library in this demo. The use of GDS Enterprise, in particular, provides high-concurrency and optimized in-memory compression which are not available in Community Edition and key to performance at these scales.
This demo was run on a single AWS ec2 x1.16xlarge instance (64 vCPUs, 976 GB Memory).
I tweaked a few things but the below are the most critical which you can update in the neo4j settings/configuration (a.k.a neo4j.conf
)
dbms.memory.heap.max_size=760G
gds.export.location=/data/neo-export
# or set to whatever directory you would like data exports from Neo4j to go
Depending on your environment and specific needs you may need to tune this and other configuration like min heap size, pagecache, etc. For more details on optimizing Neo4j configuration for data science and analytics at scale I recommend looking into the Graph Data Science Configuration Guide.
This demo was just a rough first pass to explore what is possible. There are many ways to improve upon this analysis! Here are just a few areas to experiment:
- Improved tuning of FastRP node Embeddings
- Inclusion of more graph features
- Streamlined data formatting and ETL
- Better-tuned and/or more sophisticated classification models and frameworks
- Exploration of Semi-supervised transductive approaches to label the rest of the papers, such as Label Propagation Algorithm (LPA) or K-Nearest Neighbor (KNN)