Graph databases are a powerful way to represent real-world data in a simple and intuitive manner. They can effectively capture the inherent relationships within the data and provide meaningful insights that cannot be obtained using traditional relational databases.
Neo4j is a leading graph database platform that offers strong capabilities for storing and querying large-scale enterprise data, and it can easily scale to accommodate millions of nodes without hindering performance. Moreover, it has great community support and a large number of plugins available for carrying out various tasks. Head over to their official website for more details.
Machine learning on graph data has been the talk of the town for quite a while now. With the advantages of using graphs being quite evident, machine learning algorithms can be applied to graphs for tasks such as graph analysis, link prediction, and clustering.
Graph embeddings are a way to encode graph data as vectors that effectively capture structural information, such as the graph topology and the node-to-node relationships in the graph database. These embeddings can then be ingested by ML algorithms to perform various tasks.
Graph embeddings can be used to perform various tasks, including machine learning tasks. For example, the embeddings of two nodes can be used to determine whether a relationship could exist between them. Or, given a particular node and a relation, embeddings can be used to find similar nodes and rank them using similarity search algorithms. Common applications include knowledge graph completion and drug discovery, where new relations can be discovered between two nodes, and link prediction and recommendation systems in cases such as social network analysis, where potential new friendships can be found.
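As a minimal illustration of how two node embeddings can be compared for link prediction, the sketch below scores a pair of toy vectors with cosine similarity (the vectors are made-up placeholder values, not real PyEmbeo output):

```python
import numpy as np

# Toy embeddings for two nodes (illustrative values only)
node_a = np.array([0.2, 0.9, 0.4])
node_b = np.array([0.3, 0.8, 0.5])

# Cosine similarity between the two embeddings; a high score suggests
# the nodes are close in embedding space, so a link is more plausible
score = np.dot(node_a, node_b) / (np.linalg.norm(node_a) * np.linalg.norm(node_b))
print(score > 0.9)  # True for these two nearby vectors
```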
PyEmbeo is a Python project that creates graph embeddings for a Neo4j graph database. The link to the Neo4j database can be passed to the script through a command line interface to generate graph embeddings. Other parameters (such as the number of epochs for training) can be configured by creating or editing the "config.yml" file. (See config_link for all the configurable parameters.) The obtained embeddings can then be used to perform other tasks such as similarity search, scoring, or ranking. (Note: currently only the similarity search task has been implemented; other tasks are still in development.)
- Neo4j database and py2neo
- conda (or miniconda)
- python >= 3.5
Also, ensure that the APOC plugin for Neo4j is installed and configured for your database. Make sure the following lines are added to the 'neo4j.conf' file:
apoc.import.file.enabled=true
apoc.export.file.enabled=true
dbms.security.procedures.whitelist=apoc.*
apoc.import.file.use_neo4j_config=false
- Clone the repository and navigate into the directory:
git clone <link>
cd ./PyEmbeo
- create a conda environment and activate it by running:
conda env create -f requirements.yml
- This creates a conda environment called pyembeo and installs all the requirements. Activate the environment by executing:
conda activate pyembeo
PyEmbeo uses torchbiggraph to generate graph embeddings. PyTorch-BigGraph is a tool that can create graph embeddings for very large, multi-relational graphs without the need for computing resources such as GPUs. For more details, you can refer to the PyTorch-BigGraph documentation.
The script uses the config.yml file to configure all the training parameters. The file has been preconfigured with default parameters, and only a minimal set of parameters needs to be passed through the command line. However, the parameters can be tweaked by editing the config.yml file.
The command line interface takes the following parameters:
- project_name: the root directory that will store the required data and embedding checkpoint files.
- url: the URL to the Neo4j database in the format bolt (or http)://(ip of the database):(port number). By default the URL is configured to bolt://localhost:7687. You will then be prompted to enter the username and password to connect to the database.
- config_path: an optional parameter that specifies the path to a 'config.yml' file in case the default parameters are edited.
To get all the parameters execute:
python embed.py --help
To launch the training script for creating graph embeddings execute the following command from the project directory:
python embed.py train --project_name=sampleproject --url=bolt://localhost:7687
This will create a folder called sampleproject in the current directory, which will store all the required data and checkpoint files.
Once the training is done, the embeddings will be saved to the sampleproject/model directory.
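The saved embeddings are HDF5 files, so they can be inspected with h5py. The snippet below is a self-contained sketch: it writes a small demo file first, since the real file name under sampleproject/model/ depends on the entity type and checkpoint version (the path and the "embeddings" dataset key are assumptions based on the torchbiggraph output format):

```python
import h5py
import numpy as np

# Hypothetical stand-in for an embeddings file produced under
# sampleproject/model/ (the actual file name varies per checkpoint)
path = "embeddings_demo.h5"

# Create a small demo file so this snippet runs on its own
with h5py.File(path, "w") as f:
    f.create_dataset("embeddings", data=np.random.rand(10, 400).astype(np.float32))

# Read the embedding matrix back: one row per node, one column per dimension
with h5py.File(path, "r") as f:
    emb = f["embeddings"][:]
print(emb.shape)  # (10, 400)
```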
A common task using graph embeddings is performing similarity search to return similar nodes, which can then be used to find undiscovered relationships.
PyEmbeo uses FAISS, a library for fast similarity search over large numbers of vectors. A similarity search can be triggered by passing the node id of a particular node (any other property can also be passed, but it will be computationally heavy). More details can be found in the official documentation or this post and this post.
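To make the search concrete without requiring FAISS itself, here is a brute-force numpy version of the nearest-neighbour lookup that FAISS accelerates: given the embedding of a query node, rank all node embeddings by L2 distance (the embedding values here are random placeholders):

```python
import numpy as np

# 100 fake node embeddings of dimension 16 (placeholder data)
rng = np.random.default_rng(0)
embeddings = rng.random((100, 16)).astype(np.float32)
query = embeddings[42]  # pretend node id 42 is the query node

# L2 distance from the query to every node, then take the closest six;
# the query node itself is always its own nearest neighbour
dists = np.linalg.norm(embeddings - query, axis=1)
nearest = np.argsort(dists)[:6]
print(nearest[0])  # 42
```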
The similarity search script takes arguments similar to those of the training script, along with a few extra ones:
- project_name: the root directory that will store the required data and embedding checkpoint files.
- url: the URL to the Neo4j database in the format bolt (or http)://(ip of the database):(port number). By default the URL is configured to bolt://localhost:7687.
- node: the node id of any node present in the graph.
To get all the parameters execute:
python task.py --help
The script first creates FAISS indexes if they are not already created, and then returns the n most similar nodes for the given node (default n = 5).
To execute the similarity search task, run the following command from the project directory:
python task.py similarity --project_name=sampleproject --node=1234 --url=bolt://localhost:7687/
A root directory with the name given by the --project_name argument is created along with its subfolders:

|-- my_project_name/
|------ data/
|---------- graph_partitioned/
|----------------- edges .h5 files
|---------- files related to the nodes (.json, .txt, .tsv files)
|------ model/
|---------- index/
|----------------- .index files
|---------- config.json
|---------- embedding files (.h5 and .txt files)
|------ metadata.json
data/: stores all the data-related files, such as:
- entity_names (.json): stores a list of the node ids of the entities
- entity_count (.txt): stores the total count of entities
- graph.tsv: stores the graph data in TSV format, which is used as the input for training graph embeddings
- graph_partitioned/: edges (.h5) files that store the edge list
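The TSV edge list is simply one edge per line. The sketch below parses a hypothetical fragment laid out as source id, relation type, target id (the exact column order PyEmbeo emits may differ):

```python
import csv
import io

# Hypothetical graph.tsv fragment: source <TAB> relation <TAB> target
tsv = "1\tFRIENDS_WITH\t2\n1\tLIKES\t3\n2\tFRIENDS_WITH\t3\n"

# Parse it the same way a real file would be read
edges = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
print(len(edges))  # 3
print(edges[0])    # ['1', 'FRIENDS_WITH', '2']
```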
model/: stores the checkpoint and embeddings files created during training.
- config.json: a configuration file created from the config.yml file, used by torchbiggraph for training
- embeddings (.h5): store the graph embeddings
- checkpoint_version (.txt): stores the latest checkpoint version of the embeddings
metadata.json stores data about the number of nodes, labels, and types of relationships.
Default parameters can be overridden by editing or creating a config.yml file. Most of the parameters are used by torchbiggraph, and more details about each can be found in the PyTorch-BigGraph documentation. Some of the editable parameters include:
- EMBEDDING_DIMENSIONS: size of the embedding vectors. Defaults to 400.
- EPOCHS: number of training iterations to perform. Defaults to 20.
- NUM_PARTITIONS: the number of partitions to divide the nodes into. This is used by torchbiggraph, which will divide the nodes of a particular type. Defaults to 1.
torchbiggraph uses the concept of operators and comparators for scoring while training the graph embeddings. More details can be found at: comparators and operators
- operator: can be 'none', 'diagonal', 'translation', 'complex_diagonal', 'affine' or 'linear'. Defaults to 'complex_diagonal'.
- comparator: can be 'dot', 'cos', 'l2' or 'squared_l2'. Defaults to 'dot'.
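The four comparators reduce to simple vector operations. This is a minimal numpy sketch of what each one computes for a (transformed) source embedding and a target embedding; negating the two distance-based scores so that larger always means "more similar" is a convention assumed here, not taken from the PyEmbeo code:

```python
import numpy as np

def score(u, v, comparator="dot"):
    """Score a pair of embeddings; higher means more similar."""
    if comparator == "dot":
        return float(np.dot(u, v))
    if comparator == "cos":
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    if comparator == "l2":
        return float(-np.linalg.norm(u - v))      # negated distance
    if comparator == "squared_l2":
        return float(-np.sum((u - v) ** 2))       # negated squared distance
    raise ValueError(f"unknown comparator: {comparator}")

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(score(u, v, "dot"))          # 0.0 (orthogonal vectors)
print(score(u, v, "squared_l2"))   # -2.0
```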
The similarity search parameters can also be tweaked accordingly:
- FAISS_INDEX_NAME: the type of index to use for similarity searching. Defaults to IndexIVFFlat. Currently only the IVFFlat and FlatL2 index types are supported. See index types for details on the types of indexes.
- NEAREST_NEIGHBORS: number of similar nodes to return. Defaults to 5.
- NUM_CLUSTER: the number of clusters created by the clustering algorithm while building the index.