The project is using the following datasets for testing.
Structure of these datasets can be found here.
Note: The neo4j queries can be found in the ML notebooks; many of them are reused in the GraphSAGE notebooks as well.
Link prediction is one of the most important research topics in the field of graphs and networks. Its objective is to identify pairs of nodes that are likely to form a link in the future.
We are given a citation network in which authors have collaborated with each other in the past. Our task is to predict which pairs of authors are likely to collaborate (i.e. become co-authors) in the future.
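As a rough illustration of how a co-author graph can be derived from paper records, the sketch below counts how often each pair of authors appears on the same paper. The `papers` list and the `coauthor_edges` helper are hypothetical and only stand in for the real dataset, whose structure may differ.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """Count how often each unordered pair of authors shares a paper."""
    counts = Counter()
    for authors in papers:
        # De-duplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data: each inner list is the author list of one paper.
papers = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["carol", "dave"],
]

edges = coauthor_edges(papers)
print(edges[("alice", "bob")])  # number of papers alice and bob share
```

Each key of `edges` is one undirected co-author link, and the count could serve as an edge weight.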
For storing the data, we are using neo4j (for both the ML and the Deep Learning techniques). Cypher is used for data manipulation.
For connectivity with the Python data science ecosystem, py2neo is used.
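A minimal sketch of pulling co-author pairs from neo4j into Python with py2neo might look like the following. The `Author` label, the `CO_AUTHOR` relationship type, the `name` property, and the connection details are all assumptions made for illustration; the actual schema and credentials used in the notebooks may differ.

```python
# Assumed schema: (:Author {name})-[:CO_AUTHOR]-(:Author {name}).
CO_AUTHOR_QUERY = """
MATCH (a:Author)-[:CO_AUTHOR]-(b:Author)
WHERE a.name < b.name
RETURN a.name AS source, b.name AS target
LIMIT $limit
"""


def fetch_coauthor_pairs(uri="bolt://localhost:7687", user="neo4j",
                         password="password", limit=1000):
    """Run the query and return a list of {'source': ..., 'target': ...} dicts."""
    # Imported lazily so this snippet can be loaded without py2neo installed.
    from py2neo import Graph

    graph = Graph(uri, auth=(user, password))  # placeholder credentials
    # Keyword arguments to run() are passed as Cypher parameters.
    return graph.run(CO_AUTHOR_QUERY, limit=limit).data()
```

The `WHERE a.name < b.name` clause keeps each undirected pair only once.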
To install a neo4j instance on a Linux VM, you can follow this.
Here we use the following techniques to compute similarity measures, which also give an idea of the structure and topology of the graph.
- Common Neighbours
- Preferential Attachment
- Total Neighbours
- Triangle Completion and Clustering Coefficients
- Label Propagation
- Louvain Algorithm
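The first few measures in the list above can be sketched in plain Python on an in-memory adjacency map. This is only an illustration using the usual definitions (size of the neighbour intersection, product of degrees, size of the neighbour union, local clustering coefficient), not the queries used in the notebooks. The toy `edges` list is hypothetical, and Label Propagation and Louvain are not covered here.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])


def total_neighbours(adj, u, v):
    return len(adj[u] | adj[v])


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbours that are themselves connected."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each connected neighbour pair is seen twice, once from each endpoint.
    triangles = sum(len(adj[n] & nbrs) for n in nbrs) // 2
    return triangles / (k * (k - 1) / 2)


# Hypothetical toy graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)

print(common_neighbours(adj, "a", "d"))
print(preferential_attachment(adj, "a", "d"))
print(total_neighbours(adj, "a", "d"))
print(local_clustering(adj, "c"))
```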
The scores obtained from these techniques can be used on their own to predict future links. For better performance, they can be fed as features into an ML model. We use a Random Forest classifier for this purpose (acting as a binary classifier).
The dataset is too large to process as a whole, so we use subsets of the data to perform our tasks. (Notebooks with different subsets of the data have been added.)
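One way to turn such scores into a binary classification problem is sketched below, assuming scikit-learn is available. The helper names, the three-feature layout, the toy graph, and the hyperparameters are illustrative assumptions rather than the configuration used in the notebooks. In practice the adjacency map used for features should not contain the positive pairs being predicted.

```python
def pair_features(adj, u, v):
    """Features for a candidate pair: [common, preferential attachment, total]."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    return [len(nu & nv), len(nu) * len(nv), len(nu | nv)]


def build_dataset(adj, positive_pairs, negative_pairs):
    """Label pairs that later became links as 1 and sampled non-links as 0."""
    X = [pair_features(adj, u, v) for u, v in positive_pairs + negative_pairs]
    y = [1] * len(positive_pairs) + [0] * len(negative_pairs)
    return X, y


def train_link_classifier(X, y, seed=0):
    # Imported lazily so the feature helpers work without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier

    # Hyperparameters are placeholders, not tuned values.
    clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=seed)
    clf.fit(X, y)
    return clf


# Hypothetical toy graph and candidate pairs.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}, "e": set()}
X, y = build_dataset(adj, positive_pairs=[("a", "d")], negative_pairs=[("a", "e")])
print(X, y)
```

With a realistically sized dataset, `train_link_classifier(X, y).predict_proba(X_new)[:, 1]` would give the estimated probability that each candidate pair forms a link.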
Notebooks for the above techniques
We leverage the power of Graph Neural Networks to perform link prediction in our co-author graph. We use GraphSAGE as implemented in the stellargraph library for this.
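To give a feel for what GraphSAGE computes, the sketch below performs one untrained mean-aggregation step in plain Python: sample a fixed number of neighbours, average their features, and concatenate the result with the node's own features. It is a conceptual illustration only and not the stellargraph API. The real model additionally applies learned weight matrices and a non-linearity, and all names and data here are hypothetical.

```python
import random


def sample_neighbours(adj, node, k, rng):
    """Sample k neighbours with replacement; use the node itself if isolated."""
    nbrs = sorted(adj.get(node, ()))
    if not nbrs:
        return [node] * k
    return [rng.choice(nbrs) for _ in range(k)]


def mean_aggregate(features, adj, node, k, rng):
    """Concatenate a node's features with the mean of sampled neighbour features."""
    sampled = sample_neighbours(adj, node, k, rng)
    dim = len(features[node])
    mean = [sum(features[n][i] for n in sampled) / k for i in range(dim)]
    return features[node] + mean


def link_score(h_u, h_v):
    """Inner product of two node embeddings as a simple link score."""
    return sum(a * b for a, b in zip(h_u, h_v))


# Hypothetical toy graph and node features.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
rng = random.Random(0)

h_a = mean_aggregate(features, adj, "a", k=3, rng=rng)
h_c = mean_aggregate(features, adj, "c", k=3, rng=rng)
print(h_a, h_c, link_score(h_a, h_c))
```

Because the aggregation depends only on node features and sampled neighbourhoods, the same procedure can be applied to authors that were not seen during training, which is the inductive property GraphSAGE is known for.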
The following notebooks have been added (Citation v11 used):
- Without LDA (Latent Dirichlet Allocation for topic modelling)
- With LDA (larger feature set)
- Different metrics
- Author Recommendation
- Weighted
[1] Graph Algorithms: Practical Examples in Apache Spark and Neo4j, By Mark Needham & Amy E. Hodler
[2] GraphSAGE: Inductive Representation Learning on Large Graphs