grape: A Jupyter Notebook repository from pnrobinson

GraPE

GraPE (Graph Processing and Embedding) is a fast graph processing and embedding library, designed to scale to large graphs and to run both on off-the-shelf laptop and desktop computers and on High Performance Computing clusters of workstations.

The library is written in Rust and Python, and has been developed by [AnacletoLAB](https://anacletolab.di.unimi.it/) (Dept. of Computer Science of the University of Milan), in collaboration with the [Robinson Lab - Jackson Laboratory for Genomic Medicine](https://www.jax.org/research-and-faculty/research-labs/the-robinson-lab) and with the [BBOP - Lawrence Berkeley National Laboratory](http://www.berkeleybop.org/index.html).

GraPE is composed of two main modules, Ensmallen (ENabler of SMALL runtimE and memory Needs) and Embiggen (EMBeddInG GENerator), which work together using parallel computation and efficient data structures.

Ensmallen efficiently executes graph processing operations, including large-scale first and second-order random walks. Embiggen uses the large number of random walks sampled by Ensmallen to compute node and edge embeddings. These embeddings can be used for unsupervised exploratory analysis of graphs, or to train either the flexible neural models provided by Embiggen itself or other Machine Learning models for edge and node label prediction problems.

Main functionalities of the library

TO DO

Installation of GraPE

Install it from PyPI using pip:

pip install grape

Tutorials

You can find tutorials covering various aspects of the GraPE library here. All tutorials are as self-contained as possible and can be immediately executed on COLAB.

If you want to get started quickly, after installing GraPE from PyPI as described above, you can try running the following SkipGram on Cora example:

from ensmallen.datasets.linqs import Cora
from ensmallen.datasets.linqs.parse_linqs import get_words_data
from embiggen.pipelines import compute_node_embedding
from embiggen.visualizations import GraphVisualization
import matplotlib.pyplot as plt

# Download and load the graph and its node features
graph, node_features = get_words_data(Cora())

# Compute a SkipGram node embedding, using a second-order random walk sampling
node_embedding, training_history = compute_node_embedding(
    graph,
    node_embedding_method_name="SkipGram",
    # Let's increase the probability of exploring the local neighbourhood
    return_weight=2.0,
    explore_weight=0.1
)

# Visualize the obtained node embeddings
visualizer = GraphVisualization(graph, node_embedding_method_name="SkipGram")
visualizer.fit_transform_nodes(node_embedding)

visualizer.plot_node_types()
plt.show()

You can see a tutorial detailing the above script here, and you can run it on COLAB from here.
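For some intuition about the `return_weight` and `explore_weight` parameters used above, here is a minimal pure-Python sketch of a node2vec-style second-order random walk. It is not Ensmallen's implementation. It assumes that `return_weight` scales the chance of stepping back to the previous node and that `explore_weight` scales the chance of moving to a node that is not adjacent to the previous one. The `second_order_walk` helper and the toy adjacency dictionary are hypothetical names introduced only for this illustration.

```python
import random


def second_order_walk(adjacency, start, length,
                      return_weight=1.0, explore_weight=1.0, seed=0):
    """Sample one second-order random walk (illustrative sketch only).

    `adjacency` maps each node to the set of its neighbours in an
    undirected graph. The weighting scheme is an assumption modelled on
    node2vec, not a copy of what Ensmallen does internally.
    """
    rng = random.Random(seed)
    walk = [start]
    previous, current = None, start
    for _ in range(length - 1):
        neighbours = sorted(adjacency[current])
        if not neighbours:
            break
        if previous is None:
            # The first step has no history, so it is a uniform choice.
            weights = [1.0] * len(neighbours)
        else:
            weights = []
            for candidate in neighbours:
                if candidate == previous:
                    # Going back to where we came from.
                    weights.append(return_weight)
                elif candidate in adjacency[previous]:
                    # Staying at distance one from the previous node.
                    weights.append(1.0)
                else:
                    # Moving further away from the previous node.
                    weights.append(explore_weight)
        if sum(weights) <= 0:
            break
        next_node = rng.choices(neighbours, weights=weights, k=1)[0]
        walk.append(next_node)
        previous, current = current, next_node
    return walk


# Toy path graph 0 - 1 - 2 - 3, used only for illustration.
toy_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(second_order_walk(toy_graph, start=0, length=8,
                        return_weight=2.0, explore_weight=0.1))
```

With a high `return_weight` and a low `explore_weight`, as in the Cora example, the sampled walks in this sketch tend to stay close to their starting node.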

Documentation

The documentation website of the library is currently under development.

Using the automatic method suggestions utility

To make getting started with the Ensmallen library easier, we provide an integrated recommender system meant to help you either find a method or, if a method has been renamed for any reason, find its new name.

Let's suppose you are using the STRING Homo Sapiens graph, and you'd like to compute its connected components. You could reasonably expect that, if such a method exists, its name contains a term related to components, so after loading the graph you could try the following:

from ensmallen.datasets.string import HomoSapiens

graph = HomoSapiens()
graph.components

The code above raises the following error, which lists similarly named methods and should point you to the one you need.

AttributeError                            Traceback (most recent call last)
<ipython-input-3-52fac30ac7f6> in <module>()
----> 2 graph.components

AttributeError: The method 'components' does not exists, did you mean one of the following?
* 'remove_components'
* 'connected_components'
* 'strongly_connected_components'
* 'get_connected_components_number'
* 'get_total_edge_weights'
* 'get_mininum_edge_weight'
* 'get_maximum_edge_weight'
* 'get_unchecked_maximum_node_degree'
* 'get_unchecked_minimum_node_degree'
* 'get_weighted_maximum_node_degree'

So the method we want for computing the connected components is connected_components.
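For intuition, suggestions of this kind can be produced by comparing the missing attribute name against the names an object actually exposes. The sketch below does this with `difflib.get_close_matches` from the Python standard library. It is only an illustration under that assumption and says nothing about how Ensmallen implements its recommender; the `ToyGraph` class and its placeholder methods are hypothetical.

```python
import difflib


class ToyGraph:
    """Hypothetical stand-in object with a few placeholder methods."""

    def connected_components(self):
        return "placeholder result"

    def remove_components(self):
        return "placeholder result"

    def get_node_degrees(self):
        return "placeholder result"

    def __getattr__(self, name):
        # __getattr__ is only called when the normal lookup has failed,
        # so existing methods are unaffected.
        candidates = [
            attribute for attribute in dir(type(self))
            if not attribute.startswith("_")
        ]
        # Rank the public names by string similarity to the missing one.
        suggestions = difflib.get_close_matches(
            name, candidates, n=10, cutoff=0.4
        )
        message = "The method '{}' does not exist".format(name)
        if suggestions:
            message += ", did you mean one of the following?\n"
            message += "\n".join(
                "* '{}'".format(suggestion) for suggestion in suggestions
            )
        raise AttributeError(message)


try:
    ToyGraph().components
except AttributeError as error:
    print(error)
```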

Now, to read the method documentation, the easiest way is to use Python's [help](https://docs.python.org/3/library/functions.html#help) as follows:

help(graph.connected_components)

The call above prints:

connected_components(verbose) method of builtins.Graph instance
Compute the connected components building in parallel a spanning tree using [bader's algorithm](https://www.sciencedirect.com/science/article/abs/pii/S0743731505000882).

**This works only for undirected graphs.**

The returned quadruple contains:
- Vector of the connected component for each node.
- Number of connected components.
- Minimum connected component size.
- Maximum connected component size.

Parameters
----------
verbose: Optional[bool]
    Whether to show a loading bar or not.


Raises
-------
ValueError
    If the given graph is directed.
ValueError
    If the system configuration does not allow for the creation of the thread pool.

You can try to run the code described above on COLAB.
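To make the shape of the returned quadruple concrete, here is a small pure-Python sketch that computes the same four values on a toy undirected graph. It uses a sequential union-find, not the parallel spanning-tree algorithm mentioned in the docstring, and it does not call Ensmallen. The function name `connected_components_sketch` and the edge-list input format are assumptions made for this illustration.

```python
def connected_components_sketch(number_of_nodes, edges):
    """Return (component per node, number of components, min size, max size).

    Illustrative sequential union-find over an undirected edge list.
    """
    parent = list(range(number_of_nodes))

    def find(node):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, destination in edges:
        root_source, root_destination = find(source), find(destination)
        if root_source != root_destination:
            parent[root_destination] = root_source

    # Relabel the roots as consecutive component identifiers.
    labels = {}
    components = []
    sizes = []
    for node in range(number_of_nodes):
        root = find(node)
        if root not in labels:
            labels[root] = len(labels)
            sizes.append(0)
        component = labels[root]
        components.append(component)
        sizes[component] += 1

    return components, len(sizes), min(sizes, default=0), max(sizes, default=0)


# Toy graph with six nodes: a triangle-free chain 0-1-2, a pair 3-4,
# and the isolated node 5.
components, number, minimum, maximum = connected_components_sketch(
    6, [(0, 1), (1, 2), (3, 4)]
)
print(components, number, minimum, maximum)
```

Unpacking the result into four names, as in the last lines, mirrors how the quadruple described in the docstring would be consumed.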

Cite GraPE

Please cite the following paper if it was useful for your research:

TODO: add bibtex reference here to copy