BioGrakn

A Knowledge Graph-based Semantic Database for Biomedical Sciences

Introduction

BioGrakn is a graph-based semantic database that takes advantage of the power of knowledge graphs and machine reasoning to solve problems in the domain of biomedical science.

BioGrakn is built on top of GRAKN.AI, a distributed knowledge graph database that supports complex data modeling, verification, scaling, querying, and analysis.

For further information, you can refer to the paper "BioGrakn: A Knowledge Graph-based Semantic Database for Biomedical Sciences", presented at CISIS-2017, or to this article on DZone.

Obtain the software

You can find a ready-to-use binary version of the software here.

Choose a work directory and be sure to download the .jar and the two .gql files containing the ontology and the inference rules.

Data sources download

BioGrakn is built by integrating data available from several databases, such as NCBI Entrez Gene, Gene Ontology, Uniprot Knowledge Base, Reactome, and others.

For your convenience, all the data sources used have been collected and are available for download from http://194.119.214.173/biograkn/.

Create a destination directory on your computer, e.g., ~/datasources, and copy the downloaded files into it. Alternatively, download all the files at once with a tool of your choice. For example, with wget:

 $ wget -A .bz2 -r -nd -nv -P ~/datasources http://194.119.214.173/biograkn/

Now, uncompress the files:

 $ bunzip2 ~/datasources/*
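As a quick sanity check after decompressing, a small shell helper can confirm that no .bz2 archives are left in the data directory. This is only a sketch: the function name check_uncompressed is hypothetical and not part of BioGrakn, and it falls back to the ~/datasources directory used above when no argument is given.

```shell
# Hypothetical helper (not part of BioGrakn): report whether any .bz2
# archives are still present in the data directory after bunzip2.
check_uncompressed() {
    dir="${1:-$HOME/datasources}"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 2
    fi
    # find prints one line per leftover archive; wc -l counts those lines.
    leftover=$(find "$dir" -maxdepth 1 -type f -name '*.bz2' | wc -l | tr -d ' ')
    if [ "$leftover" -eq 0 ]; then
        echo "ok: no compressed files left in $dir"
    else
        echo "$leftover compressed file(s) still present in $dir"
        return 1
    fi
}
```

Calling check_uncompressed ~/datasources returns a non-zero status if some archives were not decompressed, so it can be chained with && before the import step.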

Load the ontology and the inference rules

With GRAKN.AI up and running, load the ontology and the inference rules:

 $ cd [YOUR-GRAKN-1.2.0-DIR]
 $ ./graql console -k biograkn -f [WORKDIR]/ontology.gql
 $ ./graql console -k biograkn -f [WORKDIR]/rules.gql
 $ cd [WORKDIR]

Note that you can load them into a dedicated keyspace, such as the biograkn keyspace used above.
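If you load the ontology and the rules more than once, for example into different keyspaces, the two commands above can be wrapped in a small function. The following is a minimal sketch, not something shipped with BioGrakn: the name load_schema and its argument order are assumptions, and it uses only the -k and -f options shown above.

```shell
# Hypothetical wrapper (not shipped with BioGrakn) around the two load
# commands shown above.
# Arguments: Grakn directory, work directory (absolute path), keyspace.
load_schema() {
    grakn_dir="$1"
    work_dir="$2"
    keyspace="${3:-biograkn}"
    # Fail early if either .gql file is missing from the work directory.
    for f in ontology.gql rules.gql; do
        if [ ! -f "$work_dir/$f" ]; then
            echo "missing file: $work_dir/$f" >&2
            return 1
        fi
    done
    # Run in a subshell so the caller's current directory is unchanged.
    (
        cd "$grakn_dir" || exit 1
        ./graql console -k "$keyspace" -f "$work_dir/ontology.gql" &&
        ./graql console -k "$keyspace" -f "$work_dir/rules.gql"
    )
}
```

The rules are loaded only if the ontology was loaded successfully, since they depend on the types it defines.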

Import data

The data import process is handled by the Java program BuildBioGrakn, contained in the previously downloaded .jar file.

Its usage is summarized by running it with the -h command-line option:

 $ java -jar BuildBioGrakn.jar -h
 usage: BuildBioGrakn
 -d <arg>   data source path
 -h         print this help
 -k <arg>   keyspace

With no options, the program uses the following default values:

name	value
data directory	~/datasources
keyspace	biograkn
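To import from a different directory or into a different keyspace, the -d and -k options listed in the help can be passed explicitly. The sketch below wraps that call in a hypothetical function, run_import (not part of BioGrakn), which falls back to the defaults from the table and checks that the data directory exists before starting Java. It assumes BuildBioGrakn.jar is in the current directory unless a third argument says otherwise.

```shell
# Hypothetical wrapper (not part of BioGrakn) around BuildBioGrakn.jar.
# Arguments: data directory, keyspace, path to the .jar (all optional).
run_import() {
    data_dir="${1:-$HOME/datasources}"
    keyspace="${2:-biograkn}"
    jar="${3:-BuildBioGrakn.jar}"
    # Avoid starting the import if the data directory is missing.
    if [ ! -d "$data_dir" ]; then
        echo "data directory not found: $data_dir" >&2
        return 1
    fi
    java -jar "$jar" -d "$data_dir" -k "$keyspace"
}
```

For example, run_import ~/datasources biograkn-test would import the same data into a separate keyspace named biograkn-test (a made-up name).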

Running the program with the default values should produce output like the following (the execution time will vary):

$ java -jar BuildBioGrakn.jar 

Building BioGrakn ...

Importing NCBI Gene ........................ done
Importing Gene Ontology ....................... done
Importing Gene2GO ......................... done
Importing miRBase .............................. done
Importing Reactome ..................... done
Importing Reactome2GO ............ done
Importing Reactome2miRNA ............ done
Importing miRCancer .............. done
Importing Uniprot ......................... done
Importing Uniprot2Reactome ......... done
Importing HGNC ........................ done
Importing miRNASNP ............ done
Importing miRTarBase .................... done

BioGrakn built in 0 hours 31 minutes 21 seconds

Now, if you open the web visualizer, querying for "all types" will display the types defined in the ontology.

Play with BioGrakn

This section shows how to start playing with BioGrakn. Two sample queries are reported:

Search for genes linked to a particular Gene Ontology annotation
Search for pathways linked to a particular gene

Search for genes linked to a particular Gene Ontology annotation

Let's consider the Gene Ontology annotation "glycerol ether metabolic process", which has GO:0006662 as its identifier.

To find the annotated genes, we match the annotation relation in which our starting identifier plays the functional annotation role. The relation points to all the annotated entities, from which we keep only the genes, printing their symbols and names.

The following Graql query returns the desired results:

	match $go has goId "GO:0006662";
		(functionalAnnotation: $go, annotatedEntity: $gene); 
		$gene isa gene; get;
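To reuse this query for other annotations, it can be generated for any GO identifier and saved to a file. The following is a minimal sketch: write_go_query is a hypothetical helper, not part of BioGrakn, and it only checks that the identifier has the usual form of "GO:" followed by seven digits.

```shell
# Hypothetical helper (not part of BioGrakn): write the query above,
# for a given GO identifier, to a .gql file.
# Arguments: GO identifier, output file.
write_go_query() {
    go_id="$1"
    out_file="$2"
    # GO identifiers consist of "GO:" followed by seven digits.
    case "$go_id" in
        GO:[0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "not a GO identifier: $go_id" >&2; return 1 ;;
    esac
    # \$ keeps the Graql variables literal; only $go_id is expanded.
    cat > "$out_file" <<EOF
match \$go has goId "$go_id";
  (functionalAnnotation: \$go, annotatedEntity: \$gene);
  \$gene isa gene; get;
EOF
}
```

The resulting file can be pasted into the Graql console or the web visualizer. Passing it to the console with the -f option used earlier may also work, but that is an assumption not confirmed here.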

Search for pathways linked to a particular gene

At first sight, this looks like the previous problem.

However, genes cannot be directly linked to pathways, because Reactome only provides pathway-to-protein associations.

Therefore, we have to go through two relations: encoding, that links genes to proteins, and containing, that links pathways to proteins.

The Graql query is then formed as follows:

(to be completed)
