Disease ontologies for knowledge graphs

Data integration to build a drug discovery knowledge graph is a challenge. There are multiple disease ontologies used in data sources and publications. Each disease ontology has its hierarchy, and the common task is to map ontologies, to find disease clusters, finally to build your representation of the disease area.

Here we present a knowledge graph solution that uses disease ontologies cross-references and allows easy switch between ontology hierarchies for data integration purpose, as well as to perform other tasks.

Load diseases into Grakn using prepared database

Unzip ./data/db.zip file => ./data/db
Use Grakn docker image with an extrenal volume

docker run -d -v ./data/db/:/grakn-core-all-linux/server/db/ -p 48555:48555 graknlabs/grakn

# Check using Grakn console
grakn console -k dokg
match $d isa disease, has efo-id "EFO_0009425", has disease-id $di; get;

or use local Grakn install with data-dir pointing to ./data/db

# Change "data-dir" in Grakn configuration file: data-dir=<full path>/data/db/:
vi /usr/local/Cellar/grakn-core/1.8.0/libexec/server/conf/grakn.properties

Grakn configuration file

Load diseases into Grakn using python scripts

Code was tested Grakn 1.8, python 3.6

Build python environment:

conda create --name graknenv python=3.6
conda activate graknenv
pip3 install grakn-client

Install Grakn
Load schema

grakn server start
grakn console -k dokg -f ./scripts/schema.gql

Load diseases

conda activate graknenv
python3 ./scripts/load_script.py

Load MONDO hierarchy

add_hierarchy.py has one parameter: ontology_name

python3 ./scripts/add_hierarchy.py MONDO
python3 ./scripts/add_hierarchy.py EFO

Load MESH hierarchy

MESH is not our primary ontology (we don't have all parental terms of it). Parental terms by default are loaded for DOID, EFO and MONDO. So, in order to load MESH hierarchy we have to add parental terms first:

python3 ./scripts/add_terms.py MESH
python3 ./scripts/add_hierarchy.py MESH

Usage examples

To get all disease ontologies ids for the disease of interest (e.g. "chronic kidney disease"):

grakn console -k dokg
match $d isa disease, has disease-name "chronic kidney disease", has disease-id $di; get;

There are two types of relations for disease-hierarchy: "disease-hierarchy" for hierarchical relation directly loaded from ontologies and "disease-hierarchy-inferred" for hierarchical relation both loaded from ontologies and inferred using Grakn logical reasoning.

To get direct children of "chronic kidney disease" using EFO ontology id ("EFO_0003884") and MONDO ontology hierarchy:

grakn console -k dokg
match $x isa disease, has efo-id 'EFO_0003884'; $o isa ontology, has ontology-name 'MONDO'; $dh (superior-disease: $x, subordinate-disease: $y, $o)  isa disease-hierarchy; $y isa disease, has disease-name $dn; get $dn;

To get all children of "chronic kidney disease" using EFO ontology id ("EFO_0003884") and MONDO ontology hierarchy:

grakn console -k dokg
match $x isa disease, has efo-id 'EFO_0003884'; $o isa ontology, has ontology-name 'MONDO'; $dh (superior-disease: $x, subordinate-disease: $y, $o)  isa disease-hierarchy-inferred; $y isa disease, has disease-name $dn; get $dn;

To get all children of "chronic kidney disease" using EFO ontology id ("EFO_0003884") regardless the hierarchy:

grakn console -k dokg
match $x isa disease, has efo-id 'EFO_0003884'; $o isa ontology; $dh (superior-disease: $x, subordinate-disease: $y, $o)  isa disease-hierarchy-inferred; $y isa disease, has disease-name $dn; get $dn;

We can get all parents of "chronic kidney disease" using EFO ontology id ("EFO_0003884") regardless the hierarchy:

grakn console -k dokg
match $y isa disease, has efo-id 'EFO_0003884'; $o isa ontology; $dh (superior-disease: $x, subordinate-disease: $y, $o)  isa disease-hierarchy-inferred; $x isa disease, has disease-name $dn; get $dn;

We can get mappings from every Mondo-ID to a Mesh-ID

python3 ./scripts/ontology_mapping_example.py

Data colllections

/data/prepared_ontologies/

Collected cross references, prepared hierarchies and additional parental terms:

cross-references.tsv
DOID_prepared_hierarchy.tsv
EFO_prepared_hierarchy.tsv
MONDO_prepared_hierarchy.tsv
Orphanet_prepared_hierarchy.tsv
Orphanet_additional_classes.tsv
HP_additional_classes.tsv
HP_prepared_hierarchy.tsv
MESH_additional_classes.tsv
MESH_prepared_hierarchy.tsv
NCIT_additional_classes.tsv
NCIT_prepared_hierarchy.tsv

cross-references.tsv - main file that contains cross-references of terms from different disease ontologies.

Statistics for cross-references.tsv file

	MESH	UMLS	EFO	NCIT	OMIM	DOID	Orphanet	HP	MONDO	ICD10	Total
# of terms only in this ontology	0	0	7	0	0	0	1	80	109	0	1186
# of preferred terms	0	0	70	24	0	5	69	75	21453	0	21696
# of references	8328	17648	4930	7067	8056	9001	9066	652	21482	11271	97501
# of unique references	8251	17591	4930	7067	8032	9001	9066	652	21482	4103	90175

Data preparation

Code was tested R version 3.6.1

R libraries:

rols
data.table

Extract hierarchy and prepare additional parental classes if not present in cross-references.tsv file

source("./scripts/data_preparation.R")
cross_references_file <- "./data/prepared_ontologies/cross_references.tsv"
hierarchy_file <- "./data/bioportal_export/DOID.csv"
ontology_name <- "DOID"
bioportal_ontological_hierarchy_preparation(cross_references_file, hierarchy_file, ontology_name)

Check for loops in cross-references.tsv file

To use if you update cross-references.tsv file

source("./scripts/data_preparation.R")
cross_references_file <- "./data/prepared_ontologies/cross_references.tsv"
cross_references_validity(cross_references_file)

How to use this system with multiple related ontologies in a domain other than disease.

The user will have to do the data preparation part from scratch.

Download/get original ontology hierarchy files.
Create cross-references file in the CSV format as shown in Github repository.
Make sure that cross-references are atomic, for example by using provided R scripts (see Data preparation section).
Load the data into an empty Grakn schema using python scripts (the process is described above)

The names in the schema are specific to the disease-oriented knowledge graph. Appropriate changes might be needed to reduce the possibility of confusion when a user is weighing queries.

natacourby/Disease_ontologies_for_knowledge_graphs