Repo for generating the SRI Reference KG

Primary language: Makefile. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

SRI Reference KG

This repository contains the workflow for generating the SRI Reference Knowledge Graph, a combination of the integrated Monarch KG and other relevant data sources.

This KG is intended to serve the following communities:

  • NCATS Biomedical Data Translator
  • National COVID Cohort Consortium (N3C)
  • Illuminating the Druggable Genome
  • KG-COVID-19

There are two ways of building the graph:

  • Read Dipper N-Triples and translate to Biolink Model
  • Read SciGraph Neo4j and translate to Biolink Model

Read Dipper N-Triples and translate to Biolink Model

This can be achieved by parsing the N-Triples through KGX.

The transform.yaml lists all the sources that are transformed as part of this workflow. Each source has its own specific properties to facilitate the parsing of the N-Triples by KGX.

The transform.yaml can be used to generate a set of TSVs for each source in the KGX interchange format.

The merge.yaml lists all the sources in TSV format (as generated by KGX) which are used in the merge process to generate an integrated KG.

Getting all the required datasets

First create a folder called data:

mkdir data && cd data

Then download all the required N-Triples to the data folder:

wget -r -nd "https://archive.monarchinitiative.org/@DATA_VERSION@/rdf/blcategories/"

Replace @DATA_VERSION@ with a valid data version from archive.monarchinitiative.org.
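The substitution can be scripted so that the version is set in one place. The following is a minimal sketch, not part of the repository: it assumes 202009 (the Makefile's default DATA_VERSION) is a valid archive version, and the variable names URL_TEMPLATE and DOWNLOAD_URL are illustrative. It prints the resulting wget command for review rather than running it.

```shell
# Data version to fetch; 202009 is the Makefile's default DATA_VERSION
# (assumption: that version is available in the archive).
DATA_VERSION="${DATA_VERSION:-202009}"

# URL template with the placeholder used in this README.
URL_TEMPLATE='https://archive.monarchinitiative.org/@DATA_VERSION@/rdf/blcategories/'

# Substitute the placeholder with the chosen version.
DOWNLOAD_URL="$(printf '%s' "${URL_TEMPLATE}" | sed "s/@DATA_VERSION@/${DATA_VERSION}/")"

# Print the command for review instead of executing it.
echo "wget -r -nd \"${DOWNLOAD_URL}\""
```

Setting DATA_VERSION in the environment before running the snippet selects a different version.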

Also be sure to get Monarch Ontologies in OBOGraph JSON form:

wget https://ci.monarchinitiative.org/view/pipelines/job/monarch-ontology-json-sri/lastSuccessfulBuild/artifact/build/monarch-ontology-sri-translator.json

And ChEBI in OBOGraph JSON form:

wget http://kg-hub.berkeleybop.io/frozen_incoming_data/chebi.json.gz

Then, compress all the files in the data folder:

pigz -p 2 -9r *
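Before moving on, it may be worth confirming that the two OBOGraph JSON downloads ended up compressed in the data folder. The sketch below is an illustration only: check_data_dir is a hypothetical helper, and the file names are inferred from the download URLs above plus the .gz suffix that pigz adds. It does not check the N-Triples files, whose names depend on the archive contents.

```shell
# Hypothetical helper: report which expected files are missing from a
# data directory. Returns the number of missing files (0 means all found).
check_data_dir() {
  dir="$1"
  missing=0
  for f in monarch-ontology-sri-translator.json.gz chebi.json.gz; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${dir}/${f}"
      missing=$((missing + 1))
    fi
  done
  return "${missing}"
}

# Example call from the repository root (assumes the folder is named data).
check_data_dir data || echo "some expected files were not found"
```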

Installing dependencies

First set up a virtual environment. Note that the kgx merge step requires Python >= 3.8:

# create a new virtual environment
python3.8 -m venv env

# activate the virtual environment
source env/bin/activate

Then install the dependencies listed in requirements.txt:

pip install -r requirements.txt
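Since the kgx merge step requires Python >= 3.8, a quick check of the active interpreter can catch a mismatch before the workflow is started. This is a minimal sketch, assuming python3 on the PATH is the interpreter in use (inside the activated virtual environment it normally is); PY_OK is an illustrative variable name.

```shell
# Ask the interpreter whether it is at least version 3.8.
# A missing python3 is treated the same as a version that is too old.
if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)'; then
  PY_OK=yes
else
  PY_OK=no
fi

echo "Python >= 3.8 available: ${PY_OK}"
```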

Running the workflow

The Makefile runs the following workflow:

  • Transform all Monarch N-Triples to KGX TSVs using kgx transform
  • Load all KGX TSVs and merge into a single graph using kgx merge
  • Create a Neo4j Docker container and load the merged graph into Neo4j using kgx neo4j-upload
  • Compress the Neo4j data directory into an archive

To run the workflow:

make all

The Makefile relies on a set of variables that drive its behavior, with the following defaults:

DATA_DIR=data
OUTPUT_DIR=data-parsed
PROCESSES=1
NEO4J_DATA_DIR=`pwd`/neo_data
SUFFIX=build
DATA_VERSION=202009
KG_VERSION=0.3.0

To override the defaults, pass the variables on the make command line:

make all SUFFIX=build_20201021 PROCESSES=4 DATA_DIR=monarch-data OUTPUT_DIR=sri-reference-kg-0.3.0 KG_VERSION=0.3.0
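Variables given on the make command line take precedence over ordinary assignments inside a Makefile, which is why the override above works. The sketch below demonstrates this with a throwaway Makefile, not the repository's own. It assumes GNU make is installed; the two variables are borrowed from the defaults list purely for illustration, and how the real Makefile assigns them is not shown here.

```shell
# Write a throwaway Makefile with two default variables (illustrative only).
demo_dir="$(mktemp -d)"
cat > "${demo_dir}/Makefile" <<'EOF'
PROCESSES = 1
SUFFIX = build
all: ; @echo "PROCESSES=$(PROCESSES) SUFFIX=$(SUFFIX)"
EOF

# Run once with the defaults and once with command-line overrides.
DEFAULT_OUT="$(make -s -f "${demo_dir}/Makefile" all)"
OVERRIDE_OUT="$(make -s -f "${demo_dir}/Makefile" all PROCESSES=4 SUFFIX=build_20201021)"

echo "defaults:  ${DEFAULT_OUT}"
echo "overrides: ${OVERRIDE_OUT}"

# Clean up the temporary directory.
rm -rf "${demo_dir}"
```

Any variable left off the command line keeps its default.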

Note: To run the pipeline end-to-end, you need a machine with at least 8 CPU cores and 100 GB of memory.