Roger is an automated graph data curation pipeline.
The first workflow transforms Knowledge Graph eXchange (KGX) files into a graph database in phases:
- get: Fetch KGX files from a repository.
- merge: Merge duplicate nodes accross multiple KGX files.
- schema: Infer the schema properties of nodes and edges.
- bulk create: Format for bulk load to Redisgraph.
- bulk load: Load into Redisgraph
- validate: Execute test queries to validate the bulk load.
Requires Python 3.7+, Docker, and Make.
Also requires KGX fork with Redisgraph Transformer.
$ git clone https://github.com/stevencox/kgx
$ git clone <this repo>
$ cd <this repo>
$ pip install requirements.txt
$ bin/roger all
Roger can also be run via a Makefile:
cd bin
make clean install validate
You can quickly set up the required dependencies and spin up all the necessary services with:
make install
make stack
Without using make, you can run the necessary commands directly on the shell:
mkdir -p {logs,plugins}
mkdir -p local_storage/elastic
docker-compose up airflow-init
docker-compose up
Roger's is designed to transform data through well defined and transparent phases.
In general, each phase
- Reads and writes a set of files.
- Manages data in a single, configurable, root directory.
Configuration is at roger/config.yaml.
Roger can load Redisgraph
- By running the RedisgraphTransformer (currently on a fork of KGX)
- Currently, this is very slow.
- By bulk loading Redisgraph
To build a bulk load, we
- Ensure no duplicate nodes exist
- Preserve all properties present across duplicate nodes
- Ensure all nodes of the same type have exactly the same properties
- Generate a comprehensive header (schema) for all nodes and edges These constraints are managed in the steps below.
Fetches KGX files according to a data version selecting the set of files to use.
Merges nodes duplicated across files aggregating properties from all nodes
Identify and record the schema (properties) of every edge and node type. Schema records the type resolved for each property of a node/edge. The Schema step generates category schema file for node schema and predicate schema for edges. In these files properties are collected and scoped based on type of the edges and nodes found. For instances where properties do not have consistent data type across a given scope, the following rule is used to resolve to final data type:
- If the property has fluctuating type among a boolean, a float or an Integer in the same scope, it's final data type would be a string.
- If conflicting property is ever a string but never a list in the scope, it's final data type will be string.
- If conflicting property is ever a list , it's final data type will be a list.
Using this approach attributes will be casted based on the resolution set here when loading to the graph database in subsequent steps.
Create bulk load CSV files conforming to the Redisgraph Bulk Loader's requirements.
Bulk create uses the Schema generated in Schema step to generate csv headers
(redis csv headers) with
the assumed types . Currently redis bulk loader requires every column to have a value.
To address this issue, this step groups the entities being processed (edges/nodes)
based on attributes that have values. Then these groups are written into separate csv files. Nodes
are written as csv(s) under <roger-data-dir>/bulk/nodes
and edges under <roger-data-dir>/bulk/edges
.
Each csv with these folders has the following naming convention
<entity-type>.csv-<group_index>-<uniqueness-index>
.
When populating the CSV with values, the appropriate casting is done on the properties to normalize
them to the data types defined in the Schema step.
Use the bulk loader to load Redisgraph logging statistics on each type of loaded object.
Runs a configurable list of queries with timing information to quality check the generated graph database.
Roger uses Redisgraph's new bulk loader which is available in the 'edge' tagged Docker image.
You can run the container like this and use it immediately
docker run -p 6379:6379 -it --rm --name redisgraph redislabs/redisgraph:edge
or run it with /bin/bash
at the end to get a shell like this:
docker run -p 6379:6379 -it --rm --name redisgraph redislabs/redisgraph:edge /bin/bash
This lets you have a look around inside the container. To start Redis with the graph database plugin:
# redis-server --loadmodule /usr/lib/redis/modules/redisgraph.so
A clean Roger build looks like this. Times below are on a Macbook Air.
This can be run in the bin directory as
$ make clean install validate
Or via the roger CLI
$ ../bin/roger all
[roger][core.py][ get] DEBUG: wrote data/kgx/chembio_kgx-v0.1.json: edges: 21637 nodes: 8725 time: 13870
[roger][core.py][ get] DEBUG: wrote data/kgx/chemical_normalization-v0.1.json: edges: 277030 nodes: 72963 time: 15455
[roger][core.py][ get] DEBUG: wrote data/kgx/cord19-phenotypes-v0.1.json: edges: 24 nodes: 25 time: 392
[roger][core.py][ get] DEBUG: wrote data/kgx/ctd-v0.1.json: edges: 48363 nodes: 24008 time: 7143
[roger][core.py][ get] DEBUG: wrote data/kgx/foodb-v0.1.json: edges: 5429 nodes: 4536 time: 1974
[roger][core.py][ get] DEBUG: wrote data/kgx/mychem-v0.1.json: edges: 123119 nodes: 5496 time: 12271
[roger][core.py][ get] DEBUG: wrote data/kgx/pharos-v0.1.json: edges: 287750 nodes: 224349 time: 40150
[roger][core.py][ get] DEBUG: wrote data/kgx/topmed-v0.1.json: edges: 63860 nodes: 15870 time: 10901
real 1m58.722s
user 1m4.472s
sys 0m4.625s
[roger][core.py][ merge] INFO: merging data/kgx/chembio_kgx-v0.1.json
[roger][core.py][ merge] DEBUG: merged data/kgx/chemical_normalization-v0.1.json load: 1377 scope: 60 merge: 39
[roger][core.py][ merge] DEBUG: merged data/kgx/cord19-phenotypes-v0.1.json load: 118 scope: 38 merge: 0
[roger][core.py][ merge] DEBUG: merged data/kgx/ctd-v0.1.json load: 1151 scope: 26 merge: 19
[roger][core.py][ merge] DEBUG: merged data/kgx/foodb-v0.1.json load: 141 scope: 24 merge: 1
[roger][core.py][ merge] DEBUG: merged data/kgx/mychem-v0.1.json load: 1763 scope: 9 merge: 5
[roger][core.py][ merge] DEBUG: merged data/kgx/pharos-v0.1.json load: 8426 scope: 218 merge:126
[roger][core.py][ merge] DEBUG: merged data/kgx/topmed-v0.1.json load: 873 scope: 194 merge: 4
[roger][core.py][ merge] INFO: data/kgx/chembio_kgx-v0.1.json rewrite: 1323. total merge time: 62921
[roger][core.py][ merge] INFO: merge data/merge/chemical_normalization-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/cord19-phenotypes-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/ctd-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/foodb-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/mychem-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/pharos-v0.1.json is up to date.
[roger][core.py][ merge] INFO: merge data/merge/topmed-v0.1.json is up to date.
real 1m8.211s
user 0m53.546s
sys 0m3.894s
[roger][core.py][ is_up_to_date] DEBUG: no targets found
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/chembio_kgx-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/chemical_normalization-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/cord19-phenotypes-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/ctd-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/foodb-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/mychem-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/pharos-v0.1.json.
[roger][core.py][ create_schema] DEBUG: analyzing schema of data/kgx/topmed-v0.1.json.
[roger][core.py][ write_schema] INFO: writing schema: data/schema/predicate-schema.json
[roger][core.py][ write_schema] INFO: writing schema: data/schema/category-schema.json
real 0m46.205s
user 0m34.701s
sys 0m3.237s
[roger][core.py][ is_up_to_date] DEBUG: no targets found
[roger][core.py][ create] INFO: processing data/merge/chembio_kgx-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/chemical_substance.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/gene.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/named_thing.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/directly_interacts_with.csv
[roger][core.py][ create] INFO: processing data/merge/chemical_normalization-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/similar_to.csv
[roger][core.py][ create] INFO: processing data/merge/cord19-phenotypes-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/disease.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/phenotypic_feature.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/has_phenotype.csv
[roger][core.py][ create] INFO: processing data/merge/ctd-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/treats.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/contributes_to.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_activity_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_molecular_interaction.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_activity_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_localization_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_expression_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_response_to.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_molecular_interaction.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_degradation_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_activity_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_localization_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_localization_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_secretion_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_secretion_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_response_to.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_response_to.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_synthesis_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_transport_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_mutation_rate_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_metabolic_processing_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_metabolic_processing_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_metabolic_processing_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_degradation_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_synthesis_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_molecular_modification_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_molecular_modification_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_synthesis_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_expression_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_stability_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/molecularly_interacts_with.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_degradation_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_uptake_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_mutation_rate_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/increases_stability_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_expression_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_secretion_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_uptake_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_transport_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/decreases_transport_of.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/affects_uptake_of.csv
[roger][core.py][ create] INFO: processing data/merge/foodb-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/related_to.csv
[roger][core.py][ create] INFO: processing data/merge/mychem-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/causes_adverse_event.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/causes.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/Unmapped_Relation.csv
[roger][core.py][ create] INFO: processing data/merge/pharos-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/gene_associated_with_condition.csv
[roger][core.py][ create] INFO: processing data/merge/topmed-v0.1.json
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/cell.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/molecular_activity.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/anatomical_entity.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/cellular_component.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/nodes/biological_process.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/association.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/has_part.csv
[roger][core.py][ write_bulk] INFO: --creating data/bulk/edges/part_of.csv
real 1m7.897s
user 0m58.467s
sys 0m2.791s
[roger][core.py][ insert] INFO: bulk loading
nodes: ['data/bulk/nodes/gene.csv', 'data/bulk/nodes/molecular_activity.csv', 'data/bulk/nodes/phenotypic_feature.csv', 'data/bulk/nodes/cell.csv', 'data/bulk/nodes/biological_process.csv', 'data/bulk/nodes/chemical_substance.csv', 'data/bulk/nodes/cellular_component.csv', 'data/bulk/nodes/anatomical_entity.csv', 'data/bulk/nodes/named_thing.csv', 'data/bulk/nodes/disease.csv']
edges: ['data/bulk/edges/part_of.csv', 'data/bulk/edges/decreases_metabolic_processing_of.csv', 'data/bulk/edges/decreases_uptake_of.csv', 'data/bulk/edges/decreases_secretion_of.csv', 'data/bulk/edges/decreases_molecular_modification_of.csv', 'data/bulk/edges/increases_synthesis_of.csv', 'data/bulk/edges/causes_adverse_event.csv', 'data/bulk/edges/decreases_localization_of.csv', 'data/bulk/edges/decreases_stability_of.csv', 'data/bulk/edges/treats.csv', 'data/bulk/edges/affects_activity_of.csv', 'data/bulk/edges/increases_secretion_of.csv', 'data/bulk/edges/decreases_expression_of.csv', 'data/bulk/edges/affects_transport_of.csv', 'data/bulk/edges/Unmapped_Relation.csv', 'data/bulk/edges/affects_localization_of.csv', 'data/bulk/edges/increases_stability_of.csv', 'data/bulk/edges/decreases_activity_of.csv', 'data/bulk/edges/increases_response_to.csv', 'data/bulk/edges/causes.csv', 'data/bulk/edges/decreases_degradation_of.csv', 'data/bulk/edges/similar_to.csv', 'data/bulk/edges/decreases_synthesis_of.csv', 'data/bulk/edges/affects_expression_of.csv', 'data/bulk/edges/affects_uptake_of.csv', 'data/bulk/edges/has_part.csv', 'data/bulk/edges/affects_synthesis_of.csv', 'data/bulk/edges/affects_response_to.csv', 'data/bulk/edges/increases_molecular_interaction.csv', 'data/bulk/edges/increases_localization_of.csv', 'data/bulk/edges/increases_expression_of.csv', 'data/bulk/edges/increases_uptake_of.csv', 'data/bulk/edges/related_to.csv', 'data/bulk/edges/increases_mutation_rate_of.csv', 'data/bulk/edges/affects.csv', 'data/bulk/edges/decreases_transport_of.csv', 'data/bulk/edges/gene_associated_with_condition.csv', 'data/bulk/edges/directly_interacts_with.csv', 'data/bulk/edges/increases_metabolic_processing_of.csv', 'data/bulk/edges/molecularly_interacts_with.csv', 'data/bulk/edges/increases_degradation_of.csv', 'data/bulk/edges/affects_metabolic_processing_of.csv', 'data/bulk/edges/has_phenotype.csv', 'data/bulk/edges/decreases_response_to.csv', 'data/bulk/edges/decreases_molecular_interaction.csv', 'data/bulk/edges/increases_activity_of.csv', 'data/bulk/edges/association.csv', 'data/bulk/edges/affects_secretion_of.csv', 'data/bulk/edges/decreases_mutation_rate_of.csv', 'data/bulk/edges/contributes_to.csv', 'data/bulk/edges/increases_transport_of.csv', 'data/bulk/edges/increases_molecular_modification_of.csv', 'data/bulk/edges/affects_degradation_of.csv']
[roger][core.py][ insert] INFO: deleting graph test in preparation for bulk load.
[roger][core.py][ insert] INFO: no graph to delete
[roger][core.py][ insert] INFO: bulk loading graph: test
gene [####################################] 100%
17868 nodes created with label 'gene'
3 nodes created with label 'molecular_activity'
phenotypic_feature [####################################] 100%
3723 nodes created with label 'phenotypic_feature'
8 nodes created with label 'cell'
2 nodes created with label 'biological_process'
chemical_substance [####################################] 100%
252966 nodes created with label 'chemical_substance'
2 nodes created with label 'cellular_component'
12 nodes created with label 'anatomical_entity'
named_thing [####################################] 100%
25903 nodes created with label 'named_thing'
disease [####################################] 100%
9777 nodes created with label 'disease'
part_of [####################################] 100%
31532 relations created for type 'part_of'
24 relations created for type 'decreases_metabolic_processing_of'
26 relations created for type 'decreases_uptake_of'
192 relations created for type 'decreases_secretion_of'
13 relations created for type 'decreases_molecular_modification_of'
186 relations created for type 'increases_synthesis_of'
causes_adverse_event [####################################] 100%
66461 relations created for type 'causes_adverse_event'
39 relations created for type 'decreases_localization_of'
12 relations created for type 'decreases_stability_of'
treats [####################################] 100%
11485 relations created for type 'treats'
307 relations created for type 'affects_activity_of'
527 relations created for type 'increases_secretion_of'
decreases_expression_of [####################################] 100%
2791 relations created for type 'decreases_expression_of'
91 relations created for type 'affects_transport_of'
28 relations created for type 'Unmapped_Relation'
506 relations created for type 'affects_localization_of'
48 relations created for type 'increases_stability_of'
decreases_activity_of [####################################] 100%
240317 relations created for type 'decreases_activity_of'
762 relations created for type 'increases_response_to'
causes [####################################] 100%
46277 relations created for type 'causes'
69 relations created for type 'decreases_degradation_of'
similar_to [####################################] 100%
277030 relations created for type 'similar_to'
21 relations created for type 'decreases_synthesis_of'
259 relations created for type 'affects_expression_of'
19 relations created for type 'affects_uptake_of'
has_part [####################################] 100%
31532 relations created for type 'has_part'
42 relations created for type 'affects_synthesis_of'
1804 relations created for type 'affects_response_to'
1495 relations created for type 'increases_molecular_interaction'
119 relations created for type 'increases_localization_of'
increases_expression_of [####################################] 100%
4178 relations created for type 'increases_expression_of'
118 relations created for type 'increases_uptake_of'
related_to [####################################] 100%
5429 relations created for type 'related_to'
564 relations created for type 'increases_mutation_rate_of'
116 relations created for type 'affects'
17 relations created for type 'decreases_transport_of'
gene_associated_with_condition [####################################] 100%
36017 relations created for type 'gene_associated_with_condition'
directly_interacts_with [####################################] 100%
30826 relations created for type 'directly_interacts_with'
467 relations created for type 'increases_metabolic_processing_of'
49 relations created for type 'molecularly_interacts_with'
increases_degradation_of [####################################] 100%
3394 relations created for type 'increases_degradation_of'
337 relations created for type 'affects_metabolic_processing_of'
24 relations created for type 'has_phenotype'
904 relations created for type 'decreases_response_to'
513 relations created for type 'decreases_molecular_interaction'
increases_activity_of [####################################] 100%
12061 relations created for type 'increases_activity_of'
796 relations created for type 'association'
242 relations created for type 'affects_secretion_of'
1 relations created for type 'decreases_mutation_rate_of'
contributes_to [####################################] 100%
16172 relations created for type 'contributes_to'
153 relations created for type 'increases_transport_of'
54 relations created for type 'increases_molecular_modification_of'
24 relations created for type 'affects_degradation_of'
Construction of graph 'test' complete: 310264 nodes created, 826470 relations created in 268.800857 seconds
real 4m31.889s
user 2m50.070s
sys 0m4.201s
config:{
"username": "",
"password": "",
"host": "localhost",
"graph": "test",
"ports": {
"http": 6379
}
}
+-------------+
| b'COUNT(a)' |
+-------------+
| 310264 |
+-------------+
Cached execution 0.0
internal execution time 19.4111
[roger][core.py][ validate] INFO: Query count_nodes:Count Nodes ran in 39ms: MATCH (a) RETURN COUNT(a)
+-------------+
| b'COUNT(e)' |
+-------------+
| 826470 |
+-------------+
Cached execution 0.0
internal execution time 6.2872
[roger][core.py][ validate] INFO: Query count_edges:Count Edges ran in 14ms: MATCH (a)-[e]-(b) RETURN COUNT(e)
+-----------------+-------------------------------+
| b'a.category' | b'b.id' |
+-----------------+-------------------------------+
| ['named_thing'] | NCBIGene:5978 |
| ['named_thing'] | GO:0043336 |
| ['named_thing'] | CHEBI:24433 |
| ['named_thing'] | UBERON:0000178 |
| ['named_thing'] | TOPMED.VAR:phv00177354.v2.p10 |
| ['named_thing'] | TOPMED.VAR:phv00003307.v1.p10 |
| ['named_thing'] | TOPMED.VAR:phv00010123.v5.p10 |
...about 400 lines elided here...
| ['named_thing'] | TOPMED.VAR:phv00001046.v1.p10 |
| ['named_thing'] | TOPMED.VAR:phv00116572.v2.p2 |
| ['named_thing'] | TOPMED.VAR:phv00210333.v1.p1 |
| ['named_thing'] | TOPMED.VAR:phv00307964.v1.p1 |
| ['named_thing'] | TOPMED.VAR:phv00083411.v1.p3 |
+-----------------+-------------------------------+
Cached execution 0.0
internal execution time 656.6031
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 725ms: MATCH (a { id : 'TOPMED.TAG:8' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:64 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 123.0736
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 129ms: MATCH (a { id : 'TOPMED.VAR:phv00000484.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:30 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 111.434
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 116ms: MATCH (a { id : 'TOPMED.VAR:phv00000487.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:74 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 110.0168
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 113ms: MATCH (a { id : 'TOPMED.VAR:phv00000496.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:26 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 118.366
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 122ms: MATCH (a { id : 'TOPMED.VAR:phv00000517.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:40 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 120.7783
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 128ms: MATCH (a { id : 'TOPMED.VAR:phv00000518.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
| ['named_thing'] | TOPMED.TAG:7 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 115.9252
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 120ms: MATCH (a { id : 'TOPMED.VAR:phv00000528.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:8 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 164.6341
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 170ms: MATCH (a { id : 'TOPMED.VAR:phv00000529.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
| ['named_thing'] | TOPMED.TAG:7 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 137.6454
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 144ms: MATCH (a { id : 'TOPMED.VAR:phv00000530.v1.p10' })--(b) RETURN a.category, b.id
+-----------------+--------------------------------+
| b'a.category' | b'b.id' |
+-----------------+--------------------------------+
| ['named_thing'] | TOPMED.TAG:8 |
| ['named_thing'] | TOPMED.STUDY:phs000007.v29.p10 |
+-----------------+--------------------------------+
Cached execution 0.0
internal execution time 138.8376
[roger][core.py][ validate] INFO: Query connectivity:TOPMED Connectivity ran in 143ms: MATCH (a { id : 'TOPMED.VAR:phv00000531.v1.p10' })--(b) RETURN a.category, b.id
+-------------+-------------+
| b'count(a)' | b'count(b)' |
+-------------+-------------+
| 1295945 | 1295945 |
+-------------+-------------+
Cached execution 0.0
internal execution time 1661.8929
[roger][core.py][ validate] INFO: Query count_connected_nodes:Count Connected Nodes ran in 1666ms: MATCH (a)-[e]-(b) RETURN count(a), count(b)
+-----------------------+-----------------------+
| b'count(distinct(a))' | b'count(distinct(b))' |
+-----------------------+-----------------------+
| 12156 | 196144 |
+-----------------------+-----------------------+
Cached execution 0.0
internal execution time 3538.9259
[roger][core.py][ validate] INFO: Query query_by_type:Query by Type ran in 3543ms: MATCH (a:gene)-[e]-(b) WHERE 'chemical_substance' IN b.category RETURN count(distinct(a)), count(distinct(b))
This is a local run of Roger in Airflow. Next steps: Kubernetes.
Detailed feedback for each task is available including output logs
In one window:
airflow scheduler
In another:
airflow webserver -p 8080
Open localhost:8080 in a browser.
Then run:
python tranql_translate.py
The Airflow interface shows the workflow:
Use the Trigger icon to run the workflow immediately.
Roger supports installing on kubernetes via Helm.
Create a pvc(roger-data-pvc) for storing roger Data with the following definition.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: roger-data-pvc
spec:
storageClassName: <storage-class>
accessModes:
- ReadWriteMany
resources:
requests:
storage: <size>
Then run :
kubectl -n <NAMESPACE> create -f pvc.yaml
There are two secrets for airflow required for Git syncronization.
This is used by airflow.airflow.config.AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME
kind: Secret
apiVersion: v1
metadata:
name: airflow-secrets
data:
gitSshKey: >-
<private-key-base64-encoded>
type: Opaque
This used by airflow.dags.git.secret
kind: Secret
apiVersion: v1
metadata:
name: airflow-git-keys
data:
id_rsa: <private-key-base64-encoded>
id_rsa.pub: <public-key-base64-encoded>
known_hosts: <known-hosts>
type: Opaque
Navigate to roger/bin
dir, and run roger init
. This will initialize helm dependencies for airflow helm repo)
and redis helm repo.
cd bin/
export NAMESPACE=<your_namespace, default>
export RELEASE_NAME=<install_name, airflow>
export CLUSTER_DOMAIN=cluster.local
./roger init
Run and flow the notes to access the servers.
./roger start
In the Notes a port forward command should be printed. Use that to access airflow UI and run the following steps to run Roger workflow.
The Airflow interface shows the workflow:
Press Trigger to get to the following page:
Enter the configuration parameters to get to Redis cluster installed in step 2:
{"redisgraph": {"host": "<redis-master-service-name>", "port": 6379 , "graph" : "graph-name" }}
And run work flow.
To shutdown and remove the setup from k8s:
./roger stop
To restart the setup:
./roger restart