
To explore the versatility of database management systems, this project uses Neo4j together with Spark to establish connections between nodes and to optimize queries that answer user questions.

🚀 This project examines graph-based data storage for managing and connecting nodes of four kinds (Compounds, Diseases, Genes, and Anatomies), with an interactive graphical user interface for queries.

📖 Files

  • The "nodes_test.tsv" file contains over 20,000 nodes of these four kinds, each with attributes such as ID, Name, and Kind. For example:

    ID                        Name             Kind
    Anatomy::UBERON:0000042   serous membrane  Anatomy
    Compound::DB00396         Progesterone     Compound

  • The "edges_test.tsv" file contains over 1M edges, each linking a source node to a target node and labeled with a relationship type. The relationship types are described in the "metaedges.tsv" file. For example:

    Metaedge                        abbreviation  edges   source_nodes  target_nodes  unbiased
    Anatomy - downregulates - Gene  AdG           102240  36            15097         102240
    Anatomy - expresses - Gene      AeG           526407  241           18094         453477
    Anatomy - upregulates - Gene    AuG           97848   36            15929         97848


  • Files can be hosted on a local Python server using: python3 -m http.server

  • The Neo4j server requires authentication and must remain running while data is loaded and queries are executed.

  • When creating a database, load the data only once for both nodes and edges. Running the load statements again creates duplicates.

  • To execute a query from the terminal:

    • run: python3 projectBD.py <"QUERY SELECTED">
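
As a rough illustration of how a command-line argument might be mapped to one of the Cypher queries in this README, here is a minimal sketch. The names QUERIES and build_query, the query keys, and the use of a $id parameter are assumptions made for illustration. They do not describe the actual contents of projectBD.py.

```python
# Hypothetical sketch, not the actual projectBD.py.
# Maps a short query name to a parameterized Cypher string.
QUERIES = {
    "compounds": (
        "MATCH (n:Data)-[:CpD|CtD]->(b:Data WHERE b.id = $id) RETURN n"
    ),
    "genes": (
        "MATCH (a:Data WHERE a.id = $id)-[:DaG]->"
        "(n:Data WHERE n.name = 'Gene') RETURN n"
    ),
    "anatomy": (
        "MATCH (a:Data WHERE a.id = $id)-[:DlA]->(n:Data) RETURN n"
    ),
}


def build_query(name, disease_id):
    """Return (cypher, params) for a known query name; raise on unknown names."""
    if name not in QUERIES:
        raise ValueError(f"unknown query {name!r}; choose from {sorted(QUERIES)}")
    return QUERIES[name], {"id": disease_id}


if __name__ == "__main__":
    cypher, params = build_query("compounds", "Disease::DOID:7148")
    print(cypher)
    print(params)
```

Passing the disease ID as a parameter rather than pasting it into the query text avoids quoting problems. The resulting pair could then be handed to a Neo4j driver session, for example with session.run(cypher, params).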

Example Node Structure


  • To find possible treatments for a Disease that has no direct connection to any Compound, start from the Anatomy in which the Disease localizes and collect the Genes that this Anatomy up-regulates or down-regulates. Then look for Compounds that regulate the same Genes in the opposite direction (a Compound that down-regulates a Gene the Anatomy up-regulates, or the reverse). This would create the following graph.
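
The traversal described above can be sketched in plain Python on a toy edge list. This is only an illustration under stated assumptions: the node IDs (D1, A1, G1, and so on) are made up, the helper names targets and candidate_compounds are hypothetical, and the opposite-direction rule is paired explicitly here (AuG with CdG, AdG with CuG). It is not the project's Cypher query.

```python
# Toy edge list using the metaedge abbreviations from this project.
# All node IDs below are made up for illustration.
EDGES = [
    ("Disease::D1", "DlA", "Anatomy::A1"),   # disease localizes in anatomy
    ("Anatomy::A1", "AuG", "Gene::G1"),      # anatomy up-regulates G1
    ("Anatomy::A1", "AdG", "Gene::G2"),      # anatomy down-regulates G2
    ("Compound::C1", "CdG", "Gene::G1"),     # opposite of AuG: candidate
    ("Compound::C2", "CuG", "Gene::G1"),     # same direction: not a candidate
    ("Compound::C3", "CuG", "Gene::G2"),     # opposite of AdG ...
    ("Compound::C3", "CtD", "Disease::D1"),  # ... but already treats D1
]

# Anatomy-to-gene edge type -> compound-to-gene edge type that opposes it.
OPPOSITE = {"AuG": "CdG", "AdG": "CuG"}


def targets(source, kinds):
    """Return (edge kind, target) pairs leaving `source` with a kind in `kinds`."""
    return {(k, t) for s, k, t in EDGES if s == source and k in kinds}


def candidate_compounds(disease):
    """Compounds that oppose the anatomy's gene regulation and are not
    already linked to the disease by a treats/palliates edge."""
    known = {s for s, k, t in EDGES if t == disease and k in ("CtD", "CpD")}
    found = set()
    for _, anatomy in targets(disease, {"DlA"}):
        for kind, gene in targets(anatomy, set(OPPOSITE)):
            wanted = OPPOSITE[kind]
            for s, k, t in EDGES:
                if k == wanted and t == gene and s not in known:
                    found.add(s)
    return sorted(found)


print(candidate_compounds("Disease::D1"))  # ['Compound::C1']
```

In this toy graph only C1 qualifies: C2 regulates G1 in the same direction as the Anatomy, and C3 is excluded because it already treats the Disease.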


The following Cypher queries each solve a specific part of the project, using Neo4j as a graph-based NoSQL store.

Return Disease Name

MATCH (n:Data WHERE n.name='Disease' AND 
n.id='Disease::DOID:8577') RETURN n.dataName

Return Compounds that Palliate or Treat Disease

MATCH m=(n:Data)-[:CpD|CtD]->(b:Data where 
b.id='Disease::DOID:7148') RETURN n

Return Genes that Cause this Disease

MATCH p=(a:Data WHERE a.id='Disease::DOID:7148')
-[r:DaG]->(n:Data where n.name ='Gene') RETURN n

Return Where Disease Occurs

MATCH p=(a:Data WHERE a.id='Disease::DOID:7148')
-[r:DlA]->(n:Data) RETURN n

Potential Cures to Diseases

MATCH (d:Data WHERE d.name='Disease')-[:DlA]->(a:Data WHERE a.name='Anatomy')-[:AuG|AdG]->(g:Data WHERE g.name='Gene')
WITH d, a, g
MATCH (n:Data WHERE n.name='Compound')-[:CdG|CuG]->(f:Data WHERE f.name='Gene' AND f.id=g.id)
WITH d, a, g, n
MATCH (n) WHERE NOT (n)-[:CtD|CpD]->(d)
RETURN n


Load Nodes

LOAD CSV WITH HEADERS FROM "http://localhost:8000/nodes_test.tsv" AS row FIELDTERMINATOR "\t"
CREATE (n:Data {name: row.kind, id: row.id, dataName: row.name})
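
Before loading, it can help to check that the node file parses as expected. The sketch below counts nodes per kind from a small inline sample. The lowercase header names (id, name, kind) are an assumption chosen to match the row.id, row.name, and row.kind references in the load statement, and count_kinds is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Inline sample built from the two example rows in this README.
# Lowercase headers are an assumption matching row.id / row.name / row.kind.
SAMPLE = (
    "id\tname\tkind\n"
    "Anatomy::UBERON:0000042\tserous membrane\tAnatomy\n"
    "Compound::DB00396\tProgesterone\tCompound\n"
)


def count_kinds(handle):
    """Count rows per value of the 'kind' column in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["kind"] for row in reader)


counts = count_kinds(io.StringIO(SAMPLE))
print(dict(counts))  # {'Anatomy': 1, 'Compound': 1}
```

For the real file, the same function could be called on open("nodes_test.tsv", newline="") instead of the inline sample.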


Load Edges

LOAD CSV WITH HEADERS FROM "http://localhost:8000/edges_test.tsv" AS row FIELDTERMINATOR "\t"
WITH row
WHERE row.source IS NOT NULL AND row.target IS NOT NULL AND row.metaedge IS NOT NULL
MERGE (s:Data {id: row.source})
MERGE (t:Data {id: row.target})
WITH s, t, row
CALL apoc.create.relationship(s, row.metaedge, {}, t) YIELD rel
RETURN count(rel)
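
The filter in the edge load skips rows that lack a source, target, or metaedge. A rough Python equivalent, useful for previewing how many rows would be skipped, is sketched below. The sample rows and the Gene IDs are made up, split_rows is a hypothetical helper, and treating an empty field as missing is an assumption about how the file is shaped.

```python
import csv
import io

# Made-up sample in the shape the edge load expects:
# tab-separated, with source, metaedge, and target columns.
SAMPLE = (
    "source\tmetaedge\ttarget\n"
    "Anatomy::UBERON:0000042\tAeG\tGene::1\n"
    "Compound::DB00396\t\tGene::2\n"   # missing metaedge
    "\tCuG\tGene::3\n"                 # missing source
)

REQUIRED = ("source", "metaedge", "target")


def split_rows(handle):
    """Separate rows with all three fields present from incomplete ones."""
    good, bad = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        complete = all(row.get(key) for key in REQUIRED)
        (good if complete else bad).append(row)
    return good, bad


good, bad = split_rows(io.StringIO(SAMPLE))
print(len(good), "loadable rows,", len(bad), "skipped")  # 1 loadable rows, 2 skipped
```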