The code takes genetic and phenotype data in JSON, VCF and CSV format and converts it into CSV files that represent Nodes and Relationships, which can then be used to populate Pheno4J using the Neo4j bulk CSV import tool.
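To illustrate what "CSV files that represent Nodes and Relationships" can look like, here is a minimal Python sketch (not the project's Java implementation) that writes one node file per label and one relationship file, using the header conventions of the Neo4j bulk import tool (`:ID`, `:START_ID`, `:END_ID`, `:TYPE`). The file names, the function `write_import_csvs` and the relationship type `PRESENT_IN` are hypothetical and chosen only for illustration; the files actually produced by Pheno4J may differ.

```python
import csv
import os


def write_import_csvs(genotypes, out_dir):
    """Write node and relationship CSVs for (variant_id, person_id) pairs.

    All file names and the relationship type below are hypothetical.
    """
    pairs = sorted(set(genotypes))
    variants = sorted({variant for variant, _ in pairs})
    persons = sorted({person for _, person in pairs})

    files = {
        # Node files: one ID column, scoped to an ID space per label.
        "GeneticVariant.csv": (["variantId:ID(GeneticVariant)"],
                               [[v] for v in variants]),
        "Person.csv": (["personId:ID(Person)"],
                       [[p] for p in persons]),
        # Relationship file: start node, end node and relationship type.
        "GeneticVariantToPerson.csv": (
            [":START_ID(GeneticVariant)", ":END_ID(Person)", ":TYPE"],
            [[v, p, "PRESENT_IN"] for v, p in pairs],
        ),
    }
    for name, (header, rows) in files.items():
        with open(os.path.join(out_dir, name), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
```

Calling `write_import_csvs([("22-51171497-G-A", "person1")], "out")` would write three small files into an existing `out` directory, one row of data each.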
Only two publicly available datasets are required:
Example datasets specified in config.properties:
- VCF file which contains genotypes (example)
- VEP JSON file (example)
- Individuals with HPO terms as CSV file (example)
The local version cannot handle a very large dataset efficiently, since it does not have access to the configuration for the page cache and JVM size. It should therefore only be used for testing.
- Java 1.8
- Maven 3
Download the code, build the database, load the test data referenced in config.properties and start the server on port 7474:
git clone https://github.com/phenopolis/pheno4j.git
cd pheno4j
mvn clean compile -P build-graph,run-neo4j
Once the server is running, it can be queried either through the web interface at http://localhost:7474/ or with curl HTTP requests from the command line (see next section).
The curl queries return data in JSON format, so the response can be parsed using jq.
For example, get the count of variants shared between person1 and person2:
curl -H "Content-Type: application/json" -d '{
"query": "WITH [$p1,$p2] AS persons MATCH (p:Person)<-[]-(v:GeneticVariant) WHERE p.personId IN persons WITH v, count(*) as c, persons WHERE c = size(persons) RETURN count(v.variantId);",
"params":{"p1":"person1","p2":"person2"}
}' http://localhost:7474/db/data/cypher
Get ids of persons with variant 22-51171497-G-A:
curl -H "Content-Type: application/json" -d '{
"query": "MATCH (gv:GeneticVariant)-[]->(p:Person) WHERE gv.variantId = $var RETURN p.personId;",
"params":{"var":"22-51171497-G-A"}
}' http://localhost:7474/db/data/cypher
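The same requests can be issued from a script instead of curl. The following is a minimal sketch using only the Python standard library; the helper names (`build_cypher_request`, `first_column`, `run_cypher`) are hypothetical, and the parsing step assumes the endpoint answers with a JSON object of the form `{"columns": [...], "data": [[...], ...]}`, which should be checked against your Neo4j version.

```python
import json
import urllib.request

# Endpoint taken from the curl examples above.
CYPHER_URL = "http://localhost:7474/db/data/cypher"


def build_cypher_request(query, params, url=CYPHER_URL):
    """Build (but do not send) a POST request carrying a Cypher query."""
    body = json.dumps({"query": query, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_column(response_text):
    """Return the first column of every row.

    Assumption: the response looks like {"columns": [...], "data": [[...]]}.
    """
    payload = json.loads(response_text)
    return [row[0] for row in payload["data"]]


def run_cypher(query, params, url=CYPHER_URL):
    """Send the query to a running server and return the first column."""
    with urllib.request.urlopen(build_cypher_request(query, params, url)) as resp:
        return first_column(resp.read().decode("utf-8"))
```

With the server running, `run_cypher("MATCH (gv:GeneticVariant)-[]->(p:Person) WHERE gv.variantId = $var RETURN p.personId;", {"var": "22-51171497-G-A"})` would return the matching person ids as a Python list.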
More cypher queries are available here.
The server installation can scale to very large datasets as it allows configuration of the JVM size and page cache.
- Java 1.8
- Neo4j installation: download it from https://neo4j.com/download/community-edition/ and extract the archive. The extraction directory is referred to below as $NEO4J_HOME
Run the following in the checkout directory, which will generate a zip file, "graph-bundle.zip", in the target folder:
mvn clean package
Copy graph-bundle.zip to your target server and unzip it.
In the conf folder of the extracted zip above, update config.properties to reference your input data.
This step takes all the input data and builds CSV files, which are then loaded into a Neo4j database using the Neo4j ImportTool. Constraints and indexes are created afterwards. In the lib folder of the extracted zip above, run the following:
java -cp *:../conf/ com.graph.db.GraphDatabaseBuilder
cd $NEO4J_HOME/data/databases
ln -s ${output.folder}/graph-db/data/databases/graph.db graph.db
${output.folder} is defined in config.properties
Ideally you should hold as much of the data in memory as possible (See here for more information)
Set the value of dbms.memory.pagecache.size in ${NEO4J_HOME}/conf/neo4j.conf to the total size of the files ${NEO4J_HOME}/data/databases/graph.db/*store.db*
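As a rough way to obtain that number, the sketch below sums the sizes of the files matching `*store.db*` in the database directory and prints a line that could be pasted into neo4j.conf. The function names are hypothetical, and rounding up to whole mebibytes is an assumption made here for convenience, not a project recommendation.

```python
import glob
import math
import os


def store_files_bytes(db_dir):
    """Total size in bytes of the files matching *store.db* in db_dir."""
    pattern = os.path.join(db_dir, "*store.db*")
    return sum(os.path.getsize(p) for p in glob.glob(pattern) if os.path.isfile(p))


def pagecache_setting(db_dir):
    """Return a neo4j.conf line sized to the store files, rounded up to MiB."""
    mebibytes = max(1, math.ceil(store_files_bytes(db_dir) / (1024 * 1024)))
    return f"dbms.memory.pagecache.size={mebibytes}m"
```

For example, `print(pagecache_setting(os.path.join(os.environ["NEO4J_HOME"], "data/databases/graph.db")))` prints the suggested setting for the linked database.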
cd $NEO4J_HOME/bin
./neo4j start
This query touches the entire graph, so that all the data stored on disk is loaded into memory. (See here for more information) This takes up to 10 minutes for our data.
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN count(n.prop) + count(r.prop);
If you would like to connect to your instance from your application tier to handle incoming database requests, you can change the password of the Neo4j instance as follows. The port is the value of dbms.connector.http.listen_address in $NEO4J_HOME/conf/neo4j.conf.
The following command will change the password to 1:
curl -H "Content-Type: application/json" -X POST -d '{"password":"1"}' -u neo4j:neo4j http://{HOST}:{PORT}/user/neo4j/password
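If the password change needs to be scripted, for instance as part of a deployment, the same request can be built with the Python standard library. This is a sketch only: `build_password_request` is a hypothetical helper, the host and port are placeholders you must supply, and the default credentials neo4j:neo4j are taken from the curl command above.

```python
import base64
import json
import urllib.request


def build_password_request(host, port, new_password,
                           user="neo4j", current_password="neo4j"):
    """Build (but do not send) the password-change POST request."""
    url = f"http://{host}:{port}/user/{user}/password"
    # HTTP Basic authentication, equivalent to curl's -u user:password.
    token = base64.b64encode(f"{user}:{current_password}".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=json.dumps({"password": new_password}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token.decode("ascii"),
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request to the running instance.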
Examples can be found here.