RDKit-Neo4j project

Open Chemistry, RDKit & Neo4j GSoC 2019 project

Abstract

Chemical and pharmaceutical R&D produce large amounts of data of very different kinds, such as chemical structures, recipe and process data, formulation data, and data from various application tests. Taken together, these data rarely follow a schema. Consequently, relational data models and databases are often at a disadvantage when mapping these data appropriately. Chemical data frequently leads to rather abstract relational data models, which are difficult to develop, align, and maintain with the domain experts. On retrieval, computationally expensive joins of depths that are not known in advance may cause issues.

Graph data models promise advantages here:

they can easily be understood by non-IT experts from the research domains

due to their plasticity, they can easily be extended and refactored

graph databases such as Neo4j are made for coping with arbitrary path lengths

Chemical data models usually require a database that can handle chemical structures, so that structure-based queries can be used either to identify records or as filtering criteria.

The project focuses on developing an extension for the Neo4j graph database for querying knowledge graphs that store molecular and chemical information. The first task is to enable identification of entry points into the graph via exact/substructure/similarity searches (UC1). UC2 is closely related to UC1, but here the intention is to use chemical structures as limiting conditions in graph traversals originating from different entry points. Both use cases rely on the same integration of RDKit and Neo4j and differ only in their Cypher statements.

Mentors:

Greg Landrum
Christian Pilger
Stefan Armbruster

Build & run

Install lib/org.RDKit.jar and lib/org.RDKitDoc.jar into your local maven repository

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit -Dversion=1.0.0 \
                         -Dpackaging=jar

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit-doc -Dversion=1.0.0 \
                         -Dpackaging=jar

Generate a .jar file with all dependencies with mvn package
Put the generated .jar file into the plugins/ folder of your Neo4j instance and start the server
Add server.rdkit.index.sanitize=false to neo4j.conf if you want to switch off sanitizing for indexing. If not provided, true is assumed as default.
After executing CALL dbms.procedures(), you should see the org.rdkit.* procedures
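
The default rule for the sanitize setting can be illustrated with a small Python sketch. It is not part of the plugin: the function name `effective_sanitize` is hypothetical, and the handling of comments and repeated keys (last occurrence wins) is an assumption made for illustration only.

```python
def effective_sanitize(conf_text: str) -> bool:
    """Return the effective value of server.rdkit.index.sanitize.

    Mirrors the rule stated above: when the key is absent, true is assumed.
    Comment handling and "last occurrence wins" are assumptions of this sketch.
    """
    value = "true"  # default when the setting is not provided
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "server.rdkit.index.sanitize":
            value = val.strip()
    return value.lower() == "true"


if __name__ == "__main__":
    print(effective_sanitize("server.rdkit.index.sanitize=false\n"))
```

Running it against the text of your own neo4j.conf shows which behaviour to expect before starting the server.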

Usage within Docker

The native libraries of RDKit depend on libFreetype and libPng. On desktop Linux systems those are typically installed by default. The Neo4j Docker image is based on openjdk:11-jdk-slim, which itself is based on a minimal Debian Linux image that does not contain these two libraries. To solve that, you need to make sure these packages get installed.

In docker_example there's a script run_docker.sh that mounts a volume with these Debian packages and uses an extension script to install them upon startup of the Docker container. Before using it, make sure to populate the plugins folder with the plugin's jar file.

Extension functionality

User scenario:

Feeding the data into database

way A:

Plugin not present
Feed Neo4j DB
then CALL org.rdkit.update(['Chemical', 'Structure']) & CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])

That triggers computation of additional properties (fp, etc.) and fp index creation
Automated computation of properties is enabled only after the update procedure has been run

way B:

Plugin present
Feed Neo4j DB
then CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])

Automated computation of additional properties (fp, etc.) and triggered index
Fp index automatically updated when new :Structure:Chemical records arrive

way C (the most suitable)

Plugin present
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])
Then feed Knime

Automated computation of additional properties (fp, etc.) and index
Empty Neo4j instance is prepared in advance
Whenever a new :Structure:Chemical entry arrives, property calculation and fp index update are conducted automatically

Execution of exact search

You can check whether the index exists with CALL db.indexes

Exact search is considerably faster if the createIndex procedure was called earlier (it creates a property index).
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol>') (refer to tests for examples)
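
When these procedures are called from client code, passing the labels and the structure as Cypher parameters avoids quoting problems, especially with multi-line mol blocks. The Python sketch below only builds the query text and the parameter map. The helper name `exact_smiles_call` is hypothetical, and how the pair is sent to the server depends on whichever Neo4j client you use.

```python
def exact_smiles_call(labels, smiles):
    """Build a parameterized call to org.rdkit.search.exact.smiles.

    Returns the Cypher text and the parameter map separately, so the
    SMILES string never has to be escaped into the query itself.
    """
    query = "CALL org.rdkit.search.exact.smiles($labels, $smiles)"
    params = {"labels": list(labels), "smiles": smiles}
    return query, params


if __name__ == "__main__":
    query, params = exact_smiles_call(
        ["Chemical", "Structure"], "CC(=O)Nc1nnc(S(N)(=O)=O)s1"
    )
    print(query)
    print(params)
```

The same pattern applies to org.rdkit.search.exact.mol, with the mol block passed as the second parameter.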

Execution of substructure search

Make sure the fulltext index exists with CALL db.indexes; fp_index must exist. (It should be created with the createIndex procedure.)
CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', <sanitize> (true/false))
CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>', <sanitize> (true/false))
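
This README does not describe how the index on fp is used internally. A common approach with pattern fingerprints, assumed here purely for illustration, is a pre-screen: a node can only contain the query as a substructure if every bit set in the query fingerprint is also set in the node's fingerprint. The Python sketch below applies that test to the space-separated bit-index format of the fp property ("1 4 19 23"). The function names are hypothetical.

```python
def parse_fp(fp: str) -> frozenset:
    """Parse a space-separated list of set-bit indexes, e.g. "1 4 19 23"."""
    return frozenset(int(token) for token in fp.split())


def may_contain(target_fp: str, query_fp: str) -> bool:
    """Fingerprint pre-screen (assumed technique, not confirmed by the plugin).

    True when every bit set in the query is also set in the target.
    A True result only means the node is a candidate; a full substructure
    match is still needed to confirm it.
    """
    return parse_fp(query_fp) <= parse_fp(target_fp)


if __name__ == "__main__":
    print(may_contain("1 4 19 23", "4 23"))
    print(may_contain("1 4 19 23", "4 5"))
```

A screen of this kind can produce false positives but never false negatives, which is why it is typically followed by an exact substructure check.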

Execution of similarity search (currently slow)

CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'torsion_fp', 'torsion', <sanitize> (true/false)) - new property torsion_fp is created
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'torsion', 'torsion_fp', 0.4, <sanitize> (true/false))
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7, <sanitize> (true/false))
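
The numeric argument (0.4 or 0.7 above) is a lower bound on the similarity score. This README does not name the metric. The Python sketch below assumes Tanimoto similarity, a common choice for bit-vector fingerprints, and works on the space-separated bit-index format of the fp property. The function names and sample data are hypothetical.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto similarity of two fingerprints given as set-bit index strings.

    Assumed metric: shared bits divided by bits set in either fingerprint.
    """
    a = {int(token) for token in fp_a.split()}
    b = {int(token) for token in fp_b.split()}
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def similar(candidates: dict, query_fp: str, threshold: float) -> list:
    """Return (name, score) pairs with score >= threshold, best first."""
    scored = ((name, tanimoto(fp, query_fp)) for name, fp in candidates.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    nodes = {"n1": "1 4 19 23", "n2": "1 4 19 30", "n3": "7 8"}
    for name, score in similar(nodes, "1 4 19 23", 0.4):
        print(name, round(score, 2))
```

Because every candidate fingerprint is compared with the query, the cost grows linearly with the number of nodes, which fits the note that similarity search is currently slow.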

Usage of `org.rdkit.search.substructure.is.smiles` function in complex queries

CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(C)(C)OC(=O)N1CCC(COc2ccc(OCc3ccccc3)cc2)CC1') YIELD luri
MATCH (finalProduct:Entity{luri:luri})
CALL apoc.path.expand(finalProduct, "<HAS_PRODUCT,>HAS_INGREDIENT", ">Reaction", 0, 4) yield path
WITH nodes(path)[-1] as reaction, path, (length(path)+1)/2 as depths
MATCH (reaction)-[:HAS_INGREDIENT]->(c:Compound) WHERE org.rdkit.search.substructure.is.smiles(c, 'CC(C)C(O)=O')
RETURN path

CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(C)(C)OC(=O)N1CCC(COc2ccc(OCc3ccccc3)cc2)CC1') YIELD luri
MATCH (finalProduct:Entity{luri:luri})
CALL apoc.path.expand(finalProduct, "<HAS_PRODUCT,>HAS_INGREDIENT", ">Reaction", 0, 4) yield path
WITH nodes(path)[-1] AS reaction, path, (length(path)+1)/2 AS depths
MATCH (reaction)-[:HAS_INGREDIENT]->(c:Compound)
WITH path, COLLECT(c) as compounds
WHERE ANY( x IN compounds where org.rdkit.search.substructure.is.mol(x, '
  Ketcher  9 71921 82D 1   1.00000     0.00000     0
  6  5  0     0  0            999 V2000
  8.9170  -12.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  9.7830  -11.8000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  10.6490  -12.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  9.7830  -10.8000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  10.6490  -10.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  8.9170  -10.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  1  0     0  0
  2  4  1  0     0  0
  4  5  1  0     0  0
  4  6  2  0     0  0
  M  END'))
RETURN path

Usage of `org.rdkit.utils.svg` function

CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CCCC(C(=O)Nc1ccc(S(N)(=O)=O)cc1)C(C)(C)C') 
YIELD canonical_smiles 
RETURN org.rdkit.utils.svg(canonical_smiles) as svg

Node labels: [`Chemical`, `Structure`] - strict rule (!)

Whenever a new node is added with these labels, an RDKit event handler is applied and new node properties are constructed from the mdlmol property. These are also reserved property names:

canonical_smiles
inchi
formula
molecular_weight
fp - bit-vector fingerprint in form of indexes of positive bits ("1 4 19 23")
fp_ones - count of positive bits
mdlmol

Additional reserved property names:

smiles

If the graph was populated with nodes before the extension was loaded, it is possible to apply a procedure:
CALL org.rdkit.update(['Chemical', 'Structure']) - which iterates through nodes with the specified labels and creates the properties described above.
To speed up exact search, create an index on the canonical_smiles property

User-defined procedures & functions

CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol block>')
- Exact search on top of SMILES strings and mdlmol blocks; returns the node whose canonical SMILES matches
CALL org.rdkit.update(['Chemical', 'Structure'])
- Update procedure (manual properties initialization from mdlmol property)
- The current implementation uses a single thread and may take a long time (>3 minutes) on a huge database
CALL org.rdkit.search.createIndex(['Chemical', 'Structure'])
- Create fulltext index (called rdkitIndex) on property fp, which is required for substructure search
- Create index for :Chemical(canonical_smiles) property
CALL org.rdkit.search.deleteIndex()
- Delete fulltext index (called rdkitIndex) on property fp, which is required for substructure search
- Delete index for :Chemical(canonical_smiles) property
CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
- SSS based on smiles substructure
CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')
- SSS based on mdlmol block substructure
CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'morgan_fp', 'morgan')
- Create a new property called morgan_fp with fingerprint type morgan on all nodes
- Supporting properties morgan_fp_type and morgan_fp_ones are also added
- Creates fulltext index on this property
- A node is skipped if it's not possible to convert its smiles with this fingerprint type
- It is not allowed to use a property name equal to one of the predefined (reserved) names
CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)
- Call similarity search with the following parameters:
  - Node labels: ['Chemical', 'Structure']
  - Smiles: 'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
  - Fingerprint type: 'pattern'
  - Property name: 'fp'
  - Threshold: 0.7
- The SMILES value is converted into the specified fingerprint type (if possible) and compared with nodes that have the property ('fp' in this case)
- Threshold is a lower bound for the score value
- The current implementation uses a single thread and may take a long time (>3 minutes) on a huge database
User-defined functions
- org.rdkit.search.substructure.is.smiles(<node object>, '<smiles_string>')
- org.rdkit.search.substructure.is.mol(<node object>, '<mol_string>')
- Return a boolean: whether the specified node object has a substructure match for the provided smiles_string or mol_string
User-defined function org.rdkit.utils.svg('<smiles_string>')
- Return an SVG image in text format from a SMILES string

Results overview

What was achieved

Implementation of exact search (100%)
Implementation of substructure search (90%, several minor bugs)
Implementation of condition based graph traversal - usage of function calls in complex queries (100%)
Implementation of similarity search (70%, major performance issues)
Coverage with unit tests (80%, not all invalid arguments for procedures are tested)

What remains to be done

Speed up batch tasks by utilizing several threads (currently waiting for an issue at the native level to be resolved)
Speed up the similarity search procedures
Solve minor bugs (todos) such as the unclosed query object during SSS

What problems were encountered

Compatibility of native libraries for win64 (at the beginning of development)
Lazy stream evaluation and an unresolved issue with the query object during SSS
Parallelization of stream evaluations

Java requirements

The plugin supports OpenJDK and Oracle JDK Java versions below 12.
Later versions changed the policy for security-sensitive fields and are currently not supported.
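
To check a Java version string against this requirement, the following Python sketch extracts the major version from both the legacy scheme ("1.8.0_212" means Java 8) and the current one ("11.0.2"). It is only an illustration and not part of the plugin. The function names are hypothetical.

```python
import re


def java_major(version: str) -> int:
    """Extract the Java major version from strings like "1.8.0_212" or "11.0.2"."""
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError(f"no version number found in {version!r}")
    major = int(parts[0])
    if major == 1 and len(parts) > 1:
        return int(parts[1])  # legacy 1.x scheme: 1.8 means Java 8
    return major


def is_supported(version: str) -> bool:
    """Apply the rule stated above: Java versions below 12 are supported."""
    return java_major(version) < 12


if __name__ == "__main__":
    for v in ("1.8.0_212", "11.0.2", "12.0.1"):
        print(v, is_supported(v))
```

The version string itself can be taken from the output of java -version on the machine that runs Neo4j.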

rdkit/neo4j-rdkit