RDKit IPython Tools

by Axel Pahl

A set of tools to use with the Open Source Cheminformatics toolkit RDKit in the Jupyter Notebook.
Written for Python3, only tested on Linux (Ubuntu 16.04) and the conda install of the RDkit.

Module tools

A Mol_List class was introduced, which is a subclass of a Python list for holding lists of RDKit molecule objects and allows direct access to a lot of the RDKit functionality. It is meant to be used with the Jupyter Notebook and includes a.o.:

display of the Mol_List
- as HTML table, nested table or grid
display of a summary including number of records and min, max, mean, median for numeric properties
display of correlations between the Mol_List's properties (using np.corrcoef, this allows getting a quick overview on which properties correlate with each other)
methods for sorting, searching (by property or substructure) and filtering the Mol_List
methods for renaming, reordering and calculating properties
direct plotting of properties as publication-grade Highcharts or Bokeh plots with structure tooltips (!).
- the plotting functionalities reside in their own module and can also be used for plotting Pandas dataframes and Python dicts.
- further development will focus on Bokeh because of the more pythonic interface

Other functions in the tools module:

jsme: Display Peter Ertl's Javascript Molecule Editor to enter a molecule directly in the IPython notebook (how cool is that??).
The module tries to find a local version of JSME in <notebook_dir>/lib/ and when it fails to do so, loads a web version of the editor. I use a central lib/ folder and create symlinks to it in all notebook folders where I want to use these libraries

...plus many others.

Module pipeline

A Pipelining Workflow using Python Generators, mainly for RDKit and large compound sets. The use of generators allows working with arbitrarily large data sets, the memory usage at any given time is low.

Example use:

>>> from rdkit_ipynb_tools import pipeline as p
>>> s = Summary()
>>> rd = start_csv_reader(test_data_b64.csv.gz", summary=s)
>>> b64 = pipe_mol_from_b64(rd, summary=s)
>>> filt = pipe_mol_filter(b64, "[H]c2c([H])c1ncoc1c([H])c2C(N)=O", summary=s)
>>> stop_sdf_writer(filt, "test.sdf", summary=s)

or, using the pipe function:

>>> s = Summary()
>>> rd = start_sdf_reader("test.sdf", summary=s)
>>> pipe(rd,
>>>      pipe_keep_largest_fragment,
>>>      (pipe_neutralize_mol, {"summary": s}),
>>>      (pipe_keep_props, ["Ordernumber", "NP_Score"]),
>>>      (stop_csv_writer, "test.csv", {"summary": s})
>>>     )

The progress of the pipeline is displayed as a HTML table in the Notebook and can also be followed in a separate terminal with: watch -n 2 cat pipeline.log.

Currently Available Pipeline Components:

Starting	Running	Stopping
start_cache_reader	pipe_calc_props	stop_cache_writer
start_csv_reader	pipe_custom_filter	stop_count_records
start_mol_csv_reader	pipe_custom_man	stop_csv_writer
start_sdf_reader	pipe_do_nothing	stop_df_from_stream
start_stream_from_dict	pipe_has_prop_filter	stop_dict_from_stream
start_stream_from_mol_list	pipe_id_filter	stop_mol_list_from_stream
	pipe_inspect_stream	stop_sdf_writer
	pipe_join_data_from_file
	pipe_keep_largest_fragment
	pipe_keep_props
	pipe_merge_data
	pipe_mol_filter
	pipe_mol_from_b64
	pipe_mol_from_smiles
	pipe_mol_to_b64
	pipe_mol_to_smiles
	pipe_neutralize_mol
	pipe_remove_props
	pipe_rename_prop
	pipe_sim_filter
	pipe_sleep

Limitation: unlike in other pipelining tools, because of the nature of Python generators, the pipeline can not be branched.

Other Modules

Clustering

Fully usable, documentation needs to be written. Please refer to the docstrings until then.

Scaffolds

New, WIP, not usable. Has been moved to the scaffolds branch.

Tutorial

Much of the functionality is shown in the tools tutorial notebook. SAR functionality is shown in the SAR tutorial notebook. The SAR module is new and Work in Progress.

Documentation

The module documentation can be built with sphinx using the make_doc.sh script

Installation

Requirements

The recommended way to use this project is via conda.

Python 3
RDKit
Jupyter Notebook
ipywidgets

Highly recommended

cairo (via conda or pip) and cairocffi (only via pip) to get decent-looking structures
Bokeh for high-quality data plots with structure tooltips

After installing the requirements, clone this repo, then the rdkit_ipynb_tools can be used by including the project's base directory (rdkit_ipynb_tools) in Python's import path (I actually prefer this to using setuptools, because a simple git pull will get you the newest version).
This can be achieved by one of the following:

If you use conda (recommended), use conda develop. This works similar to the next option.
Put a file with the extension .pth, e.g. my_packages.pth, into one of the site-packages directories of your Python installation and put the path to the base directory of this project (rdkit_ipynb_tools) into it.
(I have the path to a dedicated folder on my machine included in such a .pth file and link all my development projects to that folder. This way, I only need to create the .pth file once.)

Tips & Tricks

Pipelines, Structures and Performance

Processing data from 200k compounds takes 10-15 sec on my notebook.

Substructure searches take longer.

For performance reasons, I use b64encode and pickle strings of mol objects to store the molecule structures in text format
(see also Greg's blog post for faster structure generation):

b64encode(pickle.dumps(mol)).decode()

For me, that has proven to be the fastest method when dealing with flat text files and is also the reason why there are pipe_mol_to_b64 and pipe_mol_from_b64 components in the pipeline module.

Working Offline

When you use a local copy of the Javascript Molecule Editor as described above and use Bokeh for plotting, you can work completely offline in your Notebook.

Roadmap

make pipelines more user-friendly
complete the scaffolds module
add functionality as needed / requested

(probably not in this order)

apahl/rdkit_ipynb_tools