Action sequence prediction for arbitrary chemical equations
This repository contains the code for Inferring Experimental Procedures from Text-Based Representations of Chemical Reactions.
- Overview
- System Requirements
- Installation Guide
- Training transformer
- Code examples for processing compound names and actions
- Evaluation and notebooks
Overview
This repository contains code for the prediction of action sequences for arbitrary chemical equations. In particular, it contains the following:
- Training and usage of a transformer-based model
- Simplification of compound names
- Validation and post-processing of action sequences
- Tokenization of compounds, temperatures, and durations
A trained model is integrated in the IBM RXN platform and can be freely used online at https://rxn.res.ibm.com.
System Requirements
Hardware requirements
The code can run on any standard computer. It is recommended to run the training scripts in a GPU-enabled environment.
Software requirements
OS Requirements
This package is supported for macOS and Linux. The package has been tested on the following systems:
- macOS: Big Sur (11.1)
- Linux: Ubuntu 18.04.4
Python
A Python version of 3.6 or greater is recommended.
The Python package dependencies are listed in requirements.txt
.
Installation guide
To use the package, we recommend creating a dedicated Conda environment:
conda create -n smiles2actions python=3.6 -y
conda activate smiles2actions
Then, the following command will install the package and its dependencies:
pip install -e .
The installation should not take more than a few minutes.
Training the transformer model
Instructions for training the transformer model are given here.
Examples
Code examples for the processing of compound names and actions are presented in the examples directory.
Simplification of compound names
A script illustrating the simplification of compound names is given here.
Output example:
Processing the name "dcm solution of 1:1 water / 30% sulfuric acid"
Checking the following simplification: dcm solution of 1:1 water / 30% sulfuric acid
Checking the following simplification: dcm, 1:1 water / 30% sulfuric acid
Checking the following simplification: dcm, water / sulfuric acid
Checking the following simplification: dcm, water, sulfuric acid
Simplified name(s): dcm, water, sulfuric acid
Replaced by synonym(s): DCM, water, H2SO4
Action validation
A script illustrating the validation of actions here. The functionality presented there is necessary to filter out undesired action sequences from the data set.
Action postprocessing
In one of the examples, we illustrate how actions are postprocessed during the data set generation, with changes such as:
- Harmonize formulation of equivalent actions (
MakeSolution
/Add
,Wait
, etc.) - Tokenization of durations, temperatures, pH values
- Removal of quantities
- etc.
Example output:
OLD: MAKESOLUTION with CHCl2 (2 ml) and water (3 ml) ; ADD SLN ; STIR for 3 hours ; PH with acetic acid to pH 9.3 ; YIELD product
NEW: ADD CHCl2 ; ADD water ; STIR for @3@ ; PH with acetic acid to pH basic ; YIELD product
Compound tokenization
The tokenization of the compounds is illustrated in another script.
Example output:
OLD: ADD ethane ; ADD methane ; ADD sodium chloride ; STIR for 8 hours ; QUENCH with brine ; YIELD propane
NEW: ADD $1$ ; ADD $2$ ; ADD $3$ ; STIR for 8 hours ; QUENCH with brine ; YIELD $-1$
Evaluation and notebooks
The IPython notebooks in this repository can be executed with jupyter lab
.
They assume the relevant data to be present in the directory given as the S2A_PAPER_DATA_DIR
environment variable.
The notebook metrics.ipynb is used to calculate the metrics presented in the paper.
Additional notebooks are included: