
Atom-in-SMILES tokenizer for SMILES strings.


License: CC BY-NC 4.0 J. Cheminformatics DOI

Atom-in-SMILES tokenization.

Ucak UV, Ashyrmamatov I, Lee J (2023) Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J Cheminformatics 15:55. https://doi.org/10.1186/s13321-023-00725-9

Tokenization is an important preprocessing step in natural language processing and can significantly influence prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities arising from the generic nature of SMILES tokens. Our results on multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed its adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.

Tutorial - Google Colab Notes


It can be installed using pip.

pip install atomInSmiles

or clone it from the GitHub repository and install locally.

git clone https://github.com/snu-lcbc/atom-in-SMILES
cd atom-in-SMILES
python setup.py install

Usage & Demo

Brief descriptions of the main functions:

| Function | Description |
| --- | --- |
| atomInSmiles.encode | Converts a SMILES string into Atom-in-SMILES tokens. |
| atomInSmiles.decode | Converts Atom-in-SMILES tokens into a SMILES string. |
| atomInSmiles.similarity | Calculates the Tanimoto coefficient of two sets of Atom-in-SMILES tokens. |

import atomInSmiles

smiles = 'NCC(=O)O'

# SMILES -> atom-in-SMILES 
ais_tokens = atomInSmiles.encode(smiles) # '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles = atomInSmiles.decode(ais_tokens) #'NCC(=O)O'

assert smiles == decoded_smiles
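
Judging from the example output above, each bracketed token appears to carry three semicolon-separated fields: the atom with its hydrogens, a ring flag (`R` or `!R`), and the neighboring atoms. The following is a minimal sketch for inspecting those fields. The `AisAtom` type and `parse_ais_token` helper are hypothetical names introduced here for illustration and are not part of the atomInSmiles package.

```python
from typing import NamedTuple, Optional


class AisAtom(NamedTuple):
    """Hypothetical container for the three fields of one bracketed token."""
    atom: str        # atom symbol with explicit hydrogens, e.g. "NH2"
    in_ring: bool    # True for "R", False for "!R"
    neighbors: str   # concatenated neighbor symbols, e.g. "COO"


def parse_ais_token(token: str) -> Optional[AisAtom]:
    """Split one bracketed token into its three fields.

    Returns None for anything that is not a three-field bracketed token,
    such as "(", ")", "=" or ring-closure digits.
    """
    if not (token.startswith("[") and token.endswith("]")):
        return None
    fields = token[1:-1].split(";")
    if len(fields) != 3:
        return None
    atom, ring, neighbors = fields
    return AisAtom(atom, ring == "R", neighbors)


# Token string copied from the encode() example above.
ais_tokens = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
for tok in ais_tokens.split():
    print(tok, "->", parse_ais_token(tok))
```

This only reads the token text; it does not validate chemistry, and the field layout is inferred from the examples in this README rather than taken from the library source.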

NOTE: By default, encode first canonicalizes the input SMILES. To obtain Atom-in-SMILES tokens in the same atom order as the input SMILES, provide the input SMILES with atom map numbers and pass with_atomMap=True.

from rdkit.Chem import MolFromSmiles, MolToSmiles
import atomInSmiles

# Preserve the atom order of the input SMILES in the atom-in-SMILES tokens.
smiles = 'NCC(=O)O'
mol = MolFromSmiles(smiles)
random_smiles = MolToSmiles(mol, doRandom=True) # e.g. 'C(C(=O)O)N'

# Map atom indices into the SMILES string as atom map numbers.
tmp = MolFromSmiles(random_smiles)
for atom in tmp.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx())
smiles_1 = MolToSmiles(tmp) # 'C([C:1](=[O:2])[OH:3])[NH2:4]'

# SMILES -> atom-in-SMILES
ais_tokens_1 = atomInSmiles.encode(smiles_1, with_atomMap=True) # '[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles_1 = atomInSmiles.decode(ais_tokens_1) # 'C(C(=O)O)N'

assert random_smiles == decoded_smiles_1
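
The two token strings shown in this section describe the same molecule in different atom orders, so a set-based comparison should treat them as identical. The sketch below computes a Tanimoto coefficient over the sets of bracketed atom tokens in plain Python. It is an illustration of the idea only, not the implementation behind atomInSmiles.similarity; restricting the sets to bracketed tokens is an assumption made here, and the acetic-acid string is hand-written for illustration rather than produced by the library.

```python
def atom_token_set(ais: str) -> set:
    """Collect the distinct bracketed atom tokens of a whitespace-separated string."""
    return {tok for tok in ais.split() if tok.startswith("[")}


def tanimoto(ais_a: str, ais_b: str) -> float:
    """Tanimoto coefficient |A & B| / |A | B| over atom-token sets."""
    a, b = atom_token_set(ais_a), atom_token_set(ais_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Both strings are copied from the examples in this README (glycine).
canonical = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
reordered = "[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]"

# Hand-written, hypothetical tokens for an acetic-acid-like molecule.
acetic_like = "[CH3;!R;C] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"

print(tanimoto(canonical, reordered))    # 1.0
print(tanimoto(canonical, acetic_like))  # 0.5
```

Because sets discard order and multiplicity, this measure cannot distinguish molecules that share the same atom environments in different counts.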

Implementations & Results

| Implementation | Items | Description |
| --- | --- | --- |
| Single-step retrosynthesis | python src/predict.py | Runs inference with the trained model |
| | --model_type | SMILES, SELFIES, DeepSmiles, SmilesPE, AIS |
| | --checkpoint_name | Name of the checkpoint file (checkpoints files) |
| | --input | Tokenized input sequence |
| Molecular property prediction | Molecular-property-prediction.ipynb | MoleculeNet: Regression (ESOL, FreeSolv, Lipo.), Classification (BBBP, BACE, HIV) |
| Normalized repetition rate | Normalized-Repetition-Rates.ipynb | Natural products, drugs, metal complexes, lipids, steroids, isomers |
| Fingerprint nature of AIS | AIS-as-fingerprint.ipynb | AIS fingerprint resolution |
| Single-token repetition (rep-l) | rep-l_USPTO50k.ipynb | USPTO-50K, retrosynthetic translations |
| Input-output equivalent mapping | GDB13-results.ipynb | Augmented subset of GDB-13, noncanon-2-canon translations |

For example, in the retrosynthesis task:

python src/predict.py --model_type AIS --checkpoint_name AIS_checkpoint.pth \
 --input='[CH3;!R;O] [O;!R;CC] [C;!R;COO] ( = [O;!R;C] ) [c;R;CCS] 1 [cH;R;CC] [c;R;CCC] ( [CH2;!R;CC] [CH2;!R;CC] [CH2;!R;CC] [c;R;CCN] 2 [cH;R;CC] [c;R;CCC] 3 [c;R;CNO] ( = [O;!R;C] ) [nH;R;CC] [c;R;NNN] ( [NH2;!R;C] ) [n;R;CC] [c;R;CNN] 3 [nH;R;CC] 2 ) [cH;R;CS] [s;R;CC] 1'
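
When the command is assembled from a script, passing the arguments as a list avoids shell-quoting problems with the brackets, semicolons, and parentheses in the token string. The sketch below only builds and prints the command; it does not run the model. The helper `build_predict_command` is a hypothetical name, the flag names and model types are taken from the table above, and the short glycine token string is a stand-in for a real tokenized product.

```python
import shlex

# Model types listed for --model_type in the table above.
SUPPORTED_MODEL_TYPES = ("SMILES", "SELFIES", "DeepSmiles", "SmilesPE", "AIS")


def build_predict_command(model_type, checkpoint_name, tokenized_input):
    """Return the argument list for src/predict.py (hypothetical helper)."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type!r}")
    return [
        "python", "src/predict.py",
        "--model_type", model_type,
        "--checkpoint_name", checkpoint_name,
        f"--input={tokenized_input}",
    ]


# Stand-in input: the glycine tokens from the usage example, not a real product.
cmd = build_predict_command(
    "AIS",
    "AIS_checkpoint.pth",
    "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]",
)

# shlex.join shows a shell-safe rendering of the same command.
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` from the repository root, assuming the checkpoint file is available where src/predict.py expects it.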

Cite this work

year = {2023}, 
title = {{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}}, 
author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}, 
journal = {Journal of Cheminformatics}, 
doi = {10.1186/s13321-023-00725-9}, 
pages = {55}, 
number = {1}, 
volume = {15}, 
keywords = {}


CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
