MolForge

License: CC BY-NC 4.0 DOI J. Cheminformatics DOI

Reconstruction of lossless molecular representations from fingerprints

Ucak UV, Ashyrmamatov I, Lee J (2023) Reconstruction of lossless molecular representations from fingerprints. J Cheminformatics 15:26. https://doi.org/10.1186/s13321-023-00693-0

The simplified molecular-input line-entry system (SMILES) is the most prevalent molecular representation used in AI-based chemical applications. However, there are innate limitations associated with the internal structure of SMILES representations. In this context, this study exploits the resolution and robustness of unique molecular representations, i.e., SMILES and SELFIES (SELF-referencIng Embedded strings), reconstructed from a set of structural fingerprints, which are proposed and used herein as vital representational tools for chemical and natural language processing (NLP) applications. This is achieved by restoring the connectivity information lost during fingerprint transformation with high accuracy. Notably, the results reveal that seemingly irreversible molecule-to-fingerprint conversion is feasible. More specifically, four structural fingerprints, extended connectivity, topological torsion, atom pairs, and atomic environments can be used as inputs and outputs of chemical NLP applications. Therefore, this comprehensive study addresses the major limitation of structural fingerprints that precludes their use in NLP models. Our findings will facilitate the development of text- or fingerprint-based chemoinformatic models for generative and translational tasks.


Code usage

Requirements

The source code has been tested on Linux operating systems. After cloning the repository, we recommend creating a new conda environment and installing the package locally. The required packages are listed in environment.yml and should be installed before use.

conda env create --name MolForge_env --file=environment.yml
conda activate MolForge_env
python -m pip install .

Prediction & Demo:

First, download and extract the checkpoint files (the top-performing models or all the other models) and place them in the ./saved_models/ directory. Then run the command below to perform inference with a trained model.

python predict.py --fp  --model_type --input --checkpoint

where:

  • --fp : Name of the fingerprint, e.g. 'ECFP4'.
  • --model_type : Target molecular representation, e.g. 'smiles' or 'selfies'.
  • --input : Space-separated bit numbers of the fingerprint given by --fp.
  • --checkpoint : Checkpoint file for the given model. If None, the downloaded checkpoint in ./saved_models/ is used.
  • --decode : Decoding algorithm, either 'greedy' or 'beam' (default: greedy).
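The flags listed above can also be assembled programmatically, for instance when the bit numbers come from another script. The following is a minimal sketch that uses only the flags documented here; the helper name build_predict_command is hypothetical and not part of MolForge, and the sketch only builds the argument list without running predict.py.

```python
import shlex


def build_predict_command(fp, model_type, bits, checkpoint=None, decode=None):
    """Assemble an argv list for predict.py from the documented flags.

    Hypothetical helper for illustration; it is not shipped with MolForge.
    """
    # --input takes the bit numbers as one space-separated string.
    input_str = " ".join(str(b) for b in bits)
    argv = [
        "python",
        "predict.py",
        f"--fp={fp}",
        f"--model_type={model_type}",
        f"--input={input_str}",
    ]
    # Optional flags are added only when the caller sets them,
    # so predict.py falls back to its own defaults otherwise.
    if checkpoint is not None:
        argv.append(f"--checkpoint={checkpoint}")
    if decode is not None:
        argv.append(f"--decode={decode}")
    return argv


cmd = build_predict_command("ECFP4", "smiles", [1, 80, 94, 114])
# shlex.join quotes the --input argument so it can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from the repository root. The bit numbers are kept in the order given, since this README does not state whether order matters.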

Example prediction:

python predict.py --fp='ECFP4' --model_type='smiles' --input='1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970'

and its sample output:

Here we go..
          fp : ECFP4
  model_type : smiles
       input : 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
  input_file : None
  checkpoint : saved_models/ECFP4_smiles_checkpoint.pth
      decode : greedy
src_vocab_size : 2052
trg_vocab_size : 109
 src_seq_len : 104
 trg_seq_len : 130
    root_dir : /home/tmp/MolForge
  fp_datadir : /home/tmp/MolForge/data/fingerprints/ECFP4
src_sp_prefix : /home/tmp/MolForge/data/sp/ECFP4_vocab_sp
trg_sp_prefix : /home/tmp/MolForge/data/sp/smiles_vocab_sp
        rank : cuda
      device : cuda

The size of src vocab is 2052 and that of trg vocab is 109.
Loading checkpoint... ECFP4 smiles
Preprocessing input sentence...
Encoding input sentence...
Greedy decoding selected.

Input: 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
Result: C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C
Inference finished! || Total inference time: 0mins 0secs
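The Result line in the log prints the prediction as space-separated tokens. A minimal sketch for turning that line back into a compact string follows, assuming the tokens only need to be concatenated (which holds for the SMILES example above); the function name parse_result_line is hypothetical.

```python
def parse_result_line(line):
    """Return the compact string from a 'Result: ...' log line, or None.

    Hypothetical helper; assumes tokens are separated by single spaces
    and that concatenating them yields the final string.
    """
    prefix = "Result:"
    if not line.startswith(prefix):
        return None
    # Multi-character tokens stay intact because we split on whitespace only.
    tokens = line[len(prefix):].split()
    return "".join(tokens)


log_line = "Result: C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C"
smiles = parse_result_line(log_line)
print(smiles)  # CCOC1=C(C=C(C=C1)C(C(C)(C)C)N)OCC
```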

Result

Each cell shows the Tanimoto exactness (%) of the selected fingerprint-to-SMILES transformation (row-wise), computed at the respective fingerprint encoding (column-wise). Consistency in the color code reflects robustness, while jumps represent the effect of selection bias. ECFP2* and ECFP4* denote the explicit-bit versions.

        MACCS   Avalon  RDK4    RDK4_L  HashAP  TT      HashTT  ECFP0   ECFP2   ECFP4   FCFP2   FCFP4   AEs     ECFP2*  ECFP4*
MACCS   77.4    33.3    38      39.8    32.2    33.2    33.2    52.2    34.7    32.5    48.6    33.5    34.7    37      33.3
Avalon  72.6    67.9    72.2    73.5    63.4    64.7    64.7    69.5    65.6    63.6    68.9    64.7    65.6    68.5    64.6
RDK4    66.9    60      90.9    91.5    59.8    61.1    61.1    62.5    60.2    58.3    62.3    59.6    60.2    64.3    59.6
RDK4_L  52.6    46.7    64.7    88.8    46.7    47.7    47.7    49.1    46.9    45.5    48.8    46.5    46.9    49.3    46.2
HashAP  86.5    83.8    89.6    90.2    85.2    85.5    85.5    84.3    83.1    82.5    84      82.8    83.1    86.1    84.1
TT      88.4    83.5    92.3    92.5    84.1    87.3    87.3    85.8    85.2    82.3    85.7    83.8    85.2    91.4    84.2
HashTT  86.2    81.4    90.2    90.5    82.1    85.3    85.5    83.9    83.3    80.4    83.8    81.8    83.3    89.2    82.2
ECFP0   3.3     1.3     2.1     2.7     1.2     1.3     1.3     4       1.4     1.2     2.9     1.3     1.4     1.8     1.4
ECFP2   86      75.8    83.1    83.1    73.6    76      76      84.7    82.7    74.4    84.5    76.5    82.7    96.2    76
ECFP4   95.1    92.6    95.7    95.7    90.8    92.4    92.4    93.5    93.1    92.1    93.3    92.4    93.1    96.6    94.8
FCFP2   25.6    16.3    20.1    21.6    15.5    16      16      28.6    16.9    15.7    38.7    20.4    16.9    19.6    16.1
FCFP4   71.5    67.5    73.7    73.8    65.5    67.3    67.3    69.2    68.5    66.3    87.6    86.7    68.5    74.4    68.1
AEs     86.7    76.2    83.5    83.6    74      76.3    76.3    85.3    83.5    74.7    85.2    76.8    83.5    97      76.5
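To make the metric concrete, here is a minimal sketch of how a Tanimoto exactness percentage could be computed from fingerprints given as lists of bit numbers. It assumes that "exactness" means the share of reference/prediction pairs whose Tanimoto similarity equals 1.0; the function names and the toy data are hypothetical and do not come from the repository or its notebooks.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as bit numbers."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


def exactness_percent(pairs):
    """Percentage of (reference, predicted) pairs with similarity exactly 1.0."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    exact = sum(1 for ref, pred in pairs if tanimoto(ref, pred) == 1.0)
    return 100.0 * exact / len(pairs)


# Toy data: two of the three pairs have identical bit sets.
pairs = [
    ([1, 80, 94], [1, 80, 94]),
    ([1, 80, 94], [1, 80, 95]),
    ([5, 7], [7, 5]),
]
print(round(exactness_percent(pairs), 1))  # 66.7
```

In practice the bit numbers for each column would come from recomputing that column's fingerprint on both the reference and the reconstructed molecule.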

For more results see the Main_Results.ipynb notebook.


Cite

@article{10.1186/s13321-023-00693-0, 
year = {2023}, 
title = {{Reconstruction of lossless molecular representations from fingerprints}}, 
author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}, 
journal = {Journal of Cheminformatics}, 
issn = {1758-2946}, 
doi = {10.1186/s13321-023-00693-0}, 
pmid = {36823647}, 
pmcid = {PMC9948316}, 
pages = {26}, 
number = {1}, 
volume = {15}
}

License

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
